# Cambrian-P: Augmenting Video Multimodal Models with Camera Pose

- Source: Saining Xie (@sainingxie)
- Published: 2026-05-27 10:12
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmpng6pc70xbesl01ty7a2dpk
- Original link: https://x.com/sainingxie/status/2059457581882470828

## AI Summary

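The tweet introduces Cambrian-P, a multimodal large language model that natively integrates camera pose. Its central claim is that camera pose is a minimal 3D signal that is easy to obtain and sufficient to support robust video understanding. By jointly modeling video frames and pose, the model turns an image sequence into a globally grounded, structured representation. The quoted tweet notes that today's multimodal large language models excel at recognizing activities in video but still struggle with the underlying spatial structure and ego/object dynamics, and that camera pose is the key missing piece for closing this gap.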
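
To make "globally grounded" concrete, here is a minimal geometric sketch in Python. It is not Cambrian-P's method, which the tweet does not describe. It only illustrates the general idea that once each frame carries a camera pose, observations made in separate camera frames can be placed in one shared world frame. The helper names (`pose`, `accumulate`, `to_world`), the yaw-only rotation, and the toy trajectory are all assumptions made for illustration.

```python
import math

def pose(yaw_deg, tx, ty, tz):
    """4x4 rigid transform: rotation about the vertical (y) axis, then translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, 0.0, s, tx],
        [0.0, 1.0, 0.0, ty],
        [-s, 0.0, c, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def to_world(cam_to_world, point_cam):
    """Map a 3D point from a camera's local frame into the world frame."""
    v = (point_cam[0], point_cam[1], point_cam[2], 1.0)
    return tuple(sum(cam_to_world[i][k] * v[k] for k in range(4)) for i in range(3))

def accumulate(relative_poses):
    """Chain frame-to-frame motions into camera-to-world poses.

    Frame 0 defines the world origin. Each relative pose maps points from
    camera i+1 coordinates into camera i coordinates.
    """
    world = [pose(0, 0, 0, 0)]
    for rel in relative_poses:
        world.append(matmul(world[-1], rel))
    return world

# Toy trajectory (assumed): step 1 m forward, then step 1 m forward and turn 90 degrees.
camera_poses = accumulate([pose(0, 0, 0, 1), pose(90, 0, 0, 1)])

# The same static object as seen from each camera's own frame.
observations = [(0, 0, 3), (0, 0, 2), (-1, 0, 0)]

# With poses, the three per-frame observations land on one world location.
world_points = [to_world(T, p) for T, p in zip(camera_poses, observations)]
for point in world_points:
    print([round(x, 6) for x in point])
```

Without the poses, the three observations look like three unrelated positions. With them, all three agree on a single location in the world frame, up to floating-point error.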

## Body

📸latest in our cambrian series: cambrian-p, p for pose.
i think pose is probably the minimal sufficient 3d signal (and it's easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.

### Quoted Tweet

> Jihan Yang: Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics...