Saining Xie @sainingxie

2026-05-27 10:12 · 37 days ago

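AI summary

The post introduces Cambrian-P, a multimodal large language model that natively integrates camera pose. Its central claim is that camera pose is an easy-to-obtain, minimal 3D signal that is sufficient to support robust video understanding. By jointly modeling video frames and pose, the model turns image sequences into a globally structured representation. The quoted post notes that today's multimodal large language models excel at recognizing activities in video but still struggle with the spatial structure of a scene and with ego/object dynamics, and that camera pose is the key missing piece for closing this gap.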

📸 latest in our cambrian series: cambrian-p, p for pose. i think pose is probably the minimal sufficient 3d signal (and it's easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.

Jihan Yang: Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics...

Multimodal papers/research
