Saining Xie (@sainingxie)

2025-11-27 04:19

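AI summary

The H* project moves past the limitation of conventional MLLMs, which process a single 2D image, by using a panoramic image as the environment, so the model can actively observe and reason in a real 360-degree space. Compared with the local visual tools of projects such as V*, H* adopts an "embodied" paradigm that gives the model a human-neck-like freedom of viewpoint. This significantly expands the action space and supports visual search and spatial reasoning in complex scenes such as subway stations and shopping malls, marking a shift from passive reception to active exploration.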

after V*, many projects tried to get MLLMs to "think with images", but a regular 2D image limits you to mostly basic tools like zooming or cropping.

to expand the action space, we need something more embodied. that is where H* from @YimingLi9702 and his team comes in. it takes a panoramic image as the environment. instead of staring at one image, the model can look around and think in 360.

it is basically giving the model a neck!

with that freedom, it can choose from many more actions and think inside real spaces like nyc train stations or shopping malls!

Yiming Li: 🤔 Visual-spatial reasoning requires a shift from a disembodied, passive paradigm to an embodied, active one: 🤖 Grounding V* in humanoid agents! 🚀 Introducing H*...

