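The H* project moves past the limitation that conventional MLLMs handle only a single 2D image: it uses a panoramic image as the environment, so the model can actively observe and reason in a real 360-degree space. Compared with the local visual tools of projects like V*, H* takes an "embodied" approach that gives the model a freedom of viewpoint similar to a human neck. This significantly expands the action space and supports visual search and spatial reasoning in complex scenes such as subway stations and shopping malls, a shift from passively receiving images to actively exploring them.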
after V*, many projects tried to get MLLMs to "think with images", but a regular 2D image mostly limits you to basic tools like zooming or cropping.
to expand the action space, we need something more embodied. that is where H* from @YimingLi9702 and his team comes in. it takes a panoramic image as the environment, so instead of staring at one fixed image, the model can look around and think in a full 360 degrees.
it is basically giving the model a neck!
with that freedom, it can choose from many more actions and think inside real spaces like NYC train stations or shopping malls!
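
to make the "neck" idea concrete, here is a rough sketch of what a look-around action space over a panorama could look like. this is not H*'s actual code: the `View` class, `turn`, and `view_center_pixel` are hypothetical names, and it assumes an equirectangular panorama where yaw 0 sits at the left edge and pitch +90 at the top row.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class View:
    """Hypothetical viewing state: where the model's 'neck' points, in degrees."""
    yaw: float = 0.0    # left/right rotation, wraps around at 360
    pitch: float = 0.0  # up/down rotation, clamped to [-90, 90]


def turn(view: View, d_yaw: float = 0.0, d_pitch: float = 0.0) -> View:
    """One 'neck' action: rotate the view. Yaw wraps, pitch is clamped."""
    yaw = (view.yaw + d_yaw) % 360.0
    pitch = max(-90.0, min(90.0, view.pitch + d_pitch))
    return View(yaw, pitch)


def view_center_pixel(view: View, pano_w: int, pano_h: int) -> Tuple[int, int]:
    """Map the viewing direction to a pixel in an equirectangular panorama.

    Assumed layout (not taken from H*): yaw 0 is the left edge and
    pitch +90 is the top row of the image.
    """
    x = int(view.yaw / 360.0 * pano_w) % pano_w
    y = min(pano_h - 1, int((90.0 - view.pitch) / 180.0 * pano_h))
    return x, y


# Example: start facing yaw 0, then look halfway around the scene.
v = turn(View(), d_yaw=180.0)
center = view_center_pixel(v, pano_w=2048, pano_h=1024)
```

the point of the sketch: zoom and crop only ever shrink what you see in a fixed frame, while `turn` moves the frame itself, so every yaw/pitch step exposes a different part of the scene.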