HUG:基于流匹配的通用人类抓取模型
阅读原文· arxiv.org研究人员提出HUG,一种基于流匹配的模型,能从单张RGB-D图像生成多样化人类抓取姿态。团队利用智能眼镜收集了1M-HUG数据集(100万帧、27.8小时、6707个物体实例)。HUG融合RGB与深度观测,输出手腕平移、手腕旋转和MANO手部姿态,并可重定向至多种机器人手,实现零样本抓取。为标准化评估构建了HUG-Bench,含90个未见过物体(5种几何类别)。在30物体真实测试集上,HUG比SOTA基线高出23%和34%。代码、数据、基准、模型检查点和交互演示已发布。
Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/