Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures
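Original article: arxiv.org. The study proposes "semantic motion anchors" to address insufficient semantic understanding in co-speech gesture generation and retrieval. The method discretizes 3D gestures into body-hand motion primitives and converts them into structured natural-language descriptions, which are grounded in the spoken transcript and serve as an auxiliary supervision signal. On the BEAT2 dataset, it improves text-to-gesture retrieval R@1 by 8.2% and outperforms existing methods. In a user study on retrieval-augmented gesture generation, gestures retrieved by the method conveyed communicative intent markedly better than the baseline.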
Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors: natural-language abstractions of gesture motion that capture both physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. In a downstream retrieval-augmented gesture generation study, users significantly preferred gestures retrieved by our approach over those from a retrieval-augmented generation baseline, indicating that semantically grounded retrieval yields gestures that better convey communicative intent in downstream generation.
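To make the training signal and the reported metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes the main objective is a symmetric InfoNCE loss between transcript and motion embeddings, and that the auxiliary term pairs embeddings of the verbalized anchor descriptions with the same motion embeddings. The abstract does not specify the loss form or pairing, so the function names `info_nce`, `anchored_loss`, and `recall_at_k` and the parameters `temperature` and `aux_weight` are all hypothetical. R@K is computed in the standard way, as the fraction of queries whose matching item appears among the top K retrieved.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch where a[i] and b[i] form a matched pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax per row; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def anchored_loss(text_emb, motion_emb, anchor_emb, aux_weight=0.5):
    """Main text-motion loss plus an auxiliary anchor-motion loss.

    Assumption: the auxiliary supervision aligns anchor-description embeddings
    with motion embeddings; aux_weight is a hypothetical hyperparameter.
    """
    main = info_nce(text_emb, motion_emb)
    auxiliary = info_nce(anchor_emb, motion_emb)
    return main + aux_weight * auxiliary


def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item (same index) is in the top k."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(query_emb))[:, None]
    return float(np.mean((topk == targets).any(axis=1)))
```

For text-to-gesture retrieval, `recall_at_k(text_emb, motion_emb, k=1)` would give R@1 with transcripts as queries; swapping the arguments gives the gesture-to-text direction. Setting `aux_weight=0.0` reduces `anchored_loss` to the direct text-motion baseline objective under these assumptions.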