# RoboSemanticBench： 诊断VLA模型动作预测中的语义对齐

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-01 08:00
- AIHOT 分数：61
- AIHOT 链接：https://aihot.virxact.com/items/cmpw3ayof03zesluktn6xx7ze
- 原文链接：https://arxiv.org/abs/2606.02277

## AI 摘要

本文提出了RoboSemanticBench，一个用于诊断视觉-语言-动作模型在动作预测中是否具备语义对齐能力的具身基准测试。在该测试中，机器人需要解决多选题，并根据语义理解抓取对应正确答案的方块。测试覆盖了算术、数学理解和常识理解等多种任务。评估发现，在控制抓取成功率后，许多模型选择语义正确方块的能力接近或低于随机水平，揭示了模型骨干网络的语义能力与最终动作预测之间存在持续差距。

## 正文

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.