# LARY: A Latent Action Representation Benchmark for Generalizable Vision-Action Alignment

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-04-13 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnzj0j7t03oksl0fh8vo81ho
- Original link: https://arxiv.org/abs/2604.11689

## AI Summary

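The research team has released the LARY benchmark, which evaluates latent action representations in a unified way on both high-level semantic actions and low-level robotic control. The benchmark combines more than one million videos (1,000 hours) covering 151 action categories with 620K image pairs and 595K motion trajectories. Experiments show that general visual foundation models trained without action supervision consistently outperform specialized embodied latent action models, and that latent visual space aligns better with physical action space than pixel space does. The results suggest that general visual representations already encode the action knowledge needed for physical control, and that semantic-level abstraction is a more effective path from vision to action than pixel-level reconstruction.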

## Main Text

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned with physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
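
To make the idea of probing a latent action representation concrete, here is a minimal sketch in plain Python. It is not the LARY evaluation protocol, which the abstract does not describe. It assumes, purely for illustration, that a latent action is the difference between frozen-encoder features of two frames, and it scores that representation with a nearest-centroid probe over action labels. The feature vectors, the action labels (`push`, `lift`, `rotate`), and every function name are hypothetical, and the data is synthetic.

```python
import random


def latent_action(feat_t, feat_next):
    """Assumed latent action: feature difference between two frames."""
    return [b - a for a, b in zip(feat_t, feat_next)]


def fit_centroids(latents, labels):
    """Average the latent actions of each class to get one centroid per label."""
    sums, counts = {}, {}
    for z, y in zip(latents, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, z):
    """Return the label whose centroid is closest to z (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(z, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


def probe_accuracy(train, test):
    """Fit the probe on the train split and report accuracy on the test split."""
    centroids = fit_centroids(*train)
    latents, labels = test
    hits = sum(predict(centroids, z) == y for z, y in zip(latents, labels))
    return hits / len(labels)


def make_synthetic_split(n_per_class, rng, dim=8, noise=0.1):
    """Synthetic stand-in for encoder features; this is NOT LARY data."""
    directions = {"push": 0, "lift": 1, "rotate": 2}  # hypothetical labels
    latents, labels = [], []
    for name, axis in directions.items():
        for _ in range(n_per_class):
            feat_t = [rng.gauss(0, 1) for _ in range(dim)]
            feat_next = [v + rng.gauss(0, noise) for v in feat_t]
            feat_next[axis] += 1.0  # each action shifts one feature axis
            latents.append(latent_action(feat_t, feat_next))
            labels.append(name)
    return latents, labels


rng = random.Random(0)
train = make_synthetic_split(50, rng)
test = make_synthetic_split(20, rng)
accuracy = probe_accuracy(train, test)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe is kept deliberately simple so that the score reflects how well the frozen representation separates actions rather than the capacity of the classifier. A low-level control probe could follow the same pattern with a regressor onto trajectory targets in place of the centroid classifier, which is again an assumption rather than something the abstract specifies.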