# EBench: A Fine-Grained Diagnostic Benchmark for Generalist Mobile Manipulation Policies

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-20 08:00
- AIHOT score: 45
- AIHOT link: https://aihot.virxact.com/items/cmqszl6hy07r3slfu7pvzbhsn
- Original paper: https://arxiv.org/abs/2606.18239

## AI Summary

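EBench is a simulation benchmark for fine-grained diagnosis of the capabilities of generalist mobile manipulation policies, rather than evaluation by a single success rate. It contains 26 diverse tasks annotated along 5 capability dimensions and 4 generalization dimensions. It evaluates current state-of-the-art generalist manipulation models, including π₀, π₀.₅, XVLA, and InternVLA-A1, and finds that models with similar success rates have markedly different capability profiles: π₀.₅ has the highest test success rate and the best train-test retention; InternVLA-A1 leads on mobile manipulation tasks but collapses on dexterous tasks; XVLA complements the other policies on atomic skills. EBench also analyzes generalization from 4 representative perspectives, revealing the impact of different distribution-shift factors.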

## Main Text

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π₀, π₀.₅, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: π₀.₅ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal model strengths and weaknesses that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
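To make the idea of a capability profile concrete, the following is a minimal sketch of how per-task results might be aggregated by capability tag instead of being collapsed into one number. It is not EBench's actual API or scoring code: the task names, the tags (`atomic`, `mobile`, `dexterous`), the numbers, and the definition of retention as mean test success divided by mean train success are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical per-task results: (task name, capability tags, train success, test success).
# Every name and number here is invented for illustration, not taken from the paper.
RESULTS = [
    ("open_drawer", {"atomic"}, 0.9, 0.8),
    ("fetch_cup", {"mobile", "atomic"}, 0.8, 0.6),
    ("fold_cloth", {"dexterous"}, 0.5, 0.2),
]


def overall_success(results):
    """Single scalar: mean test success rate over all tasks."""
    return sum(test for _name, _tags, _train, test in results) / len(results)


def capability_profile(results):
    """Mean test success rate per capability tag.

    A task carrying several tags contributes to each of them.
    """
    buckets = defaultdict(list)
    for _name, tags, _train, test in results:
        for tag in tags:
            buckets[tag].append(test)
    return {tag: sum(vals) / len(vals) for tag, vals in buckets.items()}


def retention(results):
    """Mean test success divided by mean train success (an assumed definition)."""
    train = sum(r[2] for r in results) / len(results)
    test = sum(r[3] for r in results) / len(results)
    return test / train if train > 0 else float("nan")


if __name__ == "__main__":
    print("overall:", overall_success(RESULTS))
    print("profile:", capability_profile(RESULTS))
    print("retention:", retention(RESULTS))
```

Two policies can share the same `overall_success` value while their `capability_profile` dictionaries differ sharply, which is the kind of contrast the abstract describes.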