# RoboLab: A High-Fidelity Simulation Benchmark for Analyzing Task-Generalist Policies

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo7cv3t200gvslml4weybtba
- Original link: https://arxiv.org/abs/2604.09860

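## AI Summary

The RoboLab team introduces RoboLab, a high-fidelity simulation benchmarking framework, together with the RoboLab-120 test set. Both target the performance saturation and weak generalization testing that existing benchmarks suffer from when training and evaluation domains overlap. The benchmark comprises 120 tasks spanning three competency axes (visual, procedural, relational) at three difficulty levels, and supports both human-authored and LLM-generated scenes. By quantifying the performance and sensitivity of real-world policies under controlled perturbations, RoboLab indicates that high-fidelity simulation can serve as a proxy for real-world performance, and it exposes a significant performance gap in current state-of-the-art models.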

## Main Text

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, which trivializes success rates and obscures insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, the framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which external factors most strongly affect that behavior under controlled perturbations. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a physically realistic and photorealistic simulation. With this, we propose the RoboLab-120 benchmark, consisting of 120 tasks categorized along three competency axes (visual, procedural, and relational) and across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for analyzing performance and its dependence on external factors. Evaluation with RoboLab exposes a significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a framework for evaluating the true generalization capabilities of task-generalist robotic policies.
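
To make the two analysis questions concrete, the following is a minimal sketch of how per-axis success rates and perturbation sensitivity could be tabulated from evaluation episodes. It is not RoboLab's API or its actual metric definitions. The `Episode` record, the task and perturbation names, and the definition of sensitivity as the drop in success rate relative to unperturbed runs are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluation rollout (hypothetical record, not a RoboLab type)."""
    task: str
    axis: str          # assumed labels: "visual", "procedural", "relational"
    difficulty: int    # assumed encoding: 1 (easiest) to 3 (hardest)
    perturbation: str  # "none" marks an unperturbed (nominal) run
    success: bool


def success_rate(episodes):
    """Fraction of successful episodes; NaN if the group is empty."""
    episodes = list(episodes)
    if not episodes:
        return float("nan")
    return sum(e.success for e in episodes) / len(episodes)


def rate_by(episodes, key):
    """Group episodes by key(e) and return the success rate of each group."""
    groups = defaultdict(list)
    for e in episodes:
        groups[key(e)].append(e)
    return {k: success_rate(v) for k, v in groups.items()}


def sensitivity(episodes, baseline="none"):
    """Assumed sensitivity measure: nominal success rate minus the
    success rate under each perturbation (larger means more sensitive)."""
    rates = rate_by(episodes, lambda e: e.perturbation)
    base = rates[baseline]
    return {p: base - r for p, r in rates.items() if p != baseline}


# Made-up episodes purely for illustration; not results from the paper.
episodes = [
    Episode("pick_mug", "visual", 1, "none", True),
    Episode("pick_mug", "visual", 1, "none", True),
    Episode("pick_mug", "visual", 1, "lighting", True),
    Episode("pick_mug", "visual", 1, "lighting", False),
    Episode("stack_blocks", "procedural", 2, "none", True),
    Episode("stack_blocks", "procedural", 2, "none", False),
]

print(rate_by(episodes, lambda e: e.axis))
print(sensitivity(episodes))
```

Grouping by a key function keeps the same helper usable for any slice (axis, difficulty, task, or perturbation), which is one simple way to obtain the kind of granular breakdown the abstract describes.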
