# AutoLab: Can Frontier Models Solve Long-Horizon Automated Research and Engineering Tasks?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 68
- AIHOT link: https://aihot.virxact.com/items/cmpyy97hn04hbsli3hs8q8o6x
- Original link: https://arxiv.org/abs/2606.05080

## AI Summary

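AutoLab is a benchmark for evaluating ultra-long-horizon closed-loop optimization. It comprises 36 expert-designed, realistic tasks across four domains: system optimization, puzzles and challenges, model development, and CUDA kernel optimization. Each task starts from a correct but deliberately suboptimal baseline and requires the agent to improve it iteratively within a strict time budget. Tests on 17 state-of-the-art models show that the key to success is persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. claude-opus-4.6 shows strong long-horizon optimization ability, but most frontier models either terminate prematurely or make little progress within the budget. The benchmark, evaluation tooling, and task artifacts are all open source.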

## Main Text

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, and so fail to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
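The closed-loop pattern the abstract describes (start from a working baseline, then repeatedly edit, benchmark, and keep what helps until a wall-clock budget expires) can be illustrated with a minimal Python sketch. This is not AutoLab's evaluation harness: the function name `optimize_within_budget`, the `propose` and `benchmark` callables, and the convention that a lower score is better are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def optimize_within_budget(
    baseline: float,
    propose: Callable[[float], float],
    benchmark: Callable[[float], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, int]:
    """Hypothetical edit/benchmark loop bounded by a wall-clock budget.

    Assumes a lower benchmark score is better (e.g. a runtime).
    Returns the best candidate, its score, and the iteration count.
    """
    deadline = clock() + budget_seconds
    # The baseline is correct but suboptimal, so it is a valid fallback.
    best, best_score = baseline, benchmark(baseline)
    iterations = 0
    while clock() < deadline:
        candidate = propose(best)      # edit the current best artifact
        score = benchmark(candidate)   # measure the edited artifact
        if score < best_score:         # keep only measured improvements
            best, best_score = candidate, score
        iterations += 1
    return best, best_score, iterations
```

The loop checks the clock before each iteration, so it neither stops early while budget remains nor discards an earlier improvement when a later edit regresses. The `clock` parameter is injectable only so the loop can be exercised deterministically; a real run would use the default monotonic clock.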
