Curation-Bench: Can Generalist Agents Automate Data Curation?

2026-06-02 08:00 · 31 days ago

AI Summary

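Curation-Bench is an agent-centric benchmark that fixes the model, training recipe, and evaluation suite, and gives agents command-line access to inspect data, implement policies, and submit them to a training/evaluation pipeline for iteration. In a vision-language instruction-tuning setting, out-of-the-box agents reach strong data-selection baselines within ten iterations. Trajectory analysis, however, reveals an execution-research gap: agents mainly tune local policy variants rather than explore new policy families. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method steer agents toward method-guided exploration. The scaffolded agent autonomously composes a data-selection policy that outperforms strong baselines at one-tenth of their data budget. Code and benchmark are open-sourced.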

Original text · untranslated

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
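The loop described in the abstract (propose a policy, submit it to a fixed pipeline, revise against feedback) can be illustrated with a small sketch. This is not the benchmark's actual interface: `Proposal`, `scaffold_ok`, `toy_pipeline`, and `run_curation_loop` are hypothetical names, the `quality` field is invented for the example, and the toy scoring function merely stands in for real training and evaluation. The scaffold rule shown here (reject proposals that cite no prior method) is one assumed reading of "cite, instantiate, and adapt a prior method".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Proposal:
    """One iteration's data policy (hypothetical record, not from the paper)."""
    name: str
    select: Callable[[Sequence[dict]], List[dict]]  # pool -> ordered candidates
    cited_method: Optional[str] = None  # prior method this proposal adapts


def scaffold_ok(p: Proposal) -> bool:
    """Assumed scaffold rule: a proposal must name the prior method it adapts."""
    return bool(p.cited_method)


def toy_pipeline(subset: Sequence[dict]) -> float:
    """Stand-in for the fixed train/eval pipeline: mean 'quality' of the subset."""
    if not subset:
        return 0.0
    return sum(ex["quality"] for ex in subset) / len(subset)


def run_curation_loop(
    pool: Sequence[dict],
    proposals: Sequence[Proposal],
    pipeline: Callable[[Sequence[dict]], float],
    budget: int,
    max_iters: int = 10,
    require_citation: bool = True,
) -> Tuple[Optional[Tuple[str, float]], List[Tuple[str, float]]]:
    """Score each admissible proposal under a fixed data budget; return best and history."""
    history: List[Tuple[str, float]] = []
    for p in list(proposals)[:max_iters]:
        if require_citation and not scaffold_ok(p):
            continue  # scaffold rejects proposals with no cited prior method
        subset = p.select(pool)[:budget]  # enforce the data budget
        history.append((p.name, pipeline(subset)))
    best = max(history, key=lambda h: h[1]) if history else None
    return best, history


# Toy data pool: ten examples with an invented quality score.
pool = [{"id": i, "quality": i / 10} for i in range(10)]

proposals = [
    Proposal("first_k", lambda d: list(d), cited_method="random/first-k baseline"),
    Proposal(
        "top_quality",
        lambda d: sorted(d, key=lambda ex: ex["quality"], reverse=True),
        cited_method="score-based filtering",
    ),
    Proposal("uncited_tweak", lambda d: list(reversed(d))),  # skipped by the scaffold
]

best, history = run_curation_loop(pool, proposals, toy_pipeline, budget=3)
print(best, history)
```

Setting `require_citation=False` lets the uncited proposal through, which gives a simple way to compare scaffolded and unscaffolded runs in this toy setting.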

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org