Autodata: Letting AI Agents Act as Data Scientists to Automatically Build High-Quality Synthetic Data

2026-06-24 08:00 · 9 days ago

AI Summary

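Autodata is a general method that lets AI agents act as data scientists and autonomously build high-quality training and evaluation data. The data scientist agent can itself be meta-optimized so that it learns to create stronger data. The paper describes the overall formulation and one practical implementation, Agentic Self-Instruct. In experiments on computer science research tasks, legal reasoning tasks, and reasoning with mathematical objects, Autodata yields better results than classical synthetic dataset creation methods, and meta-optimizing the agent delivers an even larger performance uplift. By converting additional inference compute into higher-quality training data, this direction has the potential to change how AI data is built.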

Original text · untranslated

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

HuggingFace Daily Papers (trending community papers)

Read the original · arxiv.org
