# Autodata: Letting AI Agents Act as Data Scientists to Automatically Build High-Quality Synthetic Data

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-24 08:00
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmqsxg0qd075zslfujlwe6xd4
- Original link: https://arxiv.org/abs/2606.25996

## AI Summary

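Autodata is a general method in which AI agents take on the role of data scientists and autonomously build high-quality training and evaluation data. The method supports meta-optimizing the data scientist agent so that it learns to generate better data; the concrete implementation is Agentic Self-Instruct. Experiments on computer science research tasks, legal reasoning, and reasoning with mathematical objects show that the synthetic datasets generated by Autodata are of higher quality than those produced by classical methods, and that meta-optimizing the agent yields an even larger performance gain. By converting inference compute into higher-quality training data, this direction has the potential to change how AI data is built.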

## Main Text

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high-quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks, and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher-quality model training. Overall, we believe this direction has the potential to change the way we build AI data.