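Meta proposes Autodata, which treats synthetic data generation as a task for an agentic data scientist. Its core method, "Agentic Self-Instruct," has AI agents generate and meta-optimize synthetic training and evaluation data. The loop runs as follows: generate an example, let a weak model and a strong model each attempt it, judge the results, then revise the recipe until the example sits in the useful zone. The paper stresses that difficulty is not a virtue in itself; examples should target what the weak model can learn. Key result: on legal tasks, a trained 4B model outperformed a much larger 397B baseline.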
A very important Meta paper introduces Autodata, an agentic data scientist that creates high-quality synthetic data.
The main result is that agent-made data usually trained models better than standard synthetic data, and in legal tasks a trained 4B model beat a much larger 397B baseline.
It treats synthetic data generation as a job for an agentic data scientist, not a prompt template.
Its method, "Agentic Self-Instruct," has AI agents generate and meta-optimize synthetic training and evaluation data, improving performance over classical synthetic data methods across CS, legal, and math benchmarks.
Autodata's loop is simple: generate an example, let a weak model and a strong model try it, judge the results, then revise the recipe until the example sits in the useful zone.
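The loop can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `Recipe`, `generate_example`, `make_solver`, `in_useful_zone`, and `revise` are all hypothetical names, the recipe is reduced to a single integer difficulty knob, and the "useful zone" is assumed here to mean "the strong model succeeds and the weak model fails."

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Recipe:
    difficulty: int  # assumption: the whole recipe is one difficulty knob

Example = dict
Solver = Callable[[Example], bool]

def generate_example(recipe: Recipe) -> Example:
    # Stand-in for the agent writing an example from the current recipe.
    return {"difficulty": recipe.difficulty}

def make_solver(skill: int) -> Solver:
    # Toy model: it solves any example at or below its skill level.
    def solve(example: Example) -> bool:
        return example["difficulty"] <= skill
    return solve

def in_useful_zone(weak_ok: bool, strong_ok: bool) -> bool:
    # Assumed definition: solvable by the strong model, not yet by the weak one.
    return strong_ok and not weak_ok

def revise(recipe: Recipe, weak_ok: bool, strong_ok: bool) -> Recipe:
    if not strong_ok:
        return Recipe(recipe.difficulty - 1)  # too hard even for the strong model
    if weak_ok:
        return Recipe(recipe.difficulty + 1)  # too easy: the weak model already solves it
    return recipe

def autodata_loop(
    recipe: Recipe, weak: Solver, strong: Solver, max_rounds: int = 10
) -> Tuple[Optional[Example], Recipe]:
    for _ in range(max_rounds):
        example = generate_example(recipe)                  # 1. generate
        weak_ok, strong_ok = weak(example), strong(example)  # 2. both models try it
        if in_useful_zone(weak_ok, strong_ok):              # 3. judge
            return example, recipe
        recipe = revise(recipe, weak_ok, strong_ok)         # 4. revise the recipe
    return None, recipe  # no example landed in the zone within the budget

weak, strong = make_solver(3), make_solver(7)
example, final_recipe = autodata_loop(Recipe(difficulty=1), weak, strong)
```

In this sketch the loop pushes difficulty up when the weak model already succeeds and pulls it down when even the strong model fails, so an example that is merely hard is rejected just like one that is too easy.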