AC-ODM: Actor-Critic Online Data Mixing for Efficient LLM Pretraining
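Source: arxiv.org. AC-ODM approaches data mixing from a reinforcement learning perspective, using a parameterized policy to mix data dynamically; the policy is theoretically proven to act as a linear surrogate that maximizes the constructive interference of gradients. It supports a proxy mode (a policy learned on a small model is transferred to a larger model) and a non-proxy mode (end-to-end training without priors). On Pythia-1B, it reaches optimal validation perplexity with up to 66% fewer training steps than the baselines, improves MMLU accuracy by 27.5% in relative terms, and achieves a 2.23× higher pass@1 on HumanEval, while adding only 0.4% to per-step time and 2% to memory overhead. The code is open source.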
Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective with a parameterized policy that we theoretically prove to act as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, all while incurring a virtually negligible (0.4%) per-step wall-clock increase and only 2% additional memory overhead. Code is available at https://github.com/DANG-ai/AC-ODM.
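To make the idea of a learned mixing policy concrete, here is a toy sketch in Python. It is not the AC-ODM algorithm, whose details the abstract does not give. It assumes a softmax policy over data domains (the actor), a scalar running-average baseline standing in for the critic, and a reward equal to the dot product between a sampled domain gradient and a fixed target gradient, as a crude stand-in for "constructive interference of gradients". The names `run_toy_mixing`, `domain_grads`, and `target_grad`, and all hyperparameters, are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def run_toy_mixing(domain_grads, target_grad, steps=2000,
                   actor_lr=0.1, critic_lr=0.05, seed=0):
    """Toy actor-critic style bandit over data domains (illustrative only).

    domain_grads: one mean gradient vector per domain (assumed, synthetic).
    target_grad:  gradient direction we want sampled data to align with.
    Returns the final mixing probabilities over domains.
    """
    rng = random.Random(seed)
    k = len(domain_grads)
    logits = [0.0] * k   # actor parameters: one logit per domain
    baseline = 0.0       # critic stand-in: running estimate of expected reward

    for _ in range(steps):
        probs = softmax(logits)
        # Sample which domain the next batch is drawn from.
        i = rng.choices(range(k), weights=probs)[0]
        # Simulate a noisy minibatch gradient for that domain.
        g = [x + rng.gauss(0.0, 0.1) for x in domain_grads[i]]
        # Assumed reward: alignment between batch gradient and target gradient.
        reward = dot(g, target_grad)
        advantage = reward - baseline
        # Critic update: move the baseline toward the observed reward.
        baseline += critic_lr * advantage
        # Actor update: policy-gradient step on the softmax logits,
        # using d log pi(i) / d logit_j = 1[j == i] - probs[j].
        for j in range(k):
            indicator = 1.0 if j == i else 0.0
            logits[j] += actor_lr * advantage * (indicator - probs[j])

    return softmax(logits)
```

Calling `run_toy_mixing([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [1.0, 0.0])` describes three synthetic domains whose gradients are aligned with, orthogonal to, and opposed to the target direction. In this toy setup the policy shifts probability mass toward the aligned domain. A real system would replace the synthetic gradients with signals from the model being trained and would likely use a learned critic instead of a scalar baseline.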