# An Analysis of Training-Time Data Augmentation for Data-Constrained Language Model Pretraining

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-19 08:00
- AIHOT score: 40
- AIHOT link: https://aihot.virxact.com/items/cmqqrfwu40cpxslp5sifcy6ld
- Original link: https://arxiv.org/abs/2606.16246

## AI Summary

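Standard autoregressive pretraining overfits severely after many epochs in data-constrained, compute-abundant settings. To address this, the study introduces three orthogonal categories of training-time data augmentation: token-level noise (masking, random replacement), sequence permutation (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (predicting x_{t+i}, i > 1). Ablations show that each augmentation on its own delays overfitting and lowers validation loss, with random replacement performing best; combining multiple augmentations lowers the minimum validation loss further. The approach mitigates the data inefficiency of autoregressive pretraining when training repeatedly on a fixed corpus. Code and data are open source.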
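
To make the token-level noise category concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `add_token_noise`, the parameters `p_mask` and `p_replace`, the choice of `mask_id`, and the decision to corrupt only the inputs while keeping clean targets are all assumptions made for illustration.

```python
import random


def add_token_noise(tokens, vocab_size, mask_id, p_mask=0.0, p_replace=0.0, rng=None):
    """Return a noised copy of `tokens` (a list of integer token ids).

    Each position is independently replaced by `mask_id` with probability
    `p_mask`, or by a uniformly random token id with probability `p_replace`.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    rng = rng or random.Random()
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < p_mask:
            noised.append(mask_id)  # masking
        elif r < p_mask + p_replace:
            # Random replacement; the draw may coincide with the original id.
            noised.append(rng.randrange(vocab_size))
        else:
            noised.append(tok)  # keep the token unchanged
    return noised


# Usage: fresh noise is drawn on every call, so each epoch over the same
# corpus sees a different corrupted view of the same sequence.
tokens = [5, 17, 42, 8, 99, 3]
noisy_inputs = add_token_noise(
    tokens, vocab_size=100, mask_id=0, p_replace=0.15, rng=random.Random(0)
)
targets = tokens  # assumption: the loss is still computed against clean tokens
```

Because the corruption is resampled at training time rather than baked into the dataset, repeated epochs never present exactly the same input, which is one plausible reading of how such noise acts as a regularizer.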

## Main Text

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.
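
The sequence-permutation and target-offset categories can be sketched in the same spirit. The code below is an illustrative assumption, not the paper's code: the helper names, the sentinel ids `pre_id`, `suf_id`, and `mid_id`, and the prefix-suffix-middle layout for Fill-in-the-Middle are all hypothetical choices, and the paper's exact formatting may differ.

```python
import random


def right_to_left(tokens):
    """Reverse the sequence so next-token prediction runs right to left."""
    return tokens[::-1]


def fill_in_the_middle(tokens, pre_id, suf_id, mid_id, rng=None):
    """Split at two random cut points and reorder the pieces as
    <PRE> prefix <SUF> suffix <MID> middle.

    The sentinel ids are hypothetical special tokens assumed to sit
    outside the normal vocabulary.
    """
    rng = rng or random.Random()
    i, j = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle


def offset_pairs(tokens, offset=1):
    """Pair each input x_t with the target x_{t+offset}.

    offset=1 is standard next-token prediction; offset > 1 corresponds
    to the target offset augmentation (x_{t+i} for i > 1).
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return list(zip(tokens[:-offset], tokens[offset:]))


# Usage on a toy sequence of token ids.
seq = [1, 2, 3, 4, 5]
reversed_seq = right_to_left(seq)
fim_seq = fill_in_the_middle(seq, pre_id=100, suf_id=101, mid_id=102,
                             rng=random.Random(0))
pairs = offset_pairs(seq, offset=2)
```

In this sketch, each transform leaves the underlying tokens intact and changes only their order or the position being predicted. That is consistent with describing the categories as orthogonal to token-level noise, although how the authors actually compose them is not specified here.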
