GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

2026-04-15 08:00 · 79 days ago

AI Summary

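Post-training of large language models relies on supervised fine-tuning (SFT) and reinforcement learning (RL), but it remains hard to unify efficient knowledge injection with robust generalization. To address this, the researchers propose the Group Fine-Tuning (GFT) framework. A training-dynamics analysis shows that SFT is in effect policy-gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which tends to cause single-path dependency and gradient explosion. GFT introduces Group Advantage Learning, which builds diverse response groups to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds the inverse-probability weights to stabilize optimization. Experiments show that GFT consistently outperforms SFT-based methods and connects more smoothly with subsequent RL training.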

Original abstract · not translated

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
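The abstract names two mechanisms without giving their formulas. As a rough illustration only, the sketch below assumes a GRPO-style within-group standardization for the "normalized contrastive supervision" and a simple cap for the "bounded inverse-probability weights". The function names, the `eps` and `max_weight` parameters, and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one response group (assumed GRPO-style form)."""
    r = np.asarray(rewards, dtype=float)
    # Subtract the group mean and divide by the group std, so responses are
    # scored relative to their siblings instead of by a single sparse reward.
    return (r - r.mean()) / (r.std() + eps)

def bounded_inverse_weight(prob, max_weight=10.0):
    """Inverse-probability weight 1/p, capped at max_weight (hypothetical cap)."""
    p = np.asarray(prob, dtype=float)
    # Without the cap, 1/p grows without bound as p approaches 0, which is the
    # instability the abstract attributes to plain SFT.
    return np.minimum(1.0 / p, max_weight)

# One prompt, four sampled responses with illustrative rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# Probabilities of three responses under the current policy (illustrative).
probs = [0.5, 0.01, 1e-6]
weights = bounded_inverse_weight(probs)
```

In this toy setup the advantages sum to zero within the group, and the weight for the very unlikely response is capped rather than reaching 1e6. The paper's Dynamic Coefficient Rectification is described as adaptive, so a fixed cap like this one should be read as a stand-in, not as the authors' rule.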

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org