A Single Transformer Layer Can Match Full-Parameter Reinforcement Learning Training: A Study of Qwen3, Qwen2.5, and Other Models
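The study finds that training a single Transformer layer can recover most of the gains from full-parameter reinforcement learning (RL) post-training, and in some cases surpass them. It introduces a "layer contribution" metric and evaluates seven models from two model families (Qwen3 and Qwen2.5) with three RL algorithms (GRPO, GiGPO, Dr. GRPO) on mathematical reasoning, code generation, and agentic decision-making tasks. RL gains are highly concentrated in a small number of Transformer layers: high-contribution layers cluster in the middle of the stack, while layers near either end contribute substantially less.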
Title: Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
Abstract: Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
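The abstract defines layer contribution only in words, as the fraction of the full RL improvement recovered when one layer is trained in isolation. A minimal sketch of one way to read that definition follows. The ratio formula, the function names, and all scores are assumptions made for illustration, not the paper's code or results. The abstract also does not say which correlation measure is used to compare layer rankings; Spearman rank correlation (without tie handling) is swapped in here as a plausible choice.

```python
from typing import Dict, List, Sequence


def layer_contribution(base_score: float, full_rl_score: float,
                       single_layer_score: float) -> float:
    """Fraction of the full-RL gain recovered by training one layer alone.

    Assumed reading of the abstract's verbal definition:
        (single_layer - base) / (full_rl - base)
    A value above 1.0 means the single layer surpassed full-parameter RL.
    """
    full_gain = full_rl_score - base_score
    if full_gain == 0:
        raise ValueError("full RL gain is zero, so the ratio is undefined")
    return (single_layer_score - base_score) / full_gain


def rank_layers(contributions: Dict[int, float]) -> List[int]:
    """Layer indices ordered from highest to lowest contribution."""
    return sorted(contributions, key=lambda layer: contributions[layer],
                  reverse=True)


def _ranks(values: Sequence[float]) -> List[int]:
    """1-based ascending ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(xs), _ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical benchmark scores for a 4-layer toy model (not from the paper).
BASE_SCORE, FULL_RL_SCORE = 40.0, 50.0
SINGLE_LAYER_SCORES = {0: 41.0, 1: 49.0, 2: 51.0, 3: 42.0}

contributions = {
    layer: layer_contribution(BASE_SCORE, FULL_RL_SCORE, score)
    for layer, score in SINGLE_LAYER_SCORES.items()
}
print(rank_layers(contributions))
```

In this toy setup the middle layers rank highest and layer 2 exceeds a contribution of 1.0, which mirrors the qualitative pattern the abstract describes. Calling `spearman` on the contribution lists from two tasks gives a single number for how well their layer rankings agree.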
| Subjects: | Machine Learning (cs.LG); Computation and Language (cs.CL) |
| Cite as: | arXiv:2607.01232 [cs.LG] (or arXiv:2607.01232v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2607.01232 (arXiv-issued DOI via DataCite, pending registration) |