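Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and the Adaptive Decoupled Moment Routing Fix
This paper exposes a widely overlooked pitfall in continual learning: in the regime it tests, mainstream gradient-modification methods fail silently when paired with Adam. If you do research or engineering in this area, the underlying mechanism is worth understanding, because a model can slide back toward vanilla forgetting without any obvious warning.
The study shows that gradient-modification methods in continual learning (such as projection and penalty rescaling) have a hidden failure mode when combined with the Adam optimizer. On an 8-domain language-model task, shared-routing projection baselines land close to vanilla forgetting (12.5–12.8 vs. 13.2), while adaptive decoupled routing stays stable at 9.4, a 3.8-unit improvement over vanilla; on a 16-domain task its advantage over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure comes from Adam's second-moment pathway, which inflates the effective learning rate along old directions, and it also appears with penalty methods, replay mixing, and at 7B-parameter scale under LoRA. The fix routes the modified gradient only to the first moment, keeps the second-moment statistics faithful to the original gradient magnitude, and uses overlap-aware adaptive strength; it is the only tested configuration that consistently avoids collapse.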
Title: Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Abstract: Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5–12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks.
We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
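The second-moment argument can be illustrated on a single coordinate. The sketch below is not the paper's implementation; it assumes a soft projection that simply scales the old-direction gradient component by (1 - alpha), keeps alpha fixed (the overlap-aware adaptive strength is not modeled), and uses a hypothetical helper name, `adam_update`. Under shared routing both Adam moments see the scaled gradient, so the scaling cancels in m / sqrt(v); under decoupled routing only the first moment sees it, and the step shrinks by (1 - alpha) as intended.

```python
import numpy as np

def adam_update(raw_grads, alpha, decoupled, lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the last Adam step on one old-direction coordinate.

    alpha     : assumed projection strength; the modified gradient is
                (1 - alpha) * raw gradient (an illustrative assumption).
    decoupled : if True, the second moment tracks the raw gradient;
                if False (shared routing), it tracks the modified one.
    """
    m, v, step = 0.0, 0.0, 0.0
    for t, g_raw in enumerate(raw_grads, start=1):
        g_mod = (1.0 - alpha) * g_raw            # upstream gradient modification
        g_for_v = g_raw if decoupled else g_mod  # routing choice for the second moment
        m = beta1 * m + (1.0 - beta1) * g_mod
        v = beta2 * v + (1.0 - beta2) * g_for_v ** 2
        m_hat = m / (1.0 - beta1 ** t)           # standard bias correction
        v_hat = v / (1.0 - beta2 ** t)
        step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return step

# Synthetic raw gradients along the old direction (made-up data).
rng = np.random.default_rng(0)
raw = rng.normal(loc=1.0, scale=0.1, size=2000)

for alpha in (0.5, 0.9):
    shared = adam_update(raw, alpha, decoupled=False)
    routed = adam_update(raw, alpha, decoupled=True)
    # The ratio should sit very close to 1 / (1 - alpha).
    print(alpha, shared / routed, 1.0 / (1.0 - alpha))
```

In this toy setting the shared-routing step is larger than the decoupled step by almost exactly 1/(1 - alpha), which is one way to read the inflation factor quoted in the abstract. Whether the paper derives it the same way is not stated here.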
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI) |
| ACM classes: | I.2.6; F.2.2 |
| Cite as: | arXiv:2604.22407 [cs.LG] (or arXiv:2604.22407v1 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.22407 (arXiv-issued DOI via DataCite) |