Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

2026-04-27 12:00 · Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

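Why it was selected

This paper exposes a widely overlooked pitfall in continual learning: in the regime the authors test, common gradient-modification methods (projection, penalty rescaling, replay mixing) lose most of their benefit when paired with Adam. If you do research or engineering in this area, the underlying mechanism is worth understanding, because the failure is largely invisible on clean benchmarks and a model can degrade without any obvious signal.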

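AI summary

The study shows that gradient-modification methods in continual learning (e.g., projection, penalty rescaling) have a hidden failure mode when combined with the Adam optimizer. On a high-overlap 8-domain language-model task, shared-routing projection baselines score close to vanilla forgetting (12.5-12.8 vs. 13.2; lower is better), whereas adaptive decoupled routing stays stable at 9.4, a 3.8-unit improvement over vanilla. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5-4.8 units. The failure arises from Adam's second-moment pathway, which inflates the effective learning rate along old-task directions; the same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. The fix routes the modified gradient only to the first moment, keeps the second-moment statistics magnitude-faithful, and uses an overlap-aware adaptive strength. Among the tested configurations, it is the only one that consistently avoids collapse.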

Original text (English)

Computer Science > Machine Learning

[Submitted on 24 Apr 2026]

Title: Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

Authors: Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song


Abstract: Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5-12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5-4.8 units. The failure is largely invisible on clean benchmarks.
We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1 - alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
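The second-moment argument in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an idealized setting that the abstract does not specify: a constant gradient, an old-task direction aligned with a single coordinate, and a soft projection that simply scales that coordinate by (1 - alpha). The helper names `adam_step` and `run` are hypothetical, and the overlap-aware adaptive strength is not modeled (alpha is fixed), so the sketch shows only the routing mechanism, not the full method.

```python
import numpy as np


def adam_step(m, v, g_first, g_second, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step in which the two moments may be fed different gradients.

    g_first feeds the first moment (update direction); g_second feeds the
    second moment (per-coordinate scale). Passing the same array for both
    recovers standard Adam.
    """
    m = b1 * m + (1 - b1) * g_first
    v = b2 * v + (1 - b2) * g_second ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    eff_lr = lr / (np.sqrt(v_hat) + eps)  # effective per-coordinate step size
    return m, v, -eff_lr * m_hat, eff_lr


def run(alpha, decoupled, steps=200):
    """Return the last update and effective learning rate per coordinate.

    Coordinate 0 stands in for an old-task direction, coordinate 1 is free.
    The (1 - alpha) scaling is an assumed stand-in for a soft projection.
    """
    raw = np.array([1.0, 1.0])
    scale = np.array([1.0 - alpha, 1.0])
    m = np.zeros(2)
    v = np.zeros(2)
    for t in range(1, steps + 1):
        modified = scale * raw
        # Shared routing: both moments see the modified gradient.
        # Decoupled routing: the second moment keeps the raw gradient.
        g_second = raw if decoupled else modified
        m, v, step, eff_lr = adam_step(m, v, modified, g_second, t)
    return step, eff_lr


alpha = 0.5
shared_step, shared_lr = run(alpha, decoupled=False)
routed_step, routed_lr = run(alpha, decoupled=True)
inflation = shared_lr[0] / routed_lr[0]
print("old-direction effective-LR ratio (shared / decoupled):", inflation)
print("old-direction update, shared routing:   ", shared_step[0])
print("old-direction update, decoupled routing:", routed_step[0])
```

In this idealized case, shared routing shrinks the second moment together with the gradient, so the normalization cancels the intended attenuation and the old-direction effective learning rate is larger by a factor of about 1/(1 - alpha). Feeding the raw gradient to the second moment leaves the attenuation in place. Real training has noisy gradients and old directions that are not axis-aligned, so the ratio there would only be approximate.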

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
ACM classes:	I.2.6; F.2.2
Cite as:	arXiv:2604.22407 [cs.LG]
	(or arXiv:2604.22407v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2604.22407 arXiv-issued DOI via DataCite

Submission history

From: Yuelin Hu
[v1] Fri, 24 Apr 2026 10:00:00 UTC (5,621 KB)


Data/Training · Papers/Research · Deployment/Engineering

arXiv: cs.LG (Machine Learning, full category)

Featured 78