RepSelect:通过表示选择性实现鲁棒的LLM遗忘
阅读原文· arxiv.org现有LLM遗忘方法易被微调或少量提示逆转,原因在于目标表示与保留集及攻击者可恢复子空间共享,破坏通用能力且易反制。RepSelect在前向更新前坍缩权重梯度主成分,隔离遗忘集独有表示。在Llama 3、Qwen 3.5、Gemma 4 E4B、DeepSeek V2 Lite四种模型上,针对生物危害知识和滥用倾向两类任务,与GradDiff等五个基线相比,RepSelect使重学习后答案准确率降幅比最强基线大4–50倍,对少量提示攻击近乎完全鲁棒。
Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.