RepSelect: Robust LLM Unlearning via Representation Selectivity

2026-06-15 08:00 · 18 days ago

AI summary

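Existing LLM unlearning methods are easily reversed by fine-tuning or few-shot prompting. The cause is that the representations they target are shared with both the retain set and the subspace a fine-tuning attacker can recover, which damages general capabilities and makes the unlearning easy to undo. RepSelect collapses the top principal components of the weight gradients before each update, isolating the representations specific to the forget set. Across four model families (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite) and two forget categories (biohazardous knowledge and abusive tendencies), and compared with five baselines including GradDiff, RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks.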

Original text · untranslated

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, suggesting that their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
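The abstract describes collapsing the top principal components of weight gradients before each update, but does not spell out the procedure. The following is a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes the principal components come from an SVD of a single 2-D gradient matrix and that "collapsing" means zeroing the top k singular directions. The function names, the plain SGD step, and the parameters `k` and `lr` are all hypothetical.

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Return grad with its k largest singular directions zeroed out.

    Assumption: the "principal components" are the leading singular
    directions of this one gradient matrix. The paper may compute them
    differently (for example, across a batch or across layers).
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # drop the dominant directions (singular values are sorted descending)
    return (u * s) @ vt  # rebuild the matrix from the remaining directions


def filtered_sgd_step(weight: np.ndarray, grad: np.ndarray,
                      lr: float = 1e-2, k: int = 1) -> np.ndarray:
    """Hypothetical plain SGD step that applies the filtered gradient."""
    return weight - lr * collapse_top_components(grad, k)
```

Under this reading, the directions in which the gradient is largest are treated as the ones most likely to be shared with general capabilities, so the update is restricted to the residual directions. Whether that matches the paper's selection rule is an assumption of the sketch.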

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
