# 基于激活补丁技术的LLM知识遗忘深度测量

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-23 08:00
- AIHOT 分数：52
- AIHOT 链接：https://aihot.virxact.com/items/cmpwkgd8m03pqslsnv25vtqrd
- 原文链接：https://arxiv.org/abs/2605.24614

## AI 摘要

大语言模型的知识遗忘是实现隐私保护和AI安全的关键机制，但现有评估方法难以验证目标知识是否从模型内部被真正擦除。本文提出了一种新的度量指标UDS，用于量化遗忘的机制深度。该方法首先在保留模型上定位编码目标知识的层，然后在遗忘后模型上评估其擦除程度（0-1分）。在涵盖8种方法、150个遗忘模型的元评估中，UDS的可靠性与稳健性表现最佳。研究还揭示了不同白盒度量在层级评估上可能存在差异。代码与数据已开源。

## 正文

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
