Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

2026-04-20 08:00 · 74 days ago

AI Summary

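The research team proposes an LLM failure-diagnosis framework based on contrastive attribution and LRP. It quantifies the logit difference between an incorrect output and a correct candidate, attributes that difference to input tokens and internal model states, and supports building cross-layer attribution graphs for long-context inputs. The study runs a systematic empirical evaluation on multiple realistic benchmarks, covering different datasets, model sizes, and training stages. The results show that token-level contrastive attribution gives useful diagnostic signals for some failure cases, but its applicability is clearly limited and it does not yet generalize to all scenarios.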

Original abstract · untranslated

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
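To make the idea of attributing a logit difference to input tokens concrete, here is a minimal sketch on a toy linear model. It is not the paper's code: the tokens, vectors, readout rows, and the `contrastive_attribution` helper are all hypothetical. The toy model sums the token vectors and applies a bias-free linear readout, a setting in which the basic LRP rule coincides with gradient × input. The LRP rules needed for real transformer layers are not reproduced here.

```python
# Toy contrastive attribution on a linear "model" (illustrative only).
# All names and numbers are hypothetical; this is not the paper's code.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_attribution(token_vecs, w_wrong, w_correct):
    """Per-token contribution to logit(wrong) - logit(correct).

    The toy model sums the token vectors and applies one bias-free
    linear readout row per output token, so the logit difference
    decomposes exactly into one term per input token.
    """
    w_diff = [a - b for a, b in zip(w_wrong, w_correct)]
    return [dot(w_diff, vec) for vec in token_vecs]

tokens = ["t0", "t1", "t2"]                      # placeholder input tokens
token_vecs = [[1.0, -2.0], [0.5, 1.0], [0.0, 0.25]]
w_wrong = [1.0, 0.0]     # readout row for the incorrect output token
w_correct = [0.0, 1.0]   # readout row for the correct alternative

scores = contrastive_attribution(token_vecs, w_wrong, w_correct)

# Conservation check: the per-token scores should sum to the logit difference.
pooled = [sum(col) for col in zip(*token_vecs)]
logit_diff = dot(w_wrong, pooled) - dot(w_correct, pooled)

for tok, score in zip(tokens, scores):
    print(f"{tok}: {score:+.2f}")
print(f"sum of scores = {sum(scores):.2f}, logit difference = {logit_diff:.2f}")
```

A positive score marks a token that pushes the model toward the incorrect output, and a negative score marks one that favors the correct alternative. In this toy the scores sum exactly to the logit difference; whether the per-token signal is informative for a real model is the empirical question the abstract raises.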

HuggingFace Daily Papers (community trending papers)


Read the original · arxiv.org
