# Explainability of Large Language Models via Counterfactual Chains and Causal Graphs

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-04 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmq4uqh2101vrslt2aqv2lujs
- Original link: https://arxiv.org/abs/2606.05972

## AI Summary

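The paper proposes a method that uses causal graphs to explain LLM inference, in four phases: discovering class-discriminative concepts, mapping inputs to LLM-perceived concept states, expanding sparse observational data through MCMC-inspired counterfactual augmentation, and performing stable causal discovery with σ-CG. The method is applied to three LLMs on disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. Experiments evaluate the predictive fidelity and structural stability of the causal graphs, along with the convergence and downstream utility of the counterfactual augmentation. The results indicate that the discovered causal graphs capture meaningful dependencies consistent with LLM reasoning, providing a foundation for concept-level explainability.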

## Full Text

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.
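The idea of expanding sparse observations through chains of counterfactuals can be illustrated with a toy random walk over binary concept states. This is only a minimal sketch under strong assumptions, not the paper's procedure: `toy_model` is a hypothetical stand-in for the target LLM, the concept states are hand-picked, and every proposal is accepted, so the chain is a plain symmetric random walk and not a full Metropolis-Hastings sampler.

```python
import random


def toy_model(state):
    """Hypothetical stand-in for the target LLM: maps a concept state to a label."""
    # Assumed rule, for illustration only: predict 1 when at least two concepts are active.
    return int(sum(state) >= 2)


def counterfactual_chain(start, model, steps, rng):
    """Walk a chain of single-concept counterfactuals, recording (state, label) pairs."""
    state = tuple(start)
    samples = [(state, model(state))]
    for _ in range(steps):
        i = rng.randrange(len(state))  # proposal: pick one concept to flip
        state = state[:i] + (1 - state[i],) + state[i + 1:]
        samples.append((state, model(state)))  # query the model on the counterfactual
    return samples


rng = random.Random(0)  # fixed seed so the chain is reproducible
chain = counterfactual_chain((1, 0, 0), toy_model, steps=50, rng=rng)
```

Each element of `chain` pairs a concept state with the stand-in model's label, so a single observed example yields many additional rows that a causal discovery step could consume. In this sketch the flips act directly on concept states; how counterfactual inputs are generated and mapped back to concept states in the actual method is not specified in the abstract.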
