# CausaLab: A Scalable Environment for Interactive Causal Discovery by AI Scientists

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpqhg9d60515slnomo04irxk
- Original link: https://arxiv.org/abs/2605.26029

## AI Summary

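This paper introduces CausaLab, a scalable environment for evaluating the interactive causal discovery abilities of LLM agents. Set in a synthetic laboratory, it evaluates two things: whether an agent can solve a problem using causal evidence, and whether its answer is grounded in a faithfully recovered causal mechanism. In each episode, the agent receives prior observational data, intervenes on a manipulator crystal, and predicts the resonance frequency of a reactor crystal. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations. Experiments show a gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains challenging even for strong agents. The study finds that premature stopping is a major weakness and that consistency verification mitigates it. CausaLab separates predictive success from causal understanding, exposing the limits of current LLM agents as experimental causal reasoners.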

## Main Text

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F1. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
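
To make the gap between prediction and mechanism recovery concrete, here is a minimal sketch in Python. It is not CausaLab's implementation: the linear structural equations, the function names (`sample_linear_scm`, `simulate`, `edge_f1`), and the reading of "all-edge F1" as F1 over directed edges are all assumptions made for illustration. The toy example shows a wrong graph that matches the true prediction under one intervention yet scores only 0.5 on edge F1.

```python
import random


def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random linear SCM (illustrative assumption, not the paper's sampler).

    Nodes are indexed in topological order, so an edge (i, j) always has i < j.
    Returns a dict mapping each directed edge (i, j) to its weight.
    """
    weights = {}
    for j in range(n_nodes):
        for i in range(j):
            if rng.random() < edge_prob:
                weights[(i, j)] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)
    return weights


def simulate(weights, n_nodes, rng, do=None, noise_std=1.0):
    """Draw one sample. `do` maps node -> forced value (a hard intervention)."""
    do = do or {}
    values = [0.0] * n_nodes
    for j in range(n_nodes):
        if j in do:
            values[j] = do[j]  # intervention overrides the structural equation
        else:
            parent_sum = sum(w * values[i] for (i, k), w in weights.items() if k == j)
            values[j] = parent_sum + rng.gauss(0.0, noise_std)
    return values


def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges (assumed reading of the paper's structural metric)."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = len(true_edges & predicted_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)

# A random 6-node SCM, matching the size of the setting quoted in the abstract.
random_scm = sample_linear_scm(6, 0.4, rng)

# Toy case: the true mechanism is a chain 0 -> 1 -> 2.
true_scm = {(0, 1): 2.0, (1, 2): 3.0}
# A wrong hypothesis: node 0 drives node 2 directly, with the composed weight.
wrong_scm = {(0, 1): 2.0, (0, 2): 6.0}

# Noise is switched off so the comparison is deterministic.
pred_true = simulate(true_scm, 3, rng, do={0: 1.0}, noise_std=0.0)[2]
pred_wrong = simulate(wrong_scm, 3, rng, do={0: 1.0}, noise_std=0.0)[2]
structure_score = edge_f1(true_scm, wrong_scm)

# Intervening on the middle node as well separates the two hypotheses.
probe_true = simulate(true_scm, 3, rng, do={0: 1.0, 1: 0.0}, noise_std=0.0)[2]
probe_wrong = simulate(wrong_scm, 3, rng, do={0: 1.0, 1: 0.0}, noise_std=0.0)[2]

print(pred_true, pred_wrong, structure_score, probe_true, probe_wrong)
```

In this toy case both hypotheses predict the same value for node 2 under `do={0: 1.0}`, so a prediction-only score cannot tell them apart. The extra intervention on node 1 does, which is in the spirit of the abstract's observation that mixing observation with intervention improves structural fidelity.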
