CausaLab: A Scalable Environment for Interactive Causal Discovery for AI Scientists

2026-05-28 08:00

AI Summary

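This paper introduces CausaLab, a scalable environment for evaluating the interactive causal discovery abilities of LLM agents. Set in a synthetic laboratory, the environment assesses two dimensions: whether an agent can solve a problem using causal evidence, and whether its answer is grounded in a faithfully recovered causal mechanism. In each episode, the agent receives prior observational data, intervenes on a manipulator crystal, and predicts the resonance frequency of a reactor crystal. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both the causal graph and the structural equations. Experiments reveal a gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but an all-edge F1 of only 0.471. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains challenging even for strong agents. The study identifies premature stopping as a major weakness and finds that consistency verification mitigates it. CausaLab separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.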

Original abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F_1. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
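The setup in the abstract (a randomly sampled SCM, interventions on one node, and an edge-level F1 score for the recovered graph) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not CausaLab's actual implementation: the linear form of the equations with Gaussian noise, the edge-sampling scheme, the function names `sample_scm`, `simulate`, and `edge_f1`, and the reading of "all-edge F_1" as F1 over directed edges.

```python
import random

def sample_scm(n_nodes, edge_prob, rng):
    """Sample a random linear SCM (assumed form, not the paper's).

    Nodes are 0..n_nodes-1 and edges only go from lower to higher index,
    so the graph is acyclic. Returns {(parent, child): weight}.
    """
    weights = {}
    for child in range(n_nodes):
        for parent in range(child):
            if rng.random() < edge_prob:
                weights[(parent, child)] = rng.uniform(0.5, 2.0) * rng.choice([-1, 1])
    return weights

def simulate(weights, n_nodes, rng, do=None, noise_scale=0.1):
    """Draw one sample; `do` maps node -> forced value (an intervention)."""
    do = do or {}
    values = [0.0] * n_nodes
    for node in range(n_nodes):  # index order is a topological order
        if node in do:
            values[node] = do[node]  # intervention overrides the node's equation
        else:
            total = sum(w * values[p] for (p, c), w in weights.items() if c == node)
            values[node] = total + rng.gauss(0.0, noise_scale)
    return values

def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges between the true and the recovered graph."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = len(true_edges & predicted_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

rng = random.Random(0)
scm = sample_scm(6, 0.4, rng)                 # hidden 6-node mechanism
observed = simulate(scm, 6, rng)              # purely observational sample
intervened = simulate(scm, 6, rng, do={0: 1.0})  # sample under do(node 0 = 1.0)
guess = [(0, 1), (1, 2)]                      # hypothetical graph proposed by an agent
print(edge_f1(scm.keys(), guess))
```

Under these assumptions, an agent could predict a downstream value well from correlations alone while `edge_f1` stays low, which is the kind of gap between prediction and mechanism recovery the abstract reports.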

HuggingFace Daily Papers (trending community papers)

Read the original: arxiv.org
