Premature Commitment: Diagnosing a Hidden Failure in Which LLM Agents Lock Onto Evidence Too Early
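Source: arxiv.org
Long-horizon LLM agents exhibit a "premature commitment" failure: they pick one interpretation of the evidence early and hold to it, which final-answer scoring cannot capture. The study uses cross-run hidden-state convergence as its measure of commitment. With Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial correlation -0.45). The signal replicates on Qwen-2.5-72B, Phi-3-14B, and StrategyQA (r = -0.83). Commitment does not track correctness. A runtime monitor detects inconsistent trajectories with AUROC up to 0.97 (0.85–0.88 under a strict split), and a prompting intervention reduces behavioral variance by 28% with no significant change in accuracy. The result is a diagnostic tool for a hidden process failure, with its limitations stated explicitly.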
Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses the failure mode because it sees only the answer, not whether the process has already collapsed to a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a localized temporal and layer-wise signature. The signal replicates across Qwen-2.5-72B and Phi-3-14B, and on StrategyQA (r = -0.83). It does not track correctness: committed-wrong and committed-correct questions are not separable in activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85–0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure, with clear limits rather than a general accuracy lever.
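To make "cross-run hidden-state convergence at a fixed reasoning step" concrete, here is a minimal Python sketch of one way such a score could be computed. The abstract does not specify the similarity measure, layer, or pooling, so the mean pairwise cosine similarity used here is an assumption, and the names `commitment_score` and `pearson_r` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def commitment_score(hidden_states: np.ndarray) -> float:
    """Mean pairwise cosine similarity across runs of one question.

    hidden_states has shape (n_runs, d): one hidden-state vector per
    run, all taken at the same reasoning step (e.g. step 4).
    ASSUMPTION: cosine similarity is an illustrative choice; the
    source only says "cross-run hidden-state convergence".
    """
    if hidden_states.ndim != 2 or hidden_states.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two runs")
    # Normalize each run's vector to unit length (guard against zeros).
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-12, None)
    # Cosine similarity between every pair of runs.
    sims = unit @ unit.T
    # Average over distinct pairs only (strict upper triangle).
    upper = np.triu_indices(hidden_states.shape[0], k=1)
    return float(sims[upper].mean())


def pearson_r(x, y) -> float:
    """Pearson correlation between per-question scores and a
    per-question behavioral measure."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```

In use, one would compute `commitment_score` once per question from several sampled runs, then pass the per-question scores and a per-question behavioral measure to `pearson_r`. The partial correlation reported in the abstract would additionally require regressing out covariates, which this sketch leaves out because the abstract does not name them.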