# Exploration and Exploitation Errors of Language Model Agents Can Be Quantitatively Measured

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo0xuwt4035fsli29qvzm4sm
- Original link: https://arxiv.org/abs/2604.13151

## AI Summary

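The researchers build controllable test environments inspired by embodied AI scenarios, each consisting of a partially observable 2D grid map and an unknown task DAG, and design a policy-agnostic evaluation metric that quantifies the exploration and exploitation errors of language model agents. The evaluation shows that current frontier models struggle on the task and exhibit distinct failure modes, while reasoning models solve it more effectively. The study also finds that minimal harness engineering can significantly improve an agent's exploration and exploitation. The code has been released as open source.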

## Main Text

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions, without access to the agent's internal policy, remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric that quantifies exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively, and we show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code at https://github.com/jjj-madison/measurable-explore-exploit.
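To make the environment description concrete, here is a minimal Python sketch of a partially observable grid paired with a task DAG whose prerequisites the agent cannot see. It is an illustration only, not the authors' implementation: the class `GridTaskEnv`, its fields, the square view radius, and the example tasks `get_key` and `open_door` are all hypothetical, and the paper's error metric is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class GridTaskEnv:
    """Hypothetical toy environment: partially observable grid plus a hidden task DAG."""
    width: int
    height: int
    task_cells: dict        # task name -> (x, y) cell where the task is performed
    prerequisites: dict     # task name -> set of tasks that must be done first (DAG edges)
    view_radius: int = 1    # assumption: the agent sees a square window around itself
    agent: tuple = (0, 0)
    seen: set = field(default_factory=set)
    done: set = field(default_factory=set)

    def __post_init__(self):
        self._reveal()

    def _reveal(self):
        # Mark every in-bounds cell inside the view window as observed.
        ax, ay = self.agent
        r = self.view_radius
        for x in range(max(0, ax - r), min(self.width, ax + r + 1)):
            for y in range(max(0, ay - r), min(self.height, ay + r + 1)):
                self.seen.add((x, y))

    def move(self, dx, dy):
        # Clamp the move to the grid, then update what has been observed.
        x = min(max(self.agent[0] + dx, 0), self.width - 1)
        y = min(max(self.agent[1] + dy, 0), self.height - 1)
        self.agent = (x, y)
        self._reveal()

    def visible_tasks(self):
        # Only tasks located on already-observed cells are known to the agent.
        return {t for t, cell in self.task_cells.items() if cell in self.seen}

    def attempt(self, task):
        # Succeeds only at the task's cell and only once all prerequisites are done.
        # The DAG itself is never shown to the agent; it is learned through outcomes.
        if self.task_cells.get(task) != self.agent:
            return False
        if not self.prerequisites.get(task, set()) <= self.done:
            return False
        self.done.add(task)
        return True


env = GridTaskEnv(
    width=5,
    height=5,
    task_cells={"get_key": (0, 2), "open_door": (4, 4)},
    prerequisites={"open_door": {"get_key"}},
)
```

In a setup like this, a larger grid with a small `view_radius` would plausibly stress exploration, while a deeper prerequisite DAG would stress exploitation of what has already been learned. How the paper's generator actually trades these off is not stated in the abstract.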