# CoffeeBench: Benchmarking LLM Agents in a Long-Horizon Heterogeneous Multi-Agent Economy

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 37
- AIHOT link: https://aihot.virxact.com/items/cmqwqyy1j006lslikkmk3nb62
- Original link: https://arxiv.org/abs/2606.16613

## AI Summary

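CoffeeBench evaluates LLM agents in a long-horizon multi-agent economy. It simulates a 90-day economy of heterogeneous firms made up of two farmers, two roasters, and two retailers, in which each agent tries to maximize its cumulative net income through communication and transactions. The evaluated model controls one coffee roaster; the remaining firms are controlled by fixed reference agents. Several open-weight and proprietary LLMs were tested: all of them outperform a passive baseline that takes no actions, and most achieve positive net income. Higher-performing models communicate more frequently, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction.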

## Main Text

As LLM agents become capable of longer-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.
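
To make the setup described above concrete, the following is a minimal Python sketch. It is not the CoffeeBench code. Only the firm roster (two farmers, two roasters, two retailers), the 90-day horizon, the cumulative net income objective, and the no-action passive baseline come from the abstract. The firm identifiers, the `Ledger` class, and the `idle_fraction` helper are hypothetical names introduced here for illustration, and `idle_fraction` is just one simple way a pattern like idle drift might be quantified, not necessarily the measure the authors use.

```python
from dataclasses import dataclass, field

SIM_DAYS = 90  # simulation length stated in the abstract

# Firm roster from the abstract; the identifiers are hypothetical.
FIRMS = {
    "farmer_1": "farmer",
    "farmer_2": "farmer",
    "roaster_1": "roaster",  # assumed here to be the slot controlled by the evaluated model
    "roaster_2": "roaster",
    "retailer_1": "retailer",
    "retailer_2": "retailer",
}


@dataclass
class Ledger:
    """Hypothetical per-firm ledger tracking daily net income."""

    daily_net: list[float] = field(default_factory=list)

    def record_day(self, revenue: float, costs: float) -> None:
        # Net income for one day is revenue minus costs.
        self.daily_net.append(revenue - costs)

    @property
    def cumulative_net_income(self) -> float:
        # The objective named in the abstract: net income summed over all days.
        return sum(self.daily_net)


def passive_policy(day: int) -> list[str]:
    """Passive baseline: takes no actions on any day."""
    return []


def idle_fraction(actions_per_day: list[list[str]]) -> float:
    """Share of days on which the agent chose no action (illustrative metric)."""
    if not actions_per_day:
        return 0.0
    idle_days = sum(1 for actions in actions_per_day if not actions)
    return idle_days / len(actions_per_day)


if __name__ == "__main__":
    trajectory = [passive_policy(day) for day in range(SIM_DAYS)]
    print(f"idle fraction of passive baseline: {idle_fraction(trajectory):.2f}")
```

Under this sketch, a passive agent is idle on every day by construction, while an agent drifting toward inaction would show an idle fraction that rises over the course of the run.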
