# Benchmark models against your actual use case: Gemini 3.1 Pro vs GPT-5.5 café case

- Source: Ethan Mollick (@emollick)
- Published: 2026-07-02 01:51
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmr2drrdh074ksl8zz0ujlj29
- Original link: https://x.com/emollick/status/2072377689411932380

## AI Summary

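The main tweet stresses that models must be benchmarked against the actual use case: differences between models are amplified as decisions stack on top of one another, and no standard benchmark reveals that Gemini 3.1 is less concerned about a café's financial losses than GPT-5.5. The quoted case: Andon Labs' AI agent ran a café in Stockholm on Gemini 3.1 Pro, over-ordered, and was easy to fool. It spent $15k against only $9k in revenue, a $6k loss, and has since been switched to GPT-5.5.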
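To make the "stacked decisions amplify differences" point concrete, here is a minimal sketch in Python. It assumes each decision succeeds independently with a fixed per-step probability. The two model names and their probabilities (0.95 and 0.90) are hypothetical illustration values, not measurements of Gemini 3.1 Pro, GPT-5.5, or the Andon Café run.

```python
# Toy model of compounding decision quality.
# Assumption: every decision is independent and succeeds with a fixed probability.
# The per-step probabilities below are made up for illustration only.

HYPOTHETICAL_MODELS = {"model_a": 0.95, "model_b": 0.90}


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent decisions go well."""
    return p_step ** n_steps


def compare(n_steps: int) -> dict:
    """Chain-success probability for each hypothetical model at a given depth."""
    return {
        name: chain_success(p, n_steps)
        for name, p in HYPOTHETICAL_MODELS.items()
    }


def advantage_ratio(n_steps: int) -> float:
    """How many times more likely model_a is than model_b to finish cleanly."""
    result = compare(n_steps)
    return result["model_a"] / result["model_b"]


if __name__ == "__main__":
    for n in (1, 5, 20):
        result = compare(n)
        print(
            f"steps={n:>2}  "
            f"model_a={result['model_a']:.3f}  "
            f"model_b={result['model_b']:.3f}  "
            f"ratio={advantage_ratio(n):.2f}"
        )
```

Under these assumed numbers, a five-point gap on a single decision grows to roughly a threefold gap after 20 chained decisions, which a single-step benchmark would not surface. Real agent runs violate the independence assumption, so this is an intuition aid rather than a predictive model.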

## Original Text

You really need to benchmark models for your use case.

As soon as judgements & decisions stack on top of each other, the differences between models amplifies, and no standard benchmark will tell you that Gemini 3.1 is less worried about financial losses at a cafe than GPT-5.5

### Quoted Tweet

> Andon Labs: Gemini 3.1 Pro lost $6k running Andon Café. 2 months ago, our AI agent opened a café in Stockholm. It over-ordered and was easy to fool, spending $15k with supp...
