Ethan Mollick (@emollick)

2026-07-02 01:51 · 1 day ago

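AI summary

The main post stresses that models must be benchmarked against your actual use case, because differences between models are amplified as decisions stack on one another, and standard benchmarks cannot show that Gemini 3.1 is less worried about a café's financial losses than GPT-5.5. Quoted case: Andon Labs' AI agent, running on Gemini 3.1 Pro, opened a café in Stockholm; it over-ordered and was easy to fool, spending $15k against only $9k in revenue, a $6k loss, and has now been switched to GPT-5.5.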

You really need to benchmark models for your use case.

As soon as judgements & decisions stack on top of each other, the differences between models amplifies, and no standard benchmark will tell you that Gemini 3.1 is less worried about financial losses at a cafe than GPT-5.5

Andon Labs: Gemini 3.1 Pro lost $6k running Andon Café. 2 months ago, our AI agent opened a café in Stockholm. It over-ordered and was easy to fool, spending $15k with supp...

Agents Google OpenAI Phenomena/Trends
