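The main tweet stresses that you must benchmark models against your actual use case, because differences between models are amplified when decisions stack on top of each other, and standard benchmarks cannot show that Gemini 3.1 cares less than GPT-5.5 about financial losses at a cafe. Cited case: Andon Labs' AI agent ran a cafe in Stockholm on Gemini 3.1 Pro, over-purchased and was easily deceived, spending $15k against only $9k in revenue for a $6k loss; it has since been switched to GPT-5.5.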
You really need to benchmark models for your use case.
As soon as judgements & decisions stack on top of each other, the differences between models amplify, and no standard benchmark will tell you that Gemini 3.1 is less worried about financial losses at a cafe than GPT-5.5.
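
One rough way to picture the amplification claim is a toy model in which each decision succeeds independently with a fixed probability. That independence, the function name `chain_success`, and the two per-decision rates below are all assumptions made for illustration; they are not measurements of Gemini 3.1, GPT-5.5, or any other model.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` decisions go well, assuming independence."""
    return per_step ** steps


# Hypothetical per-decision success rates for two unnamed models (not measured).
model_a, model_b = 0.98, 0.95

for steps in (1, 10, 50):
    a = chain_success(model_a, steps)
    b = chain_success(model_b, steps)
    # The A/B ratio grows with the number of stacked decisions.
    print(f"{steps:>3} decisions: A={a:.3f}  B={b:.3f}  ratio A/B={a / b:.2f}")
```

In this sketch a gap that looks small on a single decision widens as the chain gets longer, which is why a benchmark built around your own multi-step task can surface differences that a single-turn benchmark would miss.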