# Ethan Mollick: You Really Need Your Own Benchmarks

- Source: Ethan Mollick (@emollick)
- Published: 2026-07-02 23:02
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmr3mwn3d004wsl7lyj4xutqt
- Original link: https://x.com/emollick/status/2072697436963832023

## AI Summary

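Ethan Mollick argues for evaluating models with your own custom benchmarks rather than relying on generic benchmarks or simply swapping models. His examples: use Gemini 3.5 Flash to translate Egyptian hieroglyphics, and Opus 4.8 to run a vending machine. JakeABoggs's HieroglyphBench results show Anthropic's Fable 5 effectively tied with GPT-5.5, with both far behind the Gemini family; Gemini 3.5 Flash scores more than twice as high as Fable 5.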
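
As a rough illustration of what "your own benchmark" can mean in practice, here is a minimal sketch in Python. It is not Mollick's or Boggs's method. The `Model` type, the `score` and `rank` helpers, the toy cases, and the two stand-in models are all hypothetical; a real setup would wrap actual API clients and would likely need a fuzzier scoring rule than exact match.

```python
from typing import Callable, Dict, List, Tuple

# A "model" is any callable from prompt to answer. In practice this would
# wrap a real API client; the stand-ins below are hypothetical.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def score(model: Model, cases: List[Case]) -> float:
    """Return the fraction of cases answered exactly right (case-insensitive)."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


def rank(models: Dict[str, Model], cases: List[Case]) -> List[Tuple[str, float]]:
    """Score every model on the same cases and sort best first."""
    results = [(name, score(m, cases)) for name, m in models.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy cases and toy models, made up purely for illustration.
CASES: List[Case] = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "")


if __name__ == "__main__":
    for name, acc in rank({"model_a": model_a, "model_b": model_b}, CASES):
        print(f"{name}: {acc:.2f}")
```

The point of the design is that every candidate model is scored on the same task-specific cases before any swap is made, so the comparison reflects your workload and not a generic leaderboard.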

## Full Text

You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8.

(This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)

### Quoted Tweet

> Jake Boggs: Fable 5 is a large step for Anthropic's vision capabilities and effectively ties with GPT-5.5 on HieroglyphBench, my benchmark which tests how well VLMs can tra...