Ethan Mollick (@emollick)

2026-07-02 23:02 · 8 hours ago

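AI summary

Ethan Mollick argues for evaluating models with your own custom benchmarks rather than relying on generic benchmarks or simply swapping models. His examples: use Gemini 3.5 Flash to translate Egyptian hieroglyphs, and Opus 4.8 to run a vending machine. JakeABoggs's HieroglyphBench shows Anthropic's Fable 5 tied with GPT-5.5, but both far behind the Gemini family, with Gemini 3.5 Flash scoring more than twice as high as Fable 5.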

You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8.

(This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)

Jake Boggs: Fable 5 is a large step for Anthropic's vision capabilities and effectively ties with GPT-5.5 on HieroglyphBench, my benchmark which tests how well VLMs can tra...

Tags: Multimodal · Expert opinion · Evaluation/Benchmarks
