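Ethan Mollick argues for evaluating models with your own custom benchmarks rather than relying on generic benchmarks or simply swapping models. His examples: use Gemini 3.5 Flash to translate Egyptian hieroglyphs, and Opus 4.8 to run a vending machine. JakeABoggs's HieroglyphBench results show Anthropic Fable 5 on par with GPT-5.5, with both far behind the Gemini family; Gemini 3.5 Flash scores more than twice as high as Fable 5.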
You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8.
(This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)
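
To make "your own benchmarks" concrete, here is a minimal sketch of a task-specific comparison harness. Everything in it is an assumption for illustration: the `Model` type, `run_benchmark`, `exact_match`, the stub models, and the toy test cases are hypothetical and are not taken from HieroglyphBench or any vendor API. In practice each stub would wrap a real model call, and the grader would be whatever fits your task.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a "model" is any function mapping a prompt to an answer.
Model = Callable[[str], str]


def exact_match(predicted: str, expected: str) -> bool:
    """Grade one answer; replace with a task-specific grader as needed."""
    return predicted.strip().lower() == expected.strip().lower()


def run_benchmark(
    models: Dict[str, Model], cases: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Return, per model, the fraction of cases answered correctly."""
    if not cases:
        raise ValueError("benchmark needs at least one test case")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        correct = sum(
            exact_match(model(prompt), expected) for prompt, expected in cases
        )
        scores[name] = correct / len(cases)
    return scores


# Placeholder cases and stub models, for illustration only.
cases = [("2 + 2", "4"), ("capital of France", "Paris"), ("3 * 3", "9")]


def stub_model_a(prompt: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "9"}
    return answers.get(prompt, "")


def stub_model_b(prompt: str) -> str:
    return {"2 + 2": "4"}.get(prompt, "")


scores = run_benchmark({"model_a": stub_model_a, "model_b": stub_model_b}, cases)
best = max(scores, key=scores.get)  # model with the highest score on YOUR task
```

The point of keeping the harness this small is that the test cases carry the value: a few dozen examples drawn from your actual workload are likely to tell you more about a model swap than a generic leaderboard.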