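Anthropic's Opus 4.8 shows a significant improvement over Opus 4.7 on the DeepSWE benchmark while also lowering the average cost per task. Specifically, at the default high thinking effort (xhigh) setting, it scores 6% higher than Opus 4.7 xhigh. However, GPT-5.5 xhigh still leads on this benchmark by a clear margin, and at lower cost. The tweet's author is impressed by OpenAI's recent model releases and is looking forward to GPT-5.6, while also starting to warm to Opus 4.8, and sees the present as a moment in which both frontier labs keep shipping genuinely impressive models.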
Opus 4.8 is a solid jump over Opus 4.7 on DeepSWE, while also lowering the average cost per task.
However, GPT-5.5 xhigh still beats it by a pretty clear margin while being cheaper.
OpenAI has been cooking insanely hard with its models lately. Really excited to see what GPT-5.6 brings.
That said, I have to admit: I'm starting to really like Opus 4.8 as well.
We've entered a moment where both frontier labs keep shipping genuinely impressive models.