It is difficult to know how good MAI-Thinking-1 is from the scores alone (like weirdly low GPQA & Terminal Bench 2.0).
But Microsoft makes it really hard to try its models upon release (a general issue with many Microsoft AI products), so I dunno. Stats below Meta Spark, though.