Artificial Analysis (@ArtificialAnlys)

2026-05-01 14:13

AI Summary

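Last week, three leading open weights models were released: Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro. They score 52-54 on the Artificial Analysis Intelligence Index, which puts the best of them 6 points behind the top proprietary model, GPT-5.5, at 60. This is a marked improvement over a year ago, when the best open weights model scored 22. All three are trillion-parameter-scale MoE architectures. However, open weights models still trail proprietary models clearly on complex reasoning, agentic coding, and knowledge accuracy: they score far lower on specialized evaluations such as HLE, CritPt, and TerminalBench Hard, and on the Omniscience evaluation DeepSeek V4 Pro's hallucination problem is especially pronounced.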

All three leading open weights models were released last week. Progress continues for open weights models alongside proprietary ones, with the gap to GPT-5.5, the leading proprietary model, sitting at 6 points on the Artificial Analysis Intelligence Index.

@Kimi_Moonshot's Kimi K2.6 (Reasoning) and @Xiaomi's MiMo V2.5 Pro (Reasoning) tie as the leading open weights models on the Artificial Analysis Intelligence Index at 54, with @deepseek_ai's DeepSeek V4 Pro (Reasoning, Max Effort) at 52. This places the best open weights models within 3-6 points of the leading proprietary models: @OpenAI's GPT-5.5 (xhigh) at 60, and @Google's Gemini 3.1 Pro Preview and @AnthropicAI's Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57.

For context: just one year ago the highest-scoring open weights model was DeepSeek V3 0324, which achieved 22 on the Intelligence Index and was ~13 points below the highest-scoring proprietary model, Claude 3.7 Sonnet (Reasoning), at 35.
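The gap figures quoted so far follow from simple subtraction of the Intelligence Index scores. A minimal Python sketch, using only the numbers quoted in this post (the dictionary and function names are illustrative assumptions, not part of any Artificial Analysis tooling):

```python
# Intelligence Index scores as quoted in the post (hard-coded, not fetched from any API).
OPEN_WEIGHTS_NOW = {
    "Kimi K2.6 (Reasoning)": 54,
    "MiMo V2.5 Pro (Reasoning)": 54,
    "DeepSeek V4 Pro (Reasoning, Max Effort)": 52,
}
PROPRIETARY_NOW = {
    "GPT-5.5 (xhigh)": 60,
    "Gemini 3.1 Pro Preview": 57,
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": 57,
}
OPEN_WEIGHTS_YEAR_AGO = {"DeepSeek V3 0324": 22}
PROPRIETARY_YEAR_AGO = {"Claude 3.7 Sonnet (Reasoning)": 35}


def frontier_gap(open_scores, proprietary_scores):
    """Hypothetical helper: best proprietary score minus best open weights score."""
    return max(proprietary_scores.values()) - max(open_scores.values())


gap_now = frontier_gap(OPEN_WEIGHTS_NOW, PROPRIETARY_NOW)
gap_year_ago = frontier_gap(OPEN_WEIGHTS_YEAR_AGO, PROPRIETARY_YEAR_AGO)

print(f"Gap one year ago: {gap_year_ago} points")  # 13
print(f"Gap now: {gap_now} points")  # 6
```

The year-ago comparison covers only the single top model on each side, since those are the only year-ago scores the post quotes.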

Key takeaways:

➤ The top three most intelligent open weights models are trillion-plus-parameter MoE architectures with permissive licenses. Kimi K2.6 (Reasoning) has 1T total / 32B active parameters with a 256K context window, MiMo V2.5 Pro (Reasoning) has 1T total / 42B active with a 1M context window, and DeepSeek V4 Pro (Reasoning, Max Effort) has 1.6T total / 49B active with a 1M context window.

➤ The gap to proprietary remains wide on the hardest reasoning and agentic coding evaluations. On HLE (Humanity's Last Exam) the three top open weights models score 34-36%, vs 44% for GPT-5.5 (xhigh) and 45% for Gemini 3.1 Pro Preview. On CritPt (Research-level Physics) they score 4-12%, vs 27% for GPT-5.5 (xhigh). On TerminalBench Hard (Agentic Coding & Terminal Use) they score 43-46%, vs 61% for GPT-5.5 (xhigh) and 54% for Gemini 3.1 Pro Preview.

➤ Omniscience (knowledge + hallucination) shows a large gap to proprietary models, with DeepSeek V4 Pro (Reasoning, Max Effort) hallucinating significantly more than its open weights peers. DeepSeek V4 Pro (Reasoning, Max Effort) scores -10, MiMo V2.5 Pro (Reasoning) +4, and Kimi K2.6 (Reasoning) +6. By comparison, GPT-5.5 (xhigh) scores +20, Claude Opus 4.7 (Adaptive Reasoning, Max Effort) +26, and Gemini 3.1 Pro Preview +33.
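The parameter counts in the first takeaway imply that each of these MoE models activates only a few percent of its weights per token. A rough sketch of that arithmetic in Python, assuming the rounded totals quoted in the post (1T = 10^12, 1B = 10^9); the names are illustrative, and the resulting shares are only as precise as the rounded inputs:

```python
# (total parameters, active parameters) as quoted in the post, rounded.
MODELS = {
    "Kimi K2.6 (Reasoning)": (1.0e12, 32e9),
    "MiMo V2.5 Pro (Reasoning)": (1.0e12, 42e9),
    "DeepSeek V4 Pro (Reasoning, Max Effort)": (1.6e12, 49e9),
}


def active_share(total_params, active_params):
    """Hypothetical helper: fraction of parameters active per token."""
    return active_params / total_params


shares = {name: active_share(total, active) for name, (total, active) in MODELS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of parameters active")
```

On these figures all three models sit between roughly 3% and 4% active parameters, despite DeepSeek V4 Pro having the largest total.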

Artificial Analysis (@ArtificialAnlys) · X
