MiniMax M3: Open-weight model with a million-token context challenges proprietary leaders

2026-06-01 21:38 · Jonathan Kemper

Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.

According to MiniMax, that combination was previously out of reach for open models and reserved for proprietary systems like Opus 4.7, GPT-5.5, or Gemini 3.1 Pro. A new attention mechanism makes the leap possible by stretching the context window to one million tokens without letting compute costs spiral out of control. In internal tests, M3 also planned, debugged, and self-corrected on its own over many hours.

Benchmarks put M3 in proprietary territory

On SWE-Bench Pro, an established software development benchmark, M3 scores 59 percent according to MiniMax. That puts it ahead of GPT-5.5 and Gemini 3.1 Pro, but just behind Opus 4.7. M3 also lands in proprietary-class territory on terminal tasks and tool use. On BrowseComp, which tests autonomous web search, it pulls ahead of Opus 4.7 with 83.5 points to 79.3. Anthropic has since shipped Opus 4.8, a somewhat stronger model.

To get closer to real developer workflows, MiniMax built a simulator framework that mimics typical behavior patterns. These include refining requirements, discussing solution approaches, reacting to intermediate results, and carrying tasks across multiple contexts. This exposes the model to multi-turn collaboration during training, not just single, clearly defined prompts.
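MiniMax has not published the simulator, so the following is only a toy sketch of the idea, assuming a simulated developer who picks one of the four behavior patterns named above on each turn. The names `Behavior` and `simulate_session` are hypothetical and do not come from MiniMax.

```python
import random
from enum import Enum


class Behavior(Enum):
    """The four behavior patterns listed in the article (names are illustrative)."""
    REFINE_REQUIREMENTS = "refine the requirements"
    DISCUSS_APPROACH = "discuss a solution approach"
    REACT_TO_RESULT = "react to an intermediate result"
    CARRY_ACROSS_CONTEXT = "carry the task into a new context"


def simulate_session(num_turns: int, seed: int = 0) -> list[Behavior]:
    """Return a reproducible sequence of simulated developer behaviors.

    Assumption: behaviors are drawn uniformly at random. A real simulator
    would presumably condition each turn on the conversation so far.
    """
    rng = random.Random(seed)
    return [rng.choice(list(Behavior)) for _ in range(num_turns)]


session = simulate_session(5)
for turn, behavior in enumerate(session, start=1):
    print(f"turn {turn}: simulated developer chooses to {behavior.value}")
```

The point of the sketch is the shape of the training signal: a session is a sequence of varied turns, not one clearly defined prompt.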

Three tests show long-running autonomy

MiniMax describes three internal experiments designed to show how these capabilities work together. In the first, the team had M3 independently reproduce a paper on LLM fine-tuning. The model worked for nearly twelve hours without intervention, produced 18 commits and 23 figures, and confirmed the paper's key findings.

In the second test, M3 was asked to optimize a compute kernel for matrix multiplications on Nvidia Hopper GPUs, one of the most compute-intensive building blocks in large-model inference. Experienced teams typically need one to two weeks for this, according to MiniMax. M3 got only a task description, a benchmark script, and a non-functional code skeleton with no reference solution to copy from. After about 24 hours, the model had pushed Hopper hardware utilization from 7.6 to 71.3 percent. Most other tested models gave up after a few dozen attempts, while M3 worked through several plateaus and didn't reach its best solution until attempt 145.

When optimizing an FP8 kernel, M3 reaches 71.3 percent of Hopper peak performance after 147 runs, pulling ahead of Opus 4.7. Anthropic's model needs far fewer runs, though.
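To put the utilization figures in perspective, here is a small sketch of the arithmetic, assuming utilization means achieved throughput as a share of theoretical peak. The function names are illustrative; only the 7.6 and 71.3 percent values come from MiniMax.

```python
def utilization_percent(achieved_tflops: float, peak_tflops: float) -> float:
    """Share of theoretical peak throughput actually achieved, in percent."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return 100.0 * achieved_tflops / peak_tflops


def relative_gain(before_percent: float, after_percent: float) -> float:
    """How many times faster the kernel runs on the same hardware."""
    if before_percent <= 0:
        raise ValueError("before_percent must be positive")
    return after_percent / before_percent


START_UTILIZATION = 7.6   # percent of peak at the start, per MiniMax
BEST_UTILIZATION = 71.3   # percent of peak for M3's best kernel, per MiniMax

gain = relative_gain(START_UTILIZATION, BEST_UTILIZATION)
points = round(BEST_UTILIZATION - START_UTILIZATION, 1)
print(f"speedup: {gain:.1f}x, gain: {points} percentage points")
```

Because peak throughput is fixed for a given GPU, the ratio of the two utilization values equals the kernel speedup, roughly 9.4 times here.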

In the third test, PostTrainBench, M3 was tasked with training four base models on its own, handling data synthesis, training, evaluation, and iteration without human input. The model landed just behind Opus 4.7 and GPT-5.5 but well ahead of the remaining tested models.

The Decoder: AI News (RSS)
