MiniMax M3: Open-weight model with a million-token context challenges proprietary leaders
Chinese AI company MiniMax has released its new model M3. It's billed as the first open-weight model to combine top-tier coding performance, a one-million-token context window, and native multimodality.
According to MiniMax, that combination was previously out of reach for open models and reserved for proprietary systems like Opus 4.7, GPT-5.5, or Gemini 3.1 Pro. A new attention mechanism makes the leap possible by stretching the context window to one million tokens without letting compute costs spiral out of control. In internal tests, M3 also planned, debugged, and self-corrected on its own over many hours.
Benchmarks put M3 in proprietary territory
On SWE-Bench Pro, an established software development benchmark, M3 scores 59 percent according to MiniMax. That puts it ahead of GPT-5.5 and Gemini 3.1 Pro, but just behind Opus 4.7. M3 also lands in proprietary-class territory on terminal tasks and tool use. On BrowseComp, which tests autonomous web search, it scores 83.5 points and pulls ahead of Opus 4.7 (79.3). Anthropic has since shipped Opus 4.8, a somewhat stronger model.
To get closer to real developer workflows, MiniMax built a simulator framework that mimics typical behavior patterns. These include refining requirements, discussing solution approaches, reacting to intermediate results, and carrying tasks across multiple contexts. This exposes the model to multi-turn collaboration during training, not just single, clearly defined prompts.
Three tests show long-running autonomy
MiniMax describes three internal experiments designed to show how these capabilities work together. In the first, the team had M3 independently reproduce a paper on LLM fine-tuning. The model worked for nearly twelve hours without intervention, produced 18 commits and 23 figures, and confirmed the paper's key findings.
In the second test, M3 was asked to optimize a compute kernel for matrix multiplications on Nvidia Hopper GPUs, one of the most compute-intensive building blocks in large-model inference. Experienced teams typically need one to two weeks for this, according to MiniMax. M3 got only a task description, a benchmark script, and a non-functional code skeleton with no reference solution to copy from. After about 24 hours, the model had pushed Hopper hardware utilization from 7.6 to 71.3 percent. Most other tested models gave up after a few dozen attempts, while M3 worked through several plateaus and didn't reach its best solution until attempt 145.
When optimizing an FP8 kernel, M3 reaches 71.3 percent of Hopper peak performance after 147 runs, pulling ahead of Opus 4.7. Anthropic's model needs far fewer runs, though.
In the third test, PostTrainBench, M3 was tasked with training four base models on its own, which meant synthesizing data, running the training, evaluating the results, and iterating without human input. The model landed just behind Opus 4.7 and GPT-5.5 but well ahead of the remaining tested models.