MiniMax M3模型通过Live Session分享了核心信息。其MSA技术采用块级Top-K选择,保持真实、未压缩的KV缓存,使1M token上下文窗口高效运行。该技术将长上下文生成的注意力内核解码时间从约30%降至约5%,效率提升显著。M3是原生多模态模型,支持图像视频输入,可处理长程智能体任务及桌面操作,并具备视觉自评估迭代能力。模型在金融任务中展现出初级分析师水平。未来版本将聚焦更复杂的长程任务,并扩展金融、法律与生物领域。Together AI为其提供推理服务。
We wrapped a live session on M3 yesterday with the @togethercompute team & our researchers @zpysky1125 and @HaohaiSun
A few highlights 🧵 1. MSA (MiniMax Sparse Attention) is the star ⭐️. Unlike CSA/HCA, which compress the KV cache, MSA keeps the real, uncompressed KV and does block-level selection with a small top-K. That's how the 1M context window stays tractable.
- The efficiency win is huge. In our previous generation, ~30% of per-decode wall-clock time went to the attention kernel. With MSA that now drops to ~5%. Big gains for long-context generation.
- M3 isn't just a coding model. Natively multimodal (image + video in), ability to handle long-horizon agentic tasks, and even operate a desktop computer. People are already throwing game-dev + Minecraft-style builds at it (Unity included) and it's holding its own.