# Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions

- Source: Fuli Luo (@_LuoFuli)
- Published: 2026-05-30 18:41
- AIHOT score: 63
- AIHOT link: https://aihot.virxact.com/items/cmps8jgo607qzslljhmkhy9s5
- Original link: https://x.com/_LuoFuli/status/2060672928367497480

## AI Summary

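The MiMo-V2.5 model series (MiMo-V2.5 and MiMo-V2.5-Pro) uses a Hybrid Sliding Window Attention (Hybrid SWA) architecture that compresses KVCache storage to roughly 1/7 that of full attention. To turn this architectural advantage into real gains, the team redesigned KVCache management, tiered caching, and the prefix-cache tree, and optimized SWA KVCache handling, scheduling, and the Prefill/Decode pipeline. Validated on real production traffic, these optimizations raise effective KVCache capacity by nearly 5x, with server-side cache hit rates reaching 93%-95% under mainstream frameworks. Combined with MoE configuration tuning and multimodal inference optimizations, they improve long-context inference efficiency and are the foundation of the recent API price cuts.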

## Body

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions

Read the full technical blog: https://mimo.xiaomi.com/blog/mimo-v2-5-inference

The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.
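To make the storage claim concrete, here is a minimal back-of-the-envelope sketch of why mixing sliding-window layers with full-attention layers shrinks the KV cache. It is not MiMo's accounting: the layer counts, window size, and context length are hypothetical values chosen for illustration, and the function names are invented here. It also assumes every layer stores the same number of bytes per cached token, so only token counts matter.

```python
from typing import Optional


def kv_tokens_per_layer(context_len: int, window: Optional[int]) -> int:
    """Tokens whose K/V one layer must retain; window=None means full attention."""
    if window is None:
        return context_len
    # A sliding-window layer never needs more than `window` past tokens.
    return min(context_len, window)


def hybrid_kv_ratio(context_len: int, n_full_layers: int,
                    n_swa_layers: int, window: int) -> float:
    """KV footprint of a hybrid stack relative to an all-full-attention stack.

    Assumes identical per-token K/V size in every layer (an assumption,
    not something stated in the post).
    """
    all_full = (n_full_layers + n_swa_layers) * context_len
    hybrid = (n_full_layers * kv_tokens_per_layer(context_len, None)
              + n_swa_layers * kv_tokens_per_layer(context_len, window))
    return hybrid / all_full


if __name__ == "__main__":
    # Hypothetical configuration, NOT MiMo-V2.5's published layout.
    ratio = hybrid_kv_ratio(context_len=131072, n_full_layers=6,
                            n_swa_layers=36, window=128)
    print(f"hybrid / full KV footprint: {ratio:.4f}")
```

With these made-up numbers (one full-attention layer for every six sliding-window layers) the ratio at long context comes out close to 1/7, because the full-attention layers dominate the footprint once the context is much longer than the window. For contexts shorter than the window the ratio is 1, so the saving only appears in long-context serving.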

Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%-95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.
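For readers unfamiliar with how a server-side cache hit rate is measured, the following is a minimal sketch of a block-level prefix-cache tree. It is an illustration under simplifying assumptions, not the redesigned structure the post refers to: the class names, the block size, and the token-based definition of hit rate are all hypothetical, there is no eviction or tiering, and it ignores the SWA-specific complication that sliding-window layers retain only the most recent tokens.

```python
from typing import Dict, List, Tuple


class PrefixCacheNode:
    """One cached block of tokens; children extend the prefix by one block."""

    def __init__(self) -> None:
        self.children: Dict[Tuple[int, ...], "PrefixCacheNode"] = {}


class PrefixCache:
    """Toy prefix tree keyed on fixed-size token blocks (hypothetical design)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.root = PrefixCacheNode()
        self.hit_tokens = 0
        self.total_tokens = 0

    def _blocks(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # Only complete blocks are cacheable; a trailing partial block is skipped.
        bs = self.block_size
        return [tuple(tokens[i * bs:(i + 1) * bs]) for i in range(len(tokens) // bs)]

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for block in self._blocks(tokens):
            child = node.children.get(block)
            if child is None:
                # Miss: a fresh node has no children, so every later block misses too.
                child = PrefixCacheNode()
                node.children[block] = child
            else:
                matched += self.block_size
            node = child
        self.hit_tokens += matched
        self.total_tokens += len(tokens)
        return matched

    def hit_rate(self) -> float:
        """Fraction of all prompt tokens served from cache so far."""
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


if __name__ == "__main__":
    cache = PrefixCache(block_size=4)
    system_prompt = list(range(8))  # shared prefix, e.g. an agent's system prompt
    cache.lookup_and_insert(system_prompt + [100, 101, 102, 103])
    cache.lookup_and_insert(system_prompt + [200, 201, 202, 203])
    print(f"token-level hit rate: {cache.hit_rate():.2f}")
```

In agent-style workloads each request typically resends a long shared history plus a short new suffix, which is why token-level hit rates can be high when the cache is large enough to keep those prefixes resident.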
