# The Technical Roots of the DeepSeek and Xiaomi MiMo Model Price Cuts

- Source: Chubby♨️ (@kimmonismus)
- Published: 2026-05-27 18:12
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmpnxhsbk016cslv4cbfexm7m
- Original link: https://x.com/kimmonismus/status/2059578380329394292

## AI Summary

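DeepSeek V4-Pro announced a permanent 75% price cut, and Xiaomi MiMo V2.5 cut prices by up to 99%. The core of these cuts is a structural cost reduction driven by architectural innovation. DeepSeek V4 uses a hybrid attention architecture to sharply compress the KV cache for long-context inference, bringing it to just 10% of V3.2's at 1 million tokens, with per-token inference FLOPs falling to 27%. The Xiaomi MiMo team implemented sliding window attention via SGLang HiCache, reducing KV cache data transfer across memory tiers to roughly 1/7. These architectural optimizations bring V4-Pro pricing down to $0.87 per million output tokens and MiMo V2.5-Pro to roughly $3 per million; both are frontier-class models with million-token context windows. The price cuts stem from a real decline in inference and caching costs.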

## Body

DeepSeek just made its 75% price cut on V4-Pro permanent. Xiaomi's MiMo slashed V2.5 pricing by up to 99%, effective today. Most coverage frames this as a price war. The more interesting part is the engineering that makes these numbers sustainable.

DeepSeek's V4 paper describes a *hybrid attention architecture* that attacks the core bottleneck of long-context inference: the KV cache. Traditional transformers store key-value pairs for every token in the context. At 1 million tokens, this cache alone can fill an entire GPU's memory. V4 introduces two interleaved attention types.
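To see why the cache becomes the bottleneck, here is a back-of-the-envelope sketch of KV cache size for a conventional transformer. The layer count, head count, head dimension, and 16-bit precision below are hypothetical values chosen for illustration; they are not the configuration of V3.2, V4, or any other specific model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for every token in the context."""
    # Factor of 2: one key vector and one value vector per token, per layer, per head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical dense-attention configuration (assumption, not a real model).
size = kv_cache_bytes(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # prints 245.8 GB
```

Under these assumed dimensions, a single 1M-token request needs far more cache than one accelerator typically holds, which is the pressure the compression schemes below are meant to relieve.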

Compressed Sparse Attention (CSA) compresses every 4 tokens into a single KV entry, then selects only the top-k most relevant compressed blocks per query. Heavily Compressed Attention (HCA) goes further, compressing 128 tokens into one entry and running dense attention over the result. The compressed sequence is short enough that dense attention stays cheap.
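The two steps described above (compress blocks of tokens, then pick the top-k blocks per query) can be sketched in a few lines. This is a toy illustration only: the post does not say how V4 builds a compressed entry or scores blocks, so the mean pooling and dot-product scoring here are assumptions, and the function names are hypothetical.

```python
import heapq


def compress(keys, block):
    """Pool every `block` consecutive key vectors into one entry.

    Mean pooling is an assumption made for this sketch.
    """
    out = []
    for start in range(0, len(keys), block):
        chunk = keys[start:start + block]
        dim = len(chunk[0])
        out.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return out


def top_k_blocks(query, compressed, k):
    """Indices of the k compressed entries with the highest dot-product score."""
    scores = [sum(q * c for q, c in zip(query, entry)) for entry in compressed]
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


# Toy 2-d keys for 16 tokens; token t has key [t, 1].
keys = [[float(t), 1.0] for t in range(16)]
csa_entries = compress(keys, block=4)                    # 16 tokens -> 4 entries
selected = top_k_blocks([1.0, 0.0], csa_entries, k=2)    # sparse: attend to 2 blocks

# Entry counts at a 1M-token context for the two block sizes named in the post.
csa_count = -(-1_000_000 // 4)     # ceiling division -> 250000
hca_count = -(-1_000_000 // 128)   # ceiling division -> 7813
```

At a block size of 128, a million-token context shrinks to a few thousand entries, which is why dense attention over the HCA sequence remains affordable.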

V4-Pro's KV cache at 1M tokens is 10% (!!) of V3.2's. Single-token inference FLOPs drop to 27% (!!). The model has 1.6 trillion total parameters but activates only 49 billion per token through Mixture-of-Experts routing: the knowledge capacity of a massive model at the compute cost of one roughly thirty times smaller.
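The "thirty times smaller" figure follows directly from the two parameter counts quoted above. A quick check, using only those numbers:

```python
total_params = 1.6e12   # total parameters, as quoted in the post
active_params = 49e9    # parameters activated per token, as quoted in the post

ratio = total_params / active_params            # about 32.7
active_fraction = active_params / total_params  # about 0.031, i.e. roughly 3%

print(f"total/active = {ratio:.1f}x, active share = {active_fraction:.1%}")
```

So each token touches roughly 3% of the model's weights. Note that this ratio describes parameters used per token; it is only a rough proxy for compute, since attention cost also depends on context length.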

MiMo's approach is different but lands in the same place. Xiaomi's team implemented Sliding Window Attention via SGLang HiCache, reducing KV cache data transfer across GPU memory, CPU memory, and SSD to roughly 1/7 (!!) of previous volume. Cacheable tokens expanded by 5x (!!). Combined with expert parallelism optimization and input length bucketing, per-token serving cost dropped enough to make permanent pricing at these levels viable.
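The reason sliding window attention shrinks the cache is that a layer only ever needs the most recent `window` tokens, so older entries can be dropped instead of moved between memory tiers. The class below is a generic, minimal sketch of that idea; it is not the SGLang HiCache API, and the class name and window size are hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps KV entries only for the most recent `window` token positions."""

    def __init__(self, window):
        # A bounded deque silently evicts the oldest entry once it is full.
        self.entries = deque(maxlen=window)

    def append(self, position, key, value):
        self.entries.append((position, key, value))

    def positions(self):
        return [p for p, _, _ in self.entries]


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")

print(cache.positions())  # prints [6, 7, 8, 9]
```

However long the sequence grows, the cache for such a layer stays at a fixed size, so there is simply less data to store and to shuttle between GPU memory, CPU memory, and SSD.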

V4-Pro now sits at $0.87 per million output tokens. MiMo V2.5-Pro sits at roughly $3/M output, with Flash variants far below that. A year ago, sub-dollar output pricing meant you were using a small distilled model with real capability tradeoffs. These are frontier-class reasoners with million-token context windows.
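To put the per-million prices in concrete terms, here is a small sketch that converts them into the cost of a given volume of output. The 250,000-token workload is a made-up example, and the MiMo figure uses the post's approximate $3 as if it were exact.

```python
def output_cost_usd(tokens, price_per_million):
    """Cost in USD of generating `tokens` output tokens at a per-million price."""
    return tokens / 1_000_000 * price_per_million


workload = 250_000  # hypothetical number of output tokens

v4_pro_cost = output_cost_usd(workload, 0.87)  # about $0.22
mimo_cost = output_cost_usd(workload, 3.00)    # about $0.75, using the approximate price

print(f"V4-Pro: ${v4_pro_cost:.2f}, MiMo V2.5-Pro: ${mimo_cost:.2f}")
```

Input-token and cache-hit pricing are not covered in the post, so this only accounts for the output side of a bill.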

Both companies can commit to permanent cuts because the reductions come from the architecture itself. When your attention mechanism performs fewer FLOPs per token and your cache occupies a fraction of the memory, the cost to serve is structurally lower. The price follows the cost curve.