Fuli Luo@_LuoFuli

2026-05-27 20:50

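AI Summary

This price adjustment stems from structural cost advantages in the model architecture and the inference framework. On the inference-framework side, hierarchical KV cache optimization for SWA increases cache capacity by 5x, equivalent to an 80% reduction in caching cost; combined with cache-read overlap among the multiple Full Attention modules in the hybrid model, actual costs fall further. On the architecture side, MiMo-V2.5-Pro uses an extreme 1:7 Full:SWA sparsity ratio, and its very low prefill compute keeps raw inference cost well below the industry average. As a result, the Input (Cache Hit) price falls by up to 99%, and the Input (Cache Miss) and Output prices fall by 60%-80%. The adjustment passes efficiency gains directly to developers rather than operating at a loss.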

Behind the MiMo API Price Reduction: The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is that our inference framework now supports hierarchical KV cache optimization for SWA (sliding window attention). Tests on the production inference engine show that this optimization increases cached token capacity by 5x, equivalent to an 80% reduction in caching costs. Combined with Cache Read Overlap among the multiple Full Attention modules in the Hybrid model, actual costs are reduced further.
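The step from "5x capacity" to "80% lower caching cost" can be checked with a few lines of Python. This is only a sketch: it assumes per-token caching cost is inversely proportional to the number of tokens a fixed cache budget can hold, which is a simplification and not something the post spells out. The function name is hypothetical.

```python
def cache_cost_reduction(capacity_multiplier: float) -> float:
    """Fractional drop in per-token caching cost when the same cache
    budget holds `capacity_multiplier` times as many tokens.

    ASSUMPTION: cost per cached token scales as 1 / capacity.
    """
    if capacity_multiplier <= 0:
        raise ValueError("capacity_multiplier must be positive")
    return 1.0 - 1.0 / capacity_multiplier


# 5x capacity is the figure quoted in the post.
reduction = cache_cost_reduction(5.0)
print(f"Caching cost reduction: {reduction:.0%}")
```

Under that assumption the 5x figure maps to the quoted 80%; the additional savings attributed to Cache Read Overlap are not modeled here.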

Prices for Input (Cache Miss) and Output are also reduced by 60%-80%. This mainly comes from the extreme 1:7 Full:SWA sparsity ratio of the model architecture (the prefill compute of the 70-layer MiMo-V2.5-Pro roughly equals that of a 10-layer GQA model). This kept our original inference costs well below the industry average, which naturally left a 2x-3x profit margin in pricing. This price adjustment simply reflects our decision to pass these structural cost efficiencies directly to developers.
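To see why a 70-layer model with a 1:7 Full:SWA ratio can land near "10 full-attention layers" of prefill work, the sketch below counts causal query-key pairs per layer and expresses the whole stack in full-attention-layer equivalents. It is a rough illustration, not the author's accounting: it counts only attention score pairs (projections and feed-forward compute are ignored), and the prompt length and window size are assumed values that do not appear in the post. All function names are hypothetical.

```python
def attended_pairs_full(n: int) -> int:
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def attended_pairs_swa(n: int, window: int) -> int:
    """Query-key pairs scored by causal sliding window attention."""
    w = min(window, n)
    # The first w tokens see 1..w keys; every later token sees exactly w keys.
    return w * (w + 1) // 2 + (n - w) * w


def equivalent_full_layers(total_layers: int, n: int, window: int,
                           full_parts: int = 1, swa_parts: int = 7) -> float:
    """Attention work of a hybrid stack, in units of one full-attention layer."""
    # Layer split implied by the Full:SWA ratio; the real layout may differ.
    n_full = round(total_layers * full_parts / (full_parts + swa_parts))
    n_swa = total_layers - n_full
    swa_fraction = attended_pairs_swa(n, window) / attended_pairs_full(n)
    return n_full + n_swa * swa_fraction


# ASSUMED values: a 64K-token prompt and a 128-token window (not from the post).
equiv = equivalent_full_layers(total_layers=70, n=65536, window=128)
print(f"{equiv:.2f} full-attention-equivalent layers")
```

With these assumed inputs the result falls between 9 and 10, in line with the "roughly 10-layer" figure. The SWA term shrinks as prompts get longer, so the estimate approaches the count of Full Attention layers alone.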

At these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even. We previously advised LLM companies not to "blindly cut prices" precisely because very few model architectures and inference optimizations can keep an API from running at a loss. If more architectures that save compute and KV cache emerge, along with better inference infrastructure to drive down API costs, this will form an excellent virtuous cycle in the industry.

More crucially, affordable, high-performance model APIs will drive real, sustained, and at-scale inference demand. This upstream demand pulls forward the development of the entire AI infrastructure chain (chips, servers, optical transceivers, PCBs, liquid cooling, power, energy storage, and data centers), serving as a strategic fulcrum for a systemic revaluation of AI hardware. In the long run, this injects more affordable and accessible compute into both training and inference pipelines, accelerating the parallel evolution of global AGI across multiple regions and technical routes.
