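A new generation of hybrid-attention models compresses the KV cache, which makes a Prefill-as-a-Service architecture feasible. The approach offloads the compute-heavy prefill stage to a remote cluster and sends back only the lightweight KV cache for local decoding, while short requests are handled locally. Combined with smart routing and bandwidth-aware scheduling, the cache can be transferred efficiently over commodity Ethernet. Measurements on a 1T-parameter model show that with 50% of requests processed remotely, cross-cluster traffic is only 13 Gbps and throughput rises 54%, breaking the bottleneck that confined long-context AI to a single datacenter.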
Big claim in this paper: "Prefill-as-a-Service".
Prefill, the heaviest part of inference, may finally be portable.
Long-context AI is no longer trapped inside a single datacenter.
It shows how to run LLM prefill on remote clusters by shipping a much smaller saved prompt state (the KV cache). Long-prompt work gets done on remote machines, which send back only the smaller saved state needed to answer.
The breakthrough is not sending everything farther, but sending the right requests farther.
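To make "sending the right requests farther" concrete, here is a minimal routing sketch. It is not the paper's scheduler: the rule (a prompt-length threshold plus a transfer-time budget) and every name and number in it (`min_remote_tokens`, `bytes_per_token`, `link_bytes_per_s`, `max_transfer_s`) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int


def kv_cache_bytes(prompt_tokens: int, bytes_per_token: int) -> int:
    # Hypothetical size model: the cache grows linearly with prompt length.
    return prompt_tokens * bytes_per_token


def route(
    req: Request,
    *,
    min_remote_tokens: int,   # assumed cutoff below which prefill stays local
    bytes_per_token: int,     # assumed KV cache footprint per prompt token
    link_bytes_per_s: float,  # assumed usable cross-cluster bandwidth
    max_transfer_s: float,    # assumed budget for shipping the cache back
) -> str:
    """Return "remote" or "local" for where prefill should run."""
    # Short prompts are cheap to prefill, so the round trip is not worth it.
    if req.prompt_tokens < min_remote_tokens:
        return "local"
    # Long prompts go remote only if the cache fits the transfer budget.
    size = kv_cache_bytes(req.prompt_tokens, bytes_per_token)
    transfer_s = size / link_bytes_per_s
    return "remote" if transfer_s <= max_transfer_s else "local"


# Illustrative settings only: a 10 Gbps link is 1.25e9 bytes per second.
settings = dict(
    min_remote_tokens=8_000,
    bytes_per_token=10_000,
    link_bytes_per_s=1.25e9,
    max_transfer_s=1.0,
)

for tokens in (500, 50_000, 500_000):
    print(tokens, route(Request(tokens), **settings))
```

With these made-up settings, the short prompt stays local, the mid-length prompt goes remote, and the very long prompt stays local because its cache would exceed the transfer budget. The smaller the cache per token, the more requests clear that last check, which is why cache compression matters for this design.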
---
When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering.
That first step is called prefill, and it is brutally compute-heavy.