Rohan Paul@rohanpaul_ai

2026-04-19 16:20 · 74 days ago

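AI summary

A new generation of hybrid-attention models compresses the KV cache, which makes a Prefill-as-a-Service architecture possible. The approach offloads the compute-heavy prefill stage to a remote cluster and returns only the lightweight KV cache for local decoding, while short requests are handled locally. Combined with smart routing and bandwidth-aware scheduling, the cache can be transferred efficiently over ordinary Ethernet. Measurements on a 1T-parameter model show that with 50% of requests processed remotely, cross-cluster traffic is only 13 Gbps and throughput rises by 54%, breaking the bottleneck that confined long-context AI to a single datacenter.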

Big claim in this paper. "Prefill-as-a-Service"

Prefill, the heaviest part of inference, may finally be portable.

Long-context AI is no longer trapped inside a single datacenter.

The paper shows how to run LLM prefill on remote clusters by shipping a much smaller saved prompt state. Long-prompt work can be done on remote machines, which send back only the compact state needed to answer.

The breakthrough is not sending everything farther, but sending the right requests farther.

---

When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering.

That first step is called prefill, and it is brutally compute-heavy.

The second step is decode, where the model generates tokens one by one, and that part is more about memory bandwidth than raw compute.
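A rough way to see why the two phases stress different hardware is to compare FLOPs per byte of weights read. This is only a back-of-the-envelope sketch: it assumes a dense model at about 2 FLOPs per parameter per token, 2-byte weights, and ignores attention and KV cache traffic. The function name and the 32,768-token prompt are made up for illustration, not taken from the paper.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, and every parameter is
    read from memory once per pass. Attention cost is ignored.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


# Prefill pushes the whole prompt through the weights in one pass.
prefill = arithmetic_intensity(tokens_per_pass=32_768)
# Decode (batch size 1) pushes a single new token per pass.
decode = arithmetic_intensity(tokens_per_pass=1)

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.0f} FLOPs/byte")
```

Under these assumptions prefill reuses each weight thousands of times per read, so it is limited by compute, while decode reuses it once and ends up waiting on memory.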

Until now, those two steps usually had to stay close together inside the same tightly connected cluster, because prefill creates a huge blob of temporary memory called the KV cache, and that blob has to be moved quickly to the decode machine.

That is the bottleneck.

What changed is model design.

Newer hybrid-attention models produce a much smaller KV cache than older dense-attention models, so shipping that state across ordinary datacenter links starts to become practical instead of absurd.
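To get a feel for the "much smaller" part, here is a small sizing sketch using the standard per-token KV cache formula (keys plus values, per layer, per KV head). Every number in it is hypothetical: the layer counts, head sizes, prompt length, and link speed are not from the paper. It also assumes that a hybrid model keeps a per-token KV cache in only 1 of every 4 layers and treats the fixed-size state of the other layers as negligible.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of a per-token KV cache. The leading 2 covers keys and values."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Ideal transfer time over a link, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


TOKENS = 128_000   # hypothetical long prompt
LINK_GBPS = 100    # hypothetical Ethernet link

# Hypothetical dense model: all 64 layers keep a full KV cache.
dense = kv_cache_bytes(TOKENS, layers=64, kv_heads=8, head_dim=128)
# Hypothetical hybrid model: only 16 of 64 layers keep a full KV cache.
hybrid = kv_cache_bytes(TOKENS, layers=16, kv_heads=8, head_dim=128)

for name, size in (("dense", dense), ("hybrid", hybrid)):
    secs = transfer_seconds(size, LINK_GBPS)
    print(f"{name}: {size / 1e9:.1f} GB, {secs:.2f} s at {LINK_GBPS} Gbps")
```

The absolute numbers mean little; the point is that the cache shrinks in proportion to the share of layers that still store per-token keys and values, and transfer time shrinks with it.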

The paper's idea is a Prefill-as-a-Service setup that sends only long, uncached prompts to a remote prefill cluster, then ships the resulting KV cache back over ordinary Ethernet, while short requests stay local.
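The routing rule described above can be sketched in a few lines. This is a toy illustration, not the paper's actual policy: the `Request` fields, the 8,192-token threshold, and the route labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    cached_prefix_tokens: int = 0  # tokens already covered by a local prefix cache


LONG_PROMPT_THRESHOLD = 8_192  # hypothetical cutoff, not from the paper


def route(req: Request, threshold: int = LONG_PROMPT_THRESHOLD) -> str:
    """Send only long, uncached prompts to the remote prefill cluster."""
    uncached = max(req.prompt_tokens - req.cached_prefix_tokens, 0)
    return "remote_prefill" if uncached >= threshold else "local"


print(route(Request(prompt_tokens=100_000)))                               # long, uncached
print(route(Request(prompt_tokens=100_000, cached_prefix_tokens=99_000)))  # long, mostly cached
print(route(Request(prompt_tokens=500)))                                   # short
```

Only the first request goes remote. The other two stay local because the work left to do is small, either thanks to the cache or because the prompt is short to begin with.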

This works mainly because newer hybrid-attention models create far less KV cache than older dense models. On top of that, the system adds smart routing, bandwidth-aware scheduling, and cache-aware placement so the network does not clog up.
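One simple way to picture "bandwidth-aware" scheduling is an admission check on the cross-cluster link: a request goes remote only if its KV cache can be sent back without building up too much backlog. The sketch below is a guess at the general shape, not the paper's scheduler. The `LinkBudget` class, its parameters, and the numbers are all hypothetical.

```python
class LinkBudget:
    """Tracks KV cache bytes queued for the cross-cluster link (hypothetical)."""

    def __init__(self, link_gbps: float, max_queue_seconds: float):
        self.bytes_per_second = link_gbps * 1e9 / 8
        self.max_queue_seconds = max_queue_seconds
        self.pending_bytes = 0

    def try_reserve(self, kv_bytes: int) -> bool:
        """Admit a remote prefill only if the backlog stays under the limit."""
        drain_seconds = (self.pending_bytes + kv_bytes) / self.bytes_per_second
        if drain_seconds > self.max_queue_seconds:
            return False  # caller falls back to local prefill
        self.pending_bytes += kv_bytes
        return True

    def release(self, kv_bytes: int) -> None:
        """Call once the KV cache has finished transferring."""
        self.pending_bytes = max(self.pending_bytes - kv_bytes, 0)


budget = LinkBudget(link_gbps=10, max_queue_seconds=2.0)
first = budget.try_reserve(2_000_000_000)   # fits within the 2 s backlog limit
second = budget.try_reserve(1_000_000_000)  # would push the backlog past 2 s
budget.release(2_000_000_000)
third = budget.try_reserve(1_000_000_000)   # fits again after the first transfer
print(first, second, third)
```

A rejected request is not dropped in this sketch. It simply runs prefill locally, which keeps the link from becoming the new bottleneck.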

