# The Price of Anarchy in Disaggregated Inference

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-11 08:00
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmqi29vj504vkslf0xbz5v97q
- Original link: https://arxiv.org/abs/2606.17081

## AI Summary

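Disaggregated inference architectures assign the prefill and decode phases to separate GPU pools, which then act as competing "agents" sharing a single hardware budget. The study presents what its authors describe as the first game-theoretic model of this architecture, using NVIDIA Dynamo as a case study and decomposing it into three coupled games. In validation on a 3-node B200 cluster with Nemotron-4-340B and Llama-3.1-70B, both models show the same three-regime PoA-hat structure. Adaptive routing substantially lowers PoA-hat in the saturated regime: on the 70B 1P/5D topology, PoA-hat falls from 66.4 to 21.5 (3.1x) at a 13% throughput cost; on the 70B 1P/2D topology, PoA-hat falls 2.2x and TTFT P99 falls 7.6x.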
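
For context, the classical Price of Anarchy of a cost-minimization game is the ratio of the worst equilibrium cost to the optimal cost. The notation below is an assumption chosen for illustration: $S$ is the set of strategy profiles, $\mathrm{NE} \subseteq S$ the set of Nash equilibria, and $C$ the social cost. The paper's empirical estimator PoA-hat is defined in its Section 6.4 and is not reproduced here.

$$
\mathrm{PoA} = \frac{\max_{s \in \mathrm{NE}} C(s)}{\min_{s \in S} C(s)} \ge 1
$$

The reported reduction factor on the 70B 1P/5D topology follows directly from the two quoted values:

$$
\frac{66.4}{21.5} \approx 3.09 \approx 3.1
$$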

## Main Text

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D topology, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
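
The adaptive controller described above can be illustrated with a minimal Python sketch. It is not the paper's controller and does not use any Dynamo API. The `Worker` fields, the utilization-based saturation signal, the hysteresis thresholds, and the cache weights are all hypothetical values chosen for illustration. It only shows the general idea of favoring cache affinity below saturation and load balance above it.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    active_requests: int   # in-flight requests (hypothetical load signal)
    cache_overlap: float   # fraction of the prompt's KV blocks already cached, in [0, 1]


class AdaptiveRouter:
    """Hypothetical router: cache affinity below saturation, load balance above it."""

    def __init__(self, enter_saturation=0.9, exit_saturation=0.7, capacity=32):
        # Two thresholds give hysteresis so the mode does not flap near the knee.
        self.enter = enter_saturation
        self.exit = exit_saturation
        self.capacity = capacity  # assumed per-worker request capacity
        self.saturated = False

    def update(self, workers):
        """Refresh the saturation flag from mean utilization (workers must be non-empty)."""
        util = sum(w.active_requests for w in workers) / (self.capacity * len(workers))
        if not self.saturated and util >= self.enter:
            self.saturated = True
        elif self.saturated and util <= self.exit:
            self.saturated = False
        return self.saturated

    def route(self, workers):
        """Pick the worker with the lowest cost under the current regime."""
        self.update(workers)
        # Illustrative weights: cache reuse dominates when idle, load dominates when saturated.
        cache_weight = 0.1 if self.saturated else 1.0

        def cost(w):
            load = w.active_requests / self.capacity
            return load - cache_weight * w.cache_overlap

        return min(workers, key=cost)
```

Under light load the sketch routes to the worker with the warm cache even if it is slightly busier. Once mean utilization crosses the upper threshold, the same request goes to the less loaded worker. Using separate entry and exit thresholds is a design choice made here to avoid oscillation; the source does not specify how the paper's controller handles this.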