# 内存主导但非带宽受限：批量1大语言模型解码在物理AI推理中的差距

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-28 08:00
- AIHOT 分数：64
- AIHOT 链接：https://aihot.virxact.com/items/cmpvbduw205bysl0z0c8oft1o
- 原文链接：https://arxiv.org/abs/2605.30571

## AI 摘要

研究表明，物理AI系统中的批量1大语言模型解码是内存主导的，但更快的内存并不带来比例性的延迟收益。通过对三款7-8B级别的GQA Transformer模型在四款NVIDIA GPU上的测量发现，例如在Qwen-2.5-7B（上下文长度2048）场景下，L4能达到其内存地板的81%，而H100仅为27%。CUDA Graphs优化在H100上将解码延迟提升1.259倍，在L4上仅为1.028倍。部署方面，常见的量化路径未能完全兑现预期的4倍权重流量削减，例如AutoAWQ+Marlin在bf16基线62.32 ms/step上优化至45.24 ms/step，而GPTQ+ExLlamaV2能达到17.36 ms/step。

## 正文

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
