PDOOM ALERT 🚨: ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:
🟠 Prefill extend (cache write) - ingests new context/files, writes fresh KV tokens
🟠 Cache read - reuses existing KV cache from prior turns
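
A rough sketch of how one turn's prompt divides between the two prefill ops, assuming the cache is reused by exact prefix match on token IDs. The function name `split_prefill` and the toy token lists are made up for illustration and are not from any specific serving stack.

```python
def split_prefill(cached_tokens, prompt_tokens):
    """Return (cache_read, prefill_extend) token counts for one turn.

    Assumption: KV entries are reusable only for the longest shared
    prefix between what is already cached and the new prompt.
    """
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    # Tokens past the shared prefix need fresh KV entries (cache write).
    return shared, len(prompt_tokens) - shared


# Hypothetical token IDs: turn 2 repeats turn 1 and appends new context.
turn1 = [1, 2, 3, 4]
turn2 = [1, 2, 3, 4, 5, 6, 7]
cache_read, prefill_extend = split_prefill(turn1, turn2)
print(cache_read, prefill_extend)  # 4 3
```

Under this assumption, anything that changes early in the prompt pushes every later token into the prefill-extend bucket, even if those tokens were seen before.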