SemiAnalysis @SemiAnalysis_ · X

2026-05-27 07:00

PDOOM ALERT 🚨: ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:

🟠 Prefill extend (cache write) - ingests new context/files, writes fresh KV tokens
🟠 Cache read - reuses existing KV cache from prior turns
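
A rough sketch of the split described above, in Python. It assumes a turn's prompt is a cached prefix followed by new tokens, and it applies the post's approximate 48/52 shares to a single request purely for illustration (the post reports them as an aggregate). The names `split_prefill` and `latency_breakdown` are hypothetical and do not come from any serving framework.

```python
from dataclasses import dataclass

# Approximate shares quoted in the post; treated here as fixed constants.
PREFILL_SHARE = 0.48
DECODE_SHARE = 0.52


@dataclass
class PrefillSplit:
    cache_read_tokens: int      # KV entries reused from prior turns
    prefill_extend_tokens: int  # new tokens whose KV must be computed and written


def split_prefill(prompt_tokens: int, cached_prefix_tokens: int) -> PrefillSplit:
    """Split a turn's prompt into cache-read and prefill-extend token counts.

    Assumption: the reusable KV cache covers a prefix of the prompt.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and prompt_tokens")
    return PrefillSplit(
        cache_read_tokens=cached_prefix_tokens,
        prefill_extend_tokens=prompt_tokens - cached_prefix_tokens,
    )


def latency_breakdown(e2e_seconds: float) -> dict[str, float]:
    """Apportion an end-to-end latency using the post's approximate shares."""
    return {
        "prefill": e2e_seconds * PREFILL_SHARE,
        "decode": e2e_seconds * DECODE_SHARE,
    }


if __name__ == "__main__":
    # Hypothetical turn: 10,000 prompt tokens, 8,000 already cached.
    print(split_prefill(prompt_tokens=10_000, cached_prefix_tokens=8_000))
    print(latency_breakdown(e2e_seconds=10.0))
```

In this toy model only the `prefill_extend_tokens` portion needs fresh KV computation; the `cache_read_tokens` portion is served from the existing cache.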
