# Kog Achieves Ultra-Fast LLM Inference on Standard GPUs

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-05-29 07:11
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmpq41x6p01l7slnoweajccbp
- Original link: https://x.com/rohanpaul_ai/status/2060137022845862008

## AI Summary

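Kog AI has achieved striking inference speeds on standard datacenter GPUs: 3,000 tokens/s on 8× AMD MI300X and 2,100 tokens/s on 8× NVIDIA H200 (FP16, no speculative decoding), against typical speeds of 100-300 tokens/s. The core idea is to treat LLM decoding as a memory-streaming problem: run the entire token-generation loop inside a single persistent GPU program, optimize the memory-access topology to cut cross-chip latency, and use delayed tensor parallelism to sharply reduce overhead. Kog is opening a technical preview today with a 2B coding model and plans to support large frontier MoE models later.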

## Body

Some truly massive inference numbers here.

@Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model.

For comparison, typical GPU decoding speed for 2B to 8B models on high-end GPUs is around 100 to 300 tokens/s.

They achieved it by treating LLM decoding as a memory-streaming problem: keep the whole token-generation loop inside one persistent GPU program, so kernel launches, CPU scheduling, intermediate memory writes, and sampling interruptions mostly disappear.
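To see why decoding reads as a memory-streaming problem, here is a rough back-of-envelope sketch. It is not Kog's method or measurement. It assumes every weight is read once per generated token, that the weights are sharded evenly across the GPUs, and that KV-cache traffic and inter-GPU communication are free. The 5.3 TB/s figure is the vendor peak HBM bandwidth for the MI300X, which I am assuming rather than taking from the thread. The launch count and per-launch overhead are purely illustrative numbers.

```python
# Back-of-envelope model of memory-bound decoding. All constants are assumptions.

MI300X_PEAK_BW = 5.3e12   # bytes/s, vendor peak HBM bandwidth (assumed, not from the thread)
PARAMS = 2e9              # 2B-parameter model, as stated in the thread
BYTES_PER_PARAM = 2       # FP16


def decode_ceiling_tokens_per_s(params, bytes_per_param, bandwidth, num_gpus=1):
    """Upper bound on tokens/s if each token requires streaming all weights once.

    With even sharding, each GPU streams 1/num_gpus of the weights in parallel,
    so the aggregate bandwidth is num_gpus * bandwidth.
    """
    bytes_per_token = params * bytes_per_param
    return num_gpus * bandwidth / bytes_per_token


def tokens_per_s_with_overhead(stream_s, launches_per_token, launch_overhead_s):
    """Throughput when fixed per-launch overhead is added to the streaming time."""
    return 1.0 / (stream_s + launches_per_token * launch_overhead_s)


ceiling = decode_ceiling_tokens_per_s(PARAMS, BYTES_PER_PARAM, MI300X_PEAK_BW, num_gpus=8)
stream_s = 1.0 / ceiling

# Hypothetical: 200 kernel launches per token at 5 microseconds each.
with_launches = tokens_per_s_with_overhead(stream_s, 200, 5e-6)
# A persistent kernel is modeled here as zero launches per token.
persistent = tokens_per_s_with_overhead(stream_s, 0, 5e-6)

print(f"bandwidth ceiling:        {ceiling:,.0f} tokens/s")
print(f"with per-layer launches:  {with_launches:,.0f} tokens/s")
print(f"persistent (no launches): {persistent:,.0f} tokens/s")
```

Under these assumptions the streaming time per token is so short that fixed per-launch overhead dominates, which is the intuition behind moving the whole loop into one persistent program. The reported 3,000 tokens/s sits below this idealized ceiling, as expected for a bound that ignores communication and the KV cache.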

Then they cut synchronization waste by making each compute unit wait only for the exact data it needs, while mapping memory access to the MI300X's chiplet topology so the GPU stops paying avoidable cross-die latency.
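The synchronization point can be illustrated with a toy timing model. This is a sketch of the general idea, not Kog's implementation: the producer finish times and the dependency lists below are made up, and the functions `barrier_schedule` and `fine_grained_schedule` are hypothetical names. Each consumer is assumed to depend on at least one producer.

```python
# Toy model: when can each consumer start, given when each producer finishes?

def barrier_schedule(producer_done, deps):
    """Global barrier: every consumer waits for the slowest producer."""
    release = max(producer_done)
    return [release for _ in deps]


def fine_grained_schedule(producer_done, deps):
    """Each consumer waits only for the producers it actually reads from."""
    return [max(producer_done[p] for p in consumer_deps) for consumer_deps in deps]


# Made-up finish times for four producers (arbitrary time units).
producer_done = [1.0, 4.0, 2.0, 3.0]
# Made-up dependencies: consumer i reads from the listed producers.
deps = [[0], [0, 2], [1], [2, 3]]

barrier_starts = barrier_schedule(producer_done, deps)
fine_starts = fine_grained_schedule(producer_done, deps)

print("barrier start times:     ", barrier_starts)
print("fine-grained start times:", fine_starts)
```

In this toy case only the consumer that reads from the slowest producer waits the full time; the others start earlier, which is the waste a global barrier would otherwise impose.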

Finally, their model architecture delays tensor-parallel communication so all-reduce work happens in the background instead of blocking every layer, which is why the runtime, GPU code, and model design all have to be co-designed.
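A simple timing model shows what delaying the all-reduce can buy. This is an illustrative sketch under my own assumptions, not Kog's architecture: `pipeline_time` is a hypothetical helper, the per-layer compute and communication times are made up, and a single communication link is assumed to handle one all-reduce at a time. With `delay=0`, each layer needs the previous layer's reduced output before it can start (the usual blocking pattern). With `delay=1`, a layer only needs the reduced output from two layers back, so the all-reduce can overlap with the next layer's compute.

```python
# Toy timing model for blocking vs. delayed tensor-parallel all-reduce.

def pipeline_time(compute, comm, delay):
    """Total time for a stack of layers.

    compute[i]: compute time of layer i
    comm[i]:    all-reduce time for layer i's output
    delay:      layer i consumes the reduced output of layer i - 1 - delay
    """
    compute_end = []
    reduce_end = []
    link_free = 0.0  # time at which the communication link becomes idle
    for i, (c, r) in enumerate(zip(compute, comm)):
        start = compute_end[i - 1] if i > 0 else 0.0
        needed = i - 1 - delay  # index of the all-reduce this layer must wait for
        if needed >= 0:
            start = max(start, reduce_end[needed])
        compute_end.append(start + c)
        reduce_start = max(compute_end[i], link_free)
        reduce_end.append(reduce_start + r)
        link_free = reduce_end[i]
    return max(compute_end[-1], reduce_end[-1])


# Made-up numbers: four layers, 1.0 units of compute and 0.5 units of all-reduce each.
compute = [1.0, 1.0, 1.0, 1.0]
comm = [0.5, 0.5, 0.5, 0.5]

blocking = pipeline_time(compute, comm, delay=0)
delayed = pipeline_time(compute, comm, delay=1)

print("blocking all-reduce:", blocking)
print("delayed all-reduce: ", delayed)
```

The overlap only works if the model itself tolerates consuming a slightly older reduced value, which is one way to read the thread's claim that the model design has to be co-designed with the runtime.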

### Quoted Tweet

> Kog: 🚀 Launch today: Kog generates 3,000+ output tokens/s per single request, on standard datacenter GPUs. We are bringing real-time LLM inference to hardware that ...
