Real-time LLM inference on standard GPUs: 3k tokens/s per request

2026-05-29 22:37 · 34 days ago · NicoConstant

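AI summary

This work demonstrates that real-time large language model inference is possible on standard GPU hardware. The headline figure is a generation speed of 3,000 tokens per second for a single request (3k tokens/s per request). The result suggests that, for particular scenarios or model configurations, high-speed model output is achievable on conventional compute hardware rather than dedicated clusters, which is a useful reference point for lowering the barrier to entry and the cost of using large language models.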

Original text · untranslated

TL;DR: we show that LLM inference on GPUs can be extremely fast, reaching the speed regime of dedicated inference hardware when the whole software stack is optimized through architecture/engine/kernel co-design. Test the speed in our live coding playground: playground.kog.ai.

This post explains why optimizing for single-request LLM decoding speed is important for AI agents; why it's primarily a memory-bandwidth maximization problem, not a FLOPS one; why standard datacenter GPU hardware has a much higher decoding-speed ceiling than current inference stacks expose due to software bottlenecks; and how that ceiling can be reached (even on large MoE models) by co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.
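To make the memory-bandwidth argument concrete, here is a minimal back-of-the-envelope sketch. It assumes that at batch size 1 every active weight is read from GPU memory once per generated token, and it ignores KV-cache reads, activations, and kernel launch overhead. The function name and the numbers are illustrative assumptions, not the authors' model or hardware configuration.

```python
def decode_ceiling_tokens_per_s(active_params: float,
                                bytes_per_param: float,
                                mem_bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on single-request decode speed.

    Assumes each generated token requires streaming all active weights
    from memory exactly once (a simplification: KV cache and activations
    are ignored).
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


# Hypothetical round numbers chosen only for illustration:
# 1e9 active parameters, 2 bytes per parameter, 1e12 bytes/s of bandwidth.
ceiling = decode_ceiling_tokens_per_s(
    active_params=1e9,
    bytes_per_param=2,
    mem_bandwidth_bytes_per_s=1e12,
)
print(f"{ceiling:.0f} tokens/s")  # prints "500 tokens/s"
```

Under these assumptions the ceiling scales with bandwidth and inversely with the bytes read per token, and compute throughput does not appear in the bound at all. That is the sense in which single-request decoding is a bandwidth problem rather than a FLOPS one.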

Our public tech preview is about proving that extremely fast single-request decoding is possible on the standard datacenter GPUs that enterprises, including AI labs and sovereign-AI buyers, already own. The limiting factor has been that existing inference software stacks are not optimized for this type of workload. Opening the GPU path could deliver that speed without the lock-in of proprietary silicon.

You can test the speed of our 2B coding model today. It's small and not a frontier model (we've been focused on speed rather than scale), though still quite capable when fine-tuned for specific software engineering tasks.

What autonomous agents change: single-request decode speed is now the metric that matters

Inference benchmarks typically conflate three quantities. Aggregate throughput (total tokens generated per second across all users) measures server utilization and rewards large batches. Time to first token measures prefill latency. Decode speed per request measures token generation speed and defines how long one user waits before receiving the full response. That last one governs every long serial interaction, and it's what AI agents are bottlenecked on.
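The three quantities can be separated with a small sketch that computes each one from per-token arrival timestamps. The `RequestTrace` type and the function names are hypothetical, introduced here only for illustration, and the timings are made-up values.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    start: float              # time the request was sent, in seconds
    token_times: list[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Prefill latency: delay between sending the request and the first token."""
    return trace.token_times[0] - trace.start


def decode_speed(trace: RequestTrace) -> float:
    """Tokens per second for one request, measured after the first token."""
    decoded = len(trace.token_times) - 1
    return decoded / (trace.token_times[-1] - trace.token_times[0])


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Total tokens per second across all requests in the observed window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    window = max(t.token_times[-1] for t in traces) - min(t.start for t in traces)
    return total_tokens / window


# Two identical concurrent requests with made-up timings.
tokens = [0.5, 0.75, 1.0, 1.25, 1.5]
traces = [RequestTrace(0.0, tokens), RequestTrace(0.0, tokens)]

print(time_to_first_token(traces[0]))  # 0.5 seconds
print(decode_speed(traces[0]))         # 4.0 tokens/s for this one request
print(aggregate_throughput(traces))    # about 6.67 tokens/s across both
```

In this toy example, adding a second concurrent request raises aggregate throughput while each user still sees the same 4 tokens/s, which is why a batch-oriented benchmark can look good without making any single agent loop faster.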


Read the original · blog.kog.ai
