FastKernels: A Production-Oriented Benchmark for GPU Kernel Generation

2026-05-22 08:00

AI Summary

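Existing benchmarks for AI agents that generate GPU kernels are badly out of step with production inference frameworks. They evaluate kernels only on a single GPU with synthetic inputs, ignore the actual compilation stack, and reward reproducing known optimizations rather than discovering new ones. To address this, we propose FastKernels. It is both a kernel benchmark covering 46 representative architectures across 8 categories (its kernels cover 96.2% of HuggingFace Transformers architectures) and a minimal, production-grade inference framework that performs on par with mature systems such as vLLM and SGLang. Experiments show that the strongest kernel-generation agent achieves only a 0.94× aggregate speedup on FastKernels, confirming that the mismatch between benchmarks and production is a key bottleneck.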

Original abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
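The two headline figures in the abstract are simple ratios, and a short sketch may help in reading them. The coverage check below uses the 409/425 counts quoted above. The aggregation function is only an assumption: the abstract does not say how per-task speedups are combined, so a geometric mean (a common choice for ratios) stands in here, and the timing values are hypothetical placeholders, not results from the paper.

```python
import math


def coverage_percent(covered: int, total: int) -> float:
    """Share of architectures whose kernels are covered, as a percentage."""
    return 100.0 * covered / total


def geometric_mean_speedup(baseline_ms, candidate_ms):
    """Aggregate per-task speedups (baseline time / candidate time).

    ASSUMPTION: the abstract does not state the aggregation method; a
    geometric mean is used here purely for illustration.
    """
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Counts quoted in the abstract: 409 of 425 architectures.
print(f"coverage: {coverage_percent(409, 425):.1f}%")

# Hypothetical timings in milliseconds (illustrative only).
baseline = [1.0, 2.0, 4.0]   # production baseline kernels
candidate = [2.0, 2.0, 2.0]  # agent-generated kernels
print(f"aggregate speedup: {geometric_mean_speedup(baseline, candidate):.2f}x")
```

Under this reading, any aggregate below 1× means the generated kernels are slower than the production baseline overall, so the reported 0.94×, 0.78×, and 0.53× are all net slowdowns.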

HuggingFace Daily Papers (community trending papers)


Read the original: arxiv.org
