Deedy@deedydas

2026-04-03 23:06·90天前

AI 摘要

通过将 speculative decode 卸载至两片 2GB SRAM/chip 的 Corsairs 芯片，在标准 GPU 运行 gpt-oss-120b 时实现 10 倍延迟降低与超 1400 tokens/秒的吞吐，额外硬件成本极低，性价比惊人。

This is the best blog post on LLM inference I've seen this year.

They achieved 10x latency and &gt；1400 tokens/sec by moving speculative decode onto two 2GB SRAM/chip Corsairs， a small cost on top of a standard GPU setup on gpt-oss-120b.

This performance at this price is insane.

开源/仓库部署/工程

在 X 查看原推导出 Markdown

Deedy@deedydas · X

导出 Markdown