通过将 speculative decode 卸载至两片 2GB SRAM/chip 的 Corsairs 芯片,在标准 GPU 运行 gpt-oss-120b 时实现 10 倍延迟降低与超 1400 tokens/秒 的吞吐,额外硬件成本极低,性价比惊人。
This is the best blog post on LLM inference I've seen this year.
They achieved 10x latency and >;1400 tokens/sec by moving speculative decode onto two 2GB SRAM/chip Corsairs, a small cost on top of a standard GPU setup on gpt-oss-120b.
This performance at this price is insane.