Rohan Paul@rohanpaul_ai

2026-05-24 14:12

AI Summary

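A hobbyist recently ran Kimi K2.5, a large language model with 1 trillion parameters, on a single second-hand RTX 3060 12GB graphics card at about 4 tokens per second. This is possible because of the model's Mixture-of-Experts architecture: the total parameter count is huge, but only 32B parameters are activated per inference step. The key is to keep the latency-sensitive core components in GPU VRAM and store the huge expert weights in a 768GB memory pool built from second-hand Intel Optane persistent memory (PMem), with DDR4 memory serving as cache. With llama.cpp handling the hybrid scheduling, the setup offers a low-cost technical path for deploying very large models locally.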

Somebody just ran a one-trillion-parameter model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec, backed by 768GB of second-hand Intel Optane memory.

What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the most time-sensitive organs.

i.e. the bulk of the sparse expert weights live in a larger, cheaper memory tier and are pulled into the computation as needed.

This worked because Kimi K2.5 is a Mixture-of-Experts model, so it has 1T total parameters but activates only 32B per token.
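To make the sparsity concrete, here is a rough back-of-the-envelope sketch of what "32B active out of 1T" means per token. The 4 bits per weight figure is an assumption for illustration only; the post does not say which quantization was used.

```python
# Parameter counts taken from the post.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32B activated per token

# ASSUMPTION: 4 bits per weight. The post does not state the quantization.
ASSUMED_BITS_PER_WEIGHT = 4


def active_fraction(total_params: int, active_params: int) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active_params / total_params


def weight_bytes(params: int, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_weight / 8


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
per_token_gb = weight_bytes(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT) / 1e9

print(f"active fraction per token: {fraction:.1%}")
print(f"weights touched per token at {ASSUMED_BITS_PER_WEIGHT} bits: {per_token_gb:.0f} GB")
```

Only a few percent of the weights are read for any one token, which is why a slow but very large memory tier can still produce usable throughput.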

The RTX 3060's 12GB VRAM holds latency-sensitive parts like routing, attention, dense layers, and shared experts.

The huge expert weights sit in Optane PMem, configured as RAM, while 192GB DDR4 ECC acts as cache.

He is using 6 Optane PMem (DCPMM) sticks. This retired memory format was made to bridge the gap between DRAM and SSD performance. The 768GB Optane configuration, built from 6x128GB modules, beats the best NVMe SSDs on latency by a wide margin but remains 2x to 3x slower than DRAM.
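A quick capacity check, as a sketch: given the 6x128GB pool, how many bits per weight can a 1T-parameter model afford before it no longer fits? This ignores the KV cache, OS overhead, and the tensors that live on the GPU, so treat the result as a loose upper bound rather than the actual quantization used.

```python
# Figures from the post.
MODULES = 6
MODULE_GB = 128
TOTAL_PARAMS = 10**12  # 1T parameters


def pool_capacity_gb(modules: int, module_gb: int) -> int:
    """Total capacity of the Optane pool in GB."""
    return modules * module_gb


def max_bits_per_weight(total_params: int, capacity_gb: int) -> float:
    """Largest average bits per weight that still fits in the pool.

    Simplification: assumes the whole pool is available for weights.
    """
    capacity_bits = capacity_gb * 10**9 * 8
    return capacity_bits / total_params


capacity = pool_capacity_gb(MODULES, MODULE_GB)
print(f"pool capacity: {capacity} GB")
print(f"max average bits per weight: {max_bits_per_weight(TOTAL_PARAMS, capacity):.3f}")
```

In other words, the full model has to be quantized to roughly 6 bits per weight or less to sit entirely in the Optane tier.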

llama.cpp handled hybrid GPU/CPU inference, with tensor placement tuned through flags like --override-tensor.
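A minimal sketch of what such a placement might look like, written as a Python helper that assembles a llama.cpp command line. This is not the author's actual command: the model path is hypothetical, the regex is one common way to target routed-expert tensors, and the tensor names used for the check are assumed to follow the usual GGUF naming. The idea is to offload all layers to the GPU with --n-gpu-layers, then use --override-tensor to send the routed-expert tensors back to CPU-side memory.

```python
import re
import shlex

# ASSUMPTION: hypothetical model path, for illustration only.
MODEL_PATH = "/models/kimi-k2.5.gguf"

# ASSUMPTION: regex for routed-expert tensors (names like
# "blk.<n>.ffn_up_exps.weight"). Shared experts ("..._shexp...") do not match.
EXPERT_PATTERN = r"\.ffn_.*_exps\."


def build_command(model_path: str, expert_pattern: str) -> list[str]:
    """Assemble a llama-server invocation that keeps routed experts off the GPU."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "99",                          # offload every layer by default
        "--override-tensor", f"{expert_pattern}=CPU",    # then pin experts to CPU memory
    ]


def stays_on_cpu(tensor_name: str, expert_pattern: str) -> bool:
    """True if the override pattern would match this tensor name."""
    return re.search(expert_pattern, tensor_name) is not None


cmd = build_command(MODEL_PATH, EXPERT_PATTERN)
print(shlex.join(cmd))

# Illustrative tensor names (assumed, not read from the real model file).
for name in ("blk.3.ffn_up_exps.weight", "blk.3.ffn_up_shexp.weight", "blk.3.attn_q.weight"):
    print(name, "-> CPU" if stays_on_cpu(name, EXPERT_PATTERN) else "-> GPU")
```

With this pattern, attention and shared-expert tensors stay on the GPU, while the large routed-expert tensors land in system memory, which in the setup described here is the Optane pool fronted by DDR4.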

The result was roughly 4 tokens/sec, which is slow for chat but impressive for a local 1T-parameter model on cheap retired enterprise hardware.

The DDR4 acted as cache, the Optane acted as a giant memory pool, and llama.cpp pushed routing and other critical tensors onto the 12GB GPU.


Rohan Paul@rohanpaul_ai · X
