# Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

- Source: MarkTechPost (RSS)
- Author: Asif Razzaq
- Published: 2026-06-23 15:20
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmqqc5zvw08heslp5zzcvss8r
- Original link: https://www.marktechpost.com/2026/06/23/prime-intellect-releases-prime-rl-0-6-0-to-train-trillion-parameter-moe-models-on-agentic-rl-workloads

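## AI Summary

prime-rl 0.6.0 is an open-source asynchronous reinforcement learning framework that targets trillion-parameter MoE models and focuses on long-horizon agentic tasks such as software engineering. The research team trained GLM-5 on SWE tasks at sequence lengths up to 131k, with step times under five minutes, a batch size of 256, and only 28 H200 nodes. Inference optimizations include FP8 (DeepEP and DeepGEMM kernels), wide expert parallelism (≥32 GPUs), prefill/decode disaggregation, tiered KV cache offloading (vLLM native or Mooncake Store), and router replay (R3, which cuts KL mismatch by roughly an order of magnitude). Training builds on torchtitan and uses 3-D parallelism (FSDP2, context parallelism, expert parallelism) plus block-scaled FP8 (as proposed by DeepSeek V3) to match inference precision and stabilize training.

## Full Text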

Prime Intellect has released prime-rl version 0.6.0. The framework targets reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. It focuses on heavy agentic workloads, like long-horizon software-engineering tasks.

The research team trained GLM-5 on SWE tasks at up to 131k sequence length. Step times stayed under five minutes. The batch size was 256 rollouts. The run used only 28 H200 nodes.

### TL;DR

- prime-rl 0.6.0 trains trillion-parameter MoE models on agentic RL workloads.
- GLM-5 trained on SWE at 131k sequence length, sub-5-minute steps, 28 H200 nodes.
- Asynchronous RL disaggregates trainer and inference for independent optimization.
- Inference uses FP8, Wide EP, P/D disaggregation, KV offloading, and router replay.
- Training uses 3-D parallelism (FSDP, EP, CP) plus block-scaled FP8.

### What is prime-rl 0.6.0?

prime-rl is an open framework for asynchronous reinforcement learning. It post-trains large open-source models on agentic tasks. Version 0.6.0 extends this to trillion-parameter MoE scale.

The example model in the announcement is zai-org/GLM-5.1. The optimizations also apply to other large MoE models. Examples include moonshotai/Kimi-K2.7-Code and nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

A full GLM-5.1 run starts with one command on a Slurm cluster.

uv run rl @ examples/glm5_llmd/rl.toml --output-dir /shared/outputs/glm5-llmd

### Role of asynchronous RL

Agentic tasks have long-tail outliers. Some coding rollouts run for hours. Waiting for them before each policy update would idle GPUs.

Asynchronous RL avoids this. The trainer and inference systems are disaggregated. They run and scale independently. The inference policy updates as soon as the optimizer step finishes.

There is one synchronization point: the policy update. prime-rl pushes new weights as soon as they exist. Already-dispatched rollouts keep their active prefix cache. So a single rollout may mix tokens from several policy versions.

New rollouts behave differently. They repopulate their own KV cache, even when prefixes match. A KV-cache salt forces this. Requests from too old a policy are dropped. The max_off_policy_steps value controls that threshold.
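
The staleness rule and the cache salt described above can be sketched in a few lines. This is a minimal illustration, not prime-rl's implementation: the `Rollout` class, the `is_too_stale` and `kv_cache_salt` helpers, and the strict "greater than" comparison are all assumptions made for the example. Only the `max_off_policy_steps` name comes from the article.

```python
from dataclasses import dataclass
import hashlib


@dataclass
class Rollout:
    rollout_id: str
    start_policy_step: int  # policy version when the rollout was dispatched


def is_too_stale(rollout: Rollout, current_step: int, max_off_policy_steps: int) -> bool:
    """Assumed rule: drop a rollout that started more than
    max_off_policy_steps policy updates ago."""
    return current_step - rollout.start_policy_step > max_off_policy_steps


def kv_cache_salt(policy_step: int) -> str:
    """Hypothetical salt that changes with every policy version, so cache
    keys written under an older policy never match a new request."""
    return hashlib.sha256(f"policy-step-{policy_step}".encode()).hexdigest()[:16]


rollouts = [Rollout("a", 10), Rollout("b", 7), Rollout("c", 3)]
kept = [
    r.rollout_id
    for r in rollouts
    if not is_too_stale(r, current_step=10, max_off_policy_steps=4)
]
print(kept)  # ['a', 'b']
```

Rollout `c` started seven updates ago and is dropped, while `a` and `b` stay within the threshold. Mixing the policy step into the salt is one simple way to make identical prefixes miss the cache after a weight update.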

### Inference optimizations

Inference is usually the throughput bottleneck in an RL system. prime-rl optimizes for throughput, while keeping latency bounded.

FP8 inference: Lower precision speeds up prefill and decode. prime-rl uses FP8 with DeepEP and DeepGEMM kernels.

Wide Expert Parallelism: Wide EP spreads experts across ≥32 GPUs. It pairs with a large data-parallel size, for example 32. Each GPU holds its own subset of experts and serves as a request endpoint. Synchronization happens per layer, through dispatch and combine operations.
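
To make the dispatch and combine steps concrete, here is a toy, single-process sketch. It assumes a hypothetical model with 256 experts split evenly over 32 ranks, and it replaces real expert computation and all-to-all communication with plain Python lists. The function names are illustrative and not taken from prime-rl, vLLM, or DeepEP.

```python
from collections import defaultdict


def expert_to_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Contiguous placement: each rank owns num_experts // ep_size experts."""
    assert num_experts % ep_size == 0
    return expert_id // (num_experts // ep_size)


def dispatch(token_routes, num_experts, ep_size):
    """Group (token_id, expert_id) pairs by the rank that owns the expert."""
    per_rank = defaultdict(list)
    for token_id, experts in enumerate(token_routes):
        for expert_id in experts:
            rank = expert_to_rank(expert_id, num_experts, ep_size)
            per_rank[rank].append((token_id, expert_id))
    return dict(per_rank)


def combine(per_rank_outputs, num_tokens):
    """Sum expert outputs back per token (router weights omitted)."""
    out = [0.0] * num_tokens
    for items in per_rank_outputs.values():
        for token_id, value in items:
            out[token_id] += value
    return out


# Two tokens, each routed to two experts (top_k = 2 in this toy case).
routes = [[0, 9], [8, 255]]
sent = dispatch(routes, num_experts=256, ep_size=32)
print(sent)  # {0: [(0, 0)], 1: [(0, 9), (1, 8)], 31: [(1, 255)]}

# Pretend every expert returns 1.0 for every token it receives.
outputs = {rank: [(tok, 1.0) for tok, _ in items] for rank, items in sent.items()}
print(combine(outputs, num_tokens=2))  # [2.0, 2.0]
```

Only tokens move between ranks; expert weights stay where they are, which is why this exchange has to happen once per MoE layer.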

Prefill and Decode Disaggregation: Some model↔env pairs hit a 4:1 prefill:decode token ratio. Shared workers would inflate end-to-end latency. That reduces the benefits of PipelineRL. P/D disaggregation separates prefill and decode workers. Long tool outputs then stop throttling decode workers.

KV cache management: High concurrency needs large KV cache space. prime-rl supports tiered offloading to CPU and disk. vLLM native offloading creates one pool per worker. Mooncake Store instead pools RAM and disk across all nodes centrally.

Request routing: prime-rl ships a fork of vllm-router by default. It also supports the NVIDIA Dynamo router as a drop-in. Routers score workers using KV cache reuse, queue depth, and live load.
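
A router that weighs those three signals might look like the sketch below. The `WorkerState` fields, the linear scoring formula, and the weights are all hypothetical. The article does not describe how vllm-router or NVIDIA Dynamo actually combine the signals.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    cached_prefix_tokens: int  # prompt tokens already present in this worker's KV cache
    queue_depth: int           # requests waiting on this worker
    load: float                # live utilization in [0, 1]


def score(worker: WorkerState, prompt_tokens: int,
          w_cache: float = 1.0, w_queue: float = 0.1, w_load: float = 0.5) -> float:
    """Higher is better: reward cache reuse, penalize queueing and load."""
    reuse = min(worker.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return w_cache * reuse - w_queue * worker.queue_depth - w_load * worker.load


def pick_worker(workers, prompt_tokens: int) -> WorkerState:
    return max(workers, key=lambda w: score(w, prompt_tokens))


workers = [
    WorkerState("w0", cached_prefix_tokens=8000, queue_depth=6, load=0.9),
    WorkerState("w1", cached_prefix_tokens=6000, queue_depth=1, load=0.4),
    WorkerState("w2", cached_prefix_tokens=0, queue_depth=0, load=0.1),
]
print(pick_worker(workers, prompt_tokens=10000).name)  # w1
```

In this toy setup `w0` has the best cache hit but is overloaded, so the request goes to `w1`, which trades a slightly worse hit rate for a much shorter queue.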

Router replay (R3): Trainer↔inference mismatch silently kills training. Router replay captures inference routing decisions. It replays them directly on the trainer. This cuts KL mismatch by roughly an order of magnitude. Routed experts have shape [num_layers, top_k, seq_len]. This payload can grow to hundreds of GB. At scale, the data rate reaches tens of Gbps. So prime-rl treats it as an opaque payload. Optimized PyTorch operations handle the processing.
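
The size of the routed-experts payload follows directly from its shape. The sketch below estimates it for a single full-length rollout. The layer count (78) and sequence length (131k) appear in the article, but `top_k = 8` and 2-byte expert indices are assumptions picked for illustration, so the result is a rough guide only.

```python
def routed_experts_bytes(num_layers: int, top_k: int, seq_len: int,
                         num_rollouts: int = 1, bytes_per_index: int = 2) -> int:
    """Bytes needed to store expert indices of shape
    [num_layers, top_k, seq_len] for num_rollouts rollouts."""
    return num_layers * top_k * seq_len * num_rollouts * bytes_per_index


per_rollout = routed_experts_bytes(num_layers=78, top_k=8, seq_len=131072)
print(per_rollout)                  # 163577856
print(round(per_rollout / 1e6, 1))  # 163.6 (MB per full-length rollout)
```

Under these assumptions a single rollout already carries well over a hundred megabytes of routing data. The total scales linearly with the number of rollouts in flight and with the index width.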

### Training optimizations

The trainer builds on torchtitan, a PyTorch-native training codebase. It relies on 3-D parallelism: FSDP, CP, and EP. The GLM-5 case study uses all three.

| Strategy | What it shards | Primary use | Key detail |
|---|---|---|---|
| FSDP (FSDP2) | Parameters, gradients, optimizer states | Baseline memory amortization | Gathers weights on demand per layer via `fully_shard` |
| Expert Parallelism (EP) | Experts within a layer | Shrinks active layer memory | all2all dispatch/combine; torch-native or DeepEP |
| Context Parallelism (CP) | The sequence dimension | Long-context activation memory | Ulysses (default) or Ring Attention |

EP exists because layers stay huge after FSDP. With 78 layers and 800B params in float32, one layer's all-gather needs roughly 40GB. Prefetching the next layer to overlap communication with compute keeps two layers resident, which pushes that near 80GB. Setting EP=8 dispatches tokens instead of gathering full experts. torch-native all2all is slightly faster within one node. DeepEP wins when EP spans multiple nodes.
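
The per-layer figures above can be reproduced with simple arithmetic. This sketch assumes parameters are spread evenly across layers, which is a simplification. The helper name is illustrative.

```python
def per_layer_allgather_gb(total_params: float, num_layers: int,
                           bytes_per_param: int) -> float:
    """Memory in GB to materialize one layer's full weights,
    assuming parameters are split evenly across layers."""
    return total_params * bytes_per_param / num_layers / 1e9


one_layer = per_layer_allgather_gb(total_params=800e9, num_layers=78, bytes_per_param=4)
print(round(one_layer, 1))      # 41.0
print(round(2 * one_layer, 1))  # 82.1 (current layer plus one prefetched layer)
```

Both numbers line up with the rough 40GB and 80GB figures quoted in the text.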

CP matters at 131k+ sequence length. There, activations dominate memory, not parameters. GLM-5 uses DSA, which neither Ulysses nor Ring Attention parallelizes directly. So prime-rl ships a custom context-parallel implementation for it.

FP8 training: prime-rl uses DeepGEMM block-scaled FP8, as proposed by DeepSeek V3. This rarely raises throughput, because quantization overhead offsets the faster matrix multiplies. Its real value is matching trainer and inference precision. That reduces KL mismatch and stabilizes training.
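
Block scaling means each small block of values gets its own scale factor, so an outlier in one block does not crush the resolution of the others. The sketch below shows only the scaling step, in plain Python. It omits the actual rounding to an FP8 bit pattern, uses a toy block size of 4, and is not DeepGEMM's kernel. The constant 448 is the largest finite value in the FP8 E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def block_scales(values, block_size):
    """One scale per block, chosen so the block's max magnitude maps to 448."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(values, block_size):
    """Divide each value by its block's scale (FP8 rounding omitted)."""
    scales = block_scales(values, block_size)
    scaled = [v / scales[i // block_size] for i, v in enumerate(values)]
    return scaled, scales


def dequantize(scaled, scales, block_size):
    """Multiply each value back by its block's scale."""
    return [q * scales[i // block_size] for i, q in enumerate(scaled)]


values = [0.5, -1.0, 0.25, 0.75, 100.0, -50.0, 25.0, 10.0]
scaled, scales = scale_to_fp8_range(values, block_size=4)
print(len(scales))  # 2
print(all(abs(q) <= FP8_E4M3_MAX + 1e-9 for q in scaled))  # True
```

With a single scale for the whole tensor, the 100.0 outlier would squeeze the first four values into a tiny slice of the FP8 range. With per-block scales, each block uses the full range.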

### Use cases with examples

Long-horizon SWE agents: Train a model on real repository issues. Rollouts can span hundreds of turns and tool calls. P/D disaggregation keeps decode latency predictable here.

1T-scale post-training on fewer nodes: The GLM-5 run fit on 28 H200 nodes. Wide EP and KV offloading raise concurrency and throughput.

Stable agentic RL at scale: Router replay and FP8 training both reduce trainer↔inference KL mismatch. Lower mismatch means steadier training.