# Alibaba Releases New Agent Training Method: Dual Reinforcement Learning Flywheels Yield an Efficient Tool-Use Model

- Source: elvis (@omarsar0)
- Published: 2026-04-27 04:49
- AIHOT score: 64
- AIHOT link: https://aihot.virxact.com/items/cmog94bz100bisl5r0zsw3i1j
- Original link: https://x.com/omarsar0/status/2048504655932760565

## AI Summary

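Alibaba proposes a new method for training agents with dual reinforcement learning flywheels and, based on it, releases the AgenticQwen-30B-A3B model. The model has 30 billion total parameters but activates only 3 billion per inference. It scores a 50.2 average on the TAU-2 and BFCL-V4 multi-turn tool-use benchmarks, on par with the 235-billion-parameter Qwen3-235B. At its core, two flywheels run in parallel: the reasoning loop turns the model's own errors into harder training problems, while the agentic loop expands simple tool-use trajectories into multi-branch behavior trees and actively raises training difficulty through simulated users who try to mislead the agent. The method means developers need not pay frontier-model prices for routine tool-use tasks, and the flywheel recipe is reusable: it can generate hard examples from an agent's own failures.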

## Full Text

NEW paper from Alibaba.

A 30B MoE with only 3B active params matches Qwen3-235B on real tool-use workloads.

AgenticQwen-30B-A3B: 50.2 average on TAU-2 + BFCL-V4 Multi-Turn.

AgenticQwen-8B: 47.4.

Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model.

How: two RL flywheels run in parallel.

- The reasoning loop mines the model's own errors into harder problems each round.

- The agentic loop grows simple linear tool-use trajectories into multi-branch behavior trees.

- Simulated users actively try to mislead the agent. The training distribution gets harder on its own.
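The agentic loop described above can be pictured with a toy data structure. This is a minimal sketch, not the paper's implementation: `Node`, `linear_to_tree`, `add_branches`, `count_paths`, and the action names are all hypothetical, and the post does not say how branches are actually generated. The sketch only shows how a single linear trajectory becomes a tree with several distinct paths once alternative events (such as a misleading user turn) are attached at each step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step in a tool-use behavior tree (hypothetical structure)."""
    action: str
    children: list["Node"] = field(default_factory=list)


def linear_to_tree(steps: list[str]) -> Node:
    """Represent a linear trajectory as a tree with a single root-to-leaf path."""
    root = Node(steps[0])
    current = root
    for step in steps[1:]:
        nxt = Node(step)
        current.children.append(nxt)
        current = nxt
    return root


def add_branches(node: Node, alternatives: dict[str, list[str]]) -> None:
    """Attach alternative follow-up events after each action.

    `alternatives` maps an action name to events that could follow it instead
    of the original next step (assumed here to be supplied by hand).
    """
    for child in list(node.children):
        add_branches(child, alternatives)
    for alt in alternatives.get(node.action, []):
        node.children.append(Node(alt))


def count_paths(node: Node) -> int:
    """Count root-to-leaf paths, i.e. distinct trajectories in the tree."""
    if not node.children:
        return 1
    return sum(count_paths(child) for child in node.children)


steps = ["lookup_order", "check_refund_policy", "issue_refund"]
tree = linear_to_tree(steps)
add_branches(tree, {
    "lookup_order": ["user_gives_wrong_order_id"],
    "check_refund_policy": ["user_claims_false_exception"],
})
print(count_paths(tree))
```

In this toy case the single original trajectory grows to three distinct paths, two of which end in a misleading user turn that the agent would have to handle.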

Why it matters for agent devs: you can stop paying frontier prices for routine tool-use workloads.

And the flywheel recipe is reusable. Generate your hard examples from your own agent's failures, not from static synthetic data.
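One way to read "generate hard examples from your own agent's failures" is as a loop over rounds. The sketch below is illustrative only: `flywheel_round`, `solve`, and `harden` are hypothetical names, and the post does not describe how the paper rewrites failures into harder problems. Here `solve` stands in for running your agent on a task and checking success, and `harden` stands in for whatever rewrites a failed task into a harder variant.

```python
from typing import Callable


def flywheel_round(
    tasks: list[str],
    solve: Callable[[str], bool],
    harden: Callable[[str], str],
) -> list[str]:
    """Run one round: keep only the tasks the agent failed, then harden them.

    Solved tasks are dropped in this sketch; a real pipeline might keep some
    of them to avoid forgetting.
    """
    failed = [task for task in tasks if not solve(task)]
    return [harden(task) for task in failed]


# Toy stand-ins: the "agent" only solves short tasks, and hardening
# appends a distractor to the task description.
def toy_solve(task: str) -> bool:
    return len(task) < 10


def toy_harden(task: str) -> str:
    return task + " (with a distractor)"


next_round = flywheel_round(
    ["add 2+2", "book a multi-city trip"], toy_solve, toy_harden
)
print(next_round)
```

Feeding `next_round` back in as `tasks` after each training step gives the flywheel shape: the training set is rebuilt from the current agent's failures instead of being fixed in advance.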

Paper: https://arxiv.org/abs/2604.21590

Learn to build effective AI agents in our academy: https://academy.dair.ai/
