阿里巴巴提出一种通过双强化学习飞轮训练智能体的新方法,并基于此推出了AgenticQwen-30B-A3B模型。该模型总参数量为300亿,但每次推理仅激活30亿参数,在TAU-2和BFCL-V4多轮工具使用基准测试中取得了50.2的平均分,性能与参数量达2350亿的Qwen3-235B相当。其核心在于并行运行两个飞轮:推理循环将模型自身错误转化为更难训练问题;智能体循环则将简单工具使用轨迹扩展为多分支行为树,并通过模拟用户误导主动增加训练难度。该方法意味着开发者无需为常规工具任务支付高昂的尖端模型成本,且飞轮配方可复用,能从智能体自身失败中生成困难样本。
NEW paper from Alibaba.
A 30B MoE with only 3B active params matches Qwen3-235B on real tool-use workloads.
AgenticQwen-30B-A3B: 50.2 average on TAU-2 + BFCL-V4 Multi-Turn.
AgenticQwen-8B: 47.4.
Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model.
How: two RL flywheels run in parallel.
- The reasoning loop mines the model's own errors into harder problems each round.