VibeThinker是一个仅3B参数的推理模型,采用SFT+GRPO训练,在推理基准上与Opus 4.5几乎持平。在AIME26上达94.3,LiveCodeBench v6上80.2 Pass@1,近期未见过的LeetCode竞赛中接受率达96.1%,匹配或超越DeepSeek V3.2等大数个量级的旗舰系统。模型基于Qwen2.5-Coder 3B,经过硬样本筛选、多解监督训练、数学/代码/STEM可验证奖励强化学习、自蒸馏、指令聚焦RL及测试时答案检查方法CLR训练而成。
VibeThinker is a 3B param model, with almost head to head benchmark result with Opus 4.5 on reasoning with novel SFT+GRPO.
Unusually strong for its size: with only 3B parameters, 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance on recent unseen LeetCode contests.
"places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2"
They start from a 3B Qwen2.5-Coder base model, then train it with carefully filtered hard examples, multi-solution supervised training, reinforcement learning on math/code/STEM tasks with verifiable rewards, self-distillation, instruction-focused RL, and a test-time answer-checking method called CLR.