# ByteDance and Renmin University of China release diffusion language model iLLaDA, matching Qwen2.5 in base capabilities

- Source: The Decoder: AI News (RSS)
- Author: Maximilian Schreiner
- Published: 2026-06-27 15:48
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmqw2xh180014sl1nkilmvj1m
- Original link: https://the-decoder.com/bytedances-illada-is-a-diffusion-language-model-that-keeps-up-with-qwen2-5

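## AI Summary

Renmin University of China and ByteDance have jointly released iLLaDA, a dense 8B-parameter diffusion language model trained from scratch. The model starts from a sequence of masked tokens and refines the text bidirectionally over multiple parallel iterations, unlike autoregressive models, which generate one token at a time. iLLaDA was pretrained on 12 trillion tokens and fine-tuned for 12 epochs. The base version, iLLaDA-Base, averages 63.9 points, slightly ahead of Qwen2.5 7B at 63.3, and gains 21.6 points on the reasoning test BBH to reach 71.3. The instruction-tuned version, iLLaDA-Instruct, scores 67.1 and trails Qwen2.5 7B Instruct at 77.1. The gap is mainly in math and code tasks, which the authors attribute to the lack of additional reinforcement learning alignment.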

## Article

ByteDance's "iLLaDA" is a diffusion language model that keeps up with Qwen2.5

Maximilian Schreiner

Jun 27, 2026

Researchers from Renmin University and ByteDance have released iLLaDA, an 8B-parameter diffusion language model that generates text differently from ChatGPT. It matches Qwen2.5 at the base level but falls behind after instruction tuning.

Nearly all well-known AI language models, such as GPT, Claude, and Qwen, generate text autoregressively: token by token, left to right, with each new token depending only on the ones before it.

Diffusion language models take a different approach. They start with a sequence of placeholders, called masked tokens, and refine them across multiple passes in parallel. It's similar to how image models shape a picture from noise. Every position can attend to every other position at the same time, making the process bidirectional.
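
To make the iterative unmasking concrete, here is a minimal Python sketch. It is a toy illustration, not iLLaDA's sampler: `toy_predict` is a hypothetical stand-in for a real model (it simply looks up a known target text and attaches a random confidence), and the rule of revealing the most confident positions first is an assumed heuristic. What it shows is the control flow, in which all masked positions are predicted in the same pass and several are filled in per step.

```python
import random

MASK = "<mask>"


def toy_predict(seq, target, rng):
    """Hypothetical model stand-in: predict every masked position at once.

    Returns {position: (token, confidence)}. A real model would compute
    these from the whole sequence; here we cheat and read the target.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(seq)
        if tok == MASK
    }


def diffusion_decode(target, steps=4, seed=0):
    """Start fully masked, then unmask a batch of positions per step."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    history = [list(seq)]
    for step in range(steps):
        preds = toy_predict(seq, target, rng)
        if not preds:
            break  # nothing left to fill in
        # Spread the remaining masks evenly over the remaining steps
        # (ceiling division), so the last step always finishes the job.
        remaining_steps = steps - step
        k = -(-len(preds) // remaining_steps)
        # Assumed heuristic: commit the k most confident predictions.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    final, trace = diffusion_decode(words, steps=4)
    for state in trace:
        print(" ".join(state))
```

Unlike an autoregressive loop, the order in which positions are filled in here is not left to right; it depends only on the confidence scores.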

iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind released DiffusionGemma. That model generates text about four times faster via diffusion but scores worse than the similarly sized autoregressive Gemma 4 on benchmarks such as MMLU and on coding tasks. Google recommends it for low-latency use cases, not quality-critical production.

The two models follow opposite design philosophies. DiffusionGemma is built on the Gemma 4 backbone, a 25-billion-parameter mixture-of-experts model, and swaps only the generation method to prioritize speed. iLLaDA, short for "improved LLaDA," goes the other way: it's a dense 8B model trained from scratch, focused on quality.

The question behind all of this is whether a diffusion model built from the ground up can actually keep up with autoregressive models. A direct numerical comparison between the two is tough, though. Google uses partly different and harder benchmark variants, and DiffusionGemma plays in a different weight class.

### What iLLaDA can do

The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH, for example. On average it hits 63.9 points, edging just past the autoregressive Qwen2.5 7B at 63.3.

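| | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Model type | Diffusion | Diffusion | Diffusion | AR |
| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |
| **General Tasks** | | | | |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |
| Hellaswag | 76.6 | 70.5 | 73.3 | 79.0 |
| **Mathematics & Science** | | | | |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| Math | 38.4 | 31.4 | 39.6 | 41.1 |
| **Code** | | | | |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| **Average** | 63.9 | 51.1 | 61.4 | 63.3 |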
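
As a quick sanity check, the Average row can be recomputed from the eight benchmark scores above. The sketch below assumes the average is an unweighted mean over the eight listed benchmarks, which the article does not state explicitly; the variable names are illustrative.

```python
# Benchmark order: MMLU, BBH, ARC-C, Hellaswag, GSM8K, Math, HumanEval, MBPP
benchmarks = ["MMLU", "BBH", "ARC-C", "Hellaswag",
              "GSM8K", "Math", "HumanEval", "MBPP"]

scores = {
    "iLLaDA 8B":  [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B":   [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B":   [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Averages as reported in the table.
reported = {
    "iLLaDA 8B": 63.9,
    "LLaDA 8B": 51.1,
    "Dream 7B": 61.4,
    "Qwen2.5 7B": 63.3,
}

# Assumption: the reported average is a plain mean over all eight scores.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# BBH gain of iLLaDA over its predecessor LLaDA.
bbh = benchmarks.index("BBH")
bbh_gain = scores["iLLaDA 8B"][bbh] - scores["LLaDA 8B"][bbh]

if __name__ == "__main__":
    for model, avg in averages.items():
        print(f"{model}: recomputed {avg:.2f}, reported {reported[model]}")
    print(f"BBH gain over LLaDA: {bbh_gain:.1f}")
```

Under that assumption, each recomputed mean lands within 0.1 point of the reported figure, and the BBH difference between iLLaDA and LLaDA comes out to the 21.6 points cited in the text.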

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream is ahead only on HumanEval (57.9 vs. 50.0) and, narrowly, on Math (39.6 vs. 38.4).

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
