# Google's new open model DiffusionGemma: generating text from noise instead of word by word

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-06-11 03:20
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmq8ggxdy0299slldt21fzf73
- Original link: https://the-decoder.com/googles-new-open-model-diffusiongemma-generates-text-from-noise-instead-of-word-by-word

## AI Summary

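Google has released DiffusionGemma, a 26-billion-parameter model that no longer predicts text token by token but generates it from noise through a diffusion process, much as image AI turns noise into pictures. Nvidia's tests show about 1,000 tokens per second on a single H100 GPU, roughly four times the speed of a comparable autoregressive model. The trade-off is lower output quality, so Google currently positions it as an experimental tool for developers. The model has been released with open weights.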

## Full Text

Google's new open model DiffusionGemma generates text from noise instead of word by word

Key Points

- Google has released DiffusionGemma, an experimental language model that generates text using a diffusion-based method, producing blocks of 256 tokens at once rather than generating text word by word.
- By processing tokens in parallel, the model makes better use of graphics processors, achieving speeds up to four times faster than traditional models when running in single-user mode on dedicated GPUs.
- While the generated text quality is lower compared to conventional models, the approach is particularly well suited for non-linear tasks such as inserting text after the fact or filling in gaps in program code.

Google released an experimental model with open weights that generates text through diffusion instead of word by word. On a single GPU, it runs up to four times faster in single-user mode than classic language models. Nvidia handled the optimization.

Most language models generate one token after another, conditioning each new token on the ones that came before it. DiffusionGemma takes a different approach. It starts with a block of 256 random placeholder tokens and refines them across several passes until readable text emerges. The idea comes from image AI, where diffusion models turn noise into clear images.
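The refinement loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not DiffusionGemma's actual sampler: `toy_denoiser` is a hypothetical stand-in for the trained network (it simply knows the target text), the confidence scores are random, and the rule of committing the most confident slots on each pass is borrowed from common masked-diffusion decoders rather than taken from Google's description. The block holds 8 tokens instead of 256 to keep the output readable.

```python
import random

MASK = "<mask>"  # placeholder token; the name is an assumption for this sketch


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the network.

    For every slot that is still a placeholder, propose a token together
    with a made-up confidence score between 0 and 1.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def generate_block(target, num_passes=4, seed=0):
    """Start from placeholders only and commit a few slots per pass."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [list(block)]
    for step in range(num_passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Spread the remaining slots evenly over the remaining passes
        # (ceiling division, so the final pass fills whatever is left).
        quota = -(-len(proposals) // (num_passes - step))
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:quota]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


target = "the quick brown fox jumps over the dog".split()
final, history = generate_block(target)
for n, state in enumerate(history):
    print(n, " ".join(state))
```

Each pass looks at the whole block, so slots are filled in order of confidence rather than from left to right.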

The model has 26 billion parameters total but only activates 3.8 billion per step. That's thanks to a mixture-of-experts architecture, where several specialized sub-networks sit side by side and only the right ones fire depending on the input. When quantized to lower precision, the model fits into 18 GB of VRAM on high-end consumer GPUs, according to Google. It builds on the Gemma 4 family and borrows its diffusion process from Google's earlier research on Gemini Diffusion.
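A rough back-of-the-envelope check puts those figures in context. The snippet below estimates the memory taken by the weights alone at a few bit widths. The parameter counts and the 18 GB figure come from the article; the bit widths are illustrative assumptions (the article does not say which quantization format was used), and activations, caches, and runtime overhead are ignored. With a mixture-of-experts model, all experts generally have to sit in memory even though only a fraction is active per step, which is why the calculation uses the total parameter count.

```python
TOTAL_PARAMS = 26e9    # from the article
ACTIVE_PARAMS = 3.8e9  # from the article
VRAM_BUDGET_GB = 18    # from the article


def weight_footprint_gb(num_params, bits_per_weight):
    """Memory for the weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_per_weight(num_params, budget_gb):
    """Average bits per weight that would exactly fill the budget."""
    return budget_gb * 1e9 * 8 / num_params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for bits in (16, 8, 4):  # assumed bit widths, for illustration only
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(TOTAL_PARAMS, bits):.0f} GB")

print(f"average bits per weight that fit in {VRAM_BUDGET_GB} GB: "
      f"{max_bits_per_weight(TOTAL_PARAMS, VRAM_BUDGET_GB):.1f}")
print(f"share of parameters active per step: {active_fraction:.1%}")
```

Under these assumptions, 16-bit weights would need far more than 18 GB, so the quoted figure implies an average of roughly five to six bits per weight or fewer.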

### Better GPU usage explains the speed gains

Nvidia says the speed advantage comes down to hardware usage. With autoregressive models, single-user inference is often bottlenecked by memory bandwidth. The GPU's compute units sit idle most of the time, just waiting for data from memory. Engineers call this memory-bound. DiffusionGemma sidesteps the problem by processing up to 256 tokens in parallel, pushing the bottleneck toward raw compute instead. The result is that GPUs actually stay busy.
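The memory-bound versus compute-bound distinction can be made concrete with a simplified roofline-style estimate. This is a sketch under loud assumptions, not Nvidia's analysis: the bandwidth and peak-compute figures are illustrative round numbers rather than specifications of any particular GPU, weights are assumed to be 16-bit and read once per forward pass, the cost is approximated as 2 FLOP per active weight per token, and attention, caches, and the number of refinement passes per block are ignored. For a mixture-of-experts model, more tokens per pass would also touch more experts, which this sketch does not model.

```python
ACTIVE_PARAMS = 3.8e9   # active parameters per step, from the article
BYTES_PER_WEIGHT = 2    # assumption: 16-bit weights
MEM_BANDWIDTH = 3e12    # assumption: bytes per second, illustrative round number
PEAK_COMPUTE = 1e15     # assumption: FLOP per second, illustrative round number


def pass_times(tokens_in_parallel):
    """Return (seconds moving weights, seconds doing math) for one forward pass."""
    bytes_moved = ACTIVE_PARAMS * BYTES_PER_WEIGHT   # weights read once per pass
    flops = 2 * ACTIVE_PARAMS * tokens_in_parallel   # ~2 FLOP per weight per token
    return bytes_moved / MEM_BANDWIDTH, flops / PEAK_COMPUTE


def bottleneck(tokens_in_parallel):
    """Name the resource that determines how long the pass takes."""
    memory_s, compute_s = pass_times(tokens_in_parallel)
    return "memory" if memory_s >= compute_s else "compute"


def compute_utilization(tokens_in_parallel):
    """Fraction of the pass during which the compute units have work."""
    memory_s, compute_s = pass_times(tokens_in_parallel)
    return compute_s / max(memory_s, compute_s)


# Block size at which both limits take equally long under these assumptions.
break_even_tokens = BYTES_PER_WEIGHT * PEAK_COMPUTE / (2 * MEM_BANDWIDTH)

for tokens in (1, 256):
    print(tokens, bottleneck(tokens), f"{compute_utilization(tokens):.1%}")
print(f"break-even at about {break_even_tokens:.0f} tokens per pass")
```

With one token per pass the compute units are almost entirely idle in this model; with 256 tokens per pass they are busy most of the time, which matches the article's description of the bottleneck moving toward compute.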

Nvidia reports about 1,000 tokens per second on an H100 when processing a single request, 150 tokens per second on the DGX Spark deskside system, and up to 2,000 tokens per second on the DGX Station. On the GeForce RTX 5090, Google claims more than 700 tokens per second. In local single-user mode, the model runs about four times faster on dedicated GPUs than a comparable autoregressive model.

Google ties this effect to dedicated accelerators. On shared-memory setups like Apple Silicon, which are themselves often limited by memory bandwidth during inference, the gap over autoregressive models is likely smaller.

In cloud serving with many parallel requests, the advantage flips. Autoregressive models already keep the hardware busy in that scenario, so DiffusionGemma can actually drive costs up, according to Google.

### Speed comes at a quality cost, but new use cases open up

DiffusionGemma trades output quality for speed. Google still recommends the regular Gemma 4 models when quality matters most and positions DiffusionGemma as a tool for researchers and developers experimenting with local, fast workflows.

Where Google sees real strengths is in tasks that don't work left to right. Because the model considers the entire block at once, each token can reference every other token during generation, including ones that come later. Traditional language models can only look backward.
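In transformer terms, this difference is usually expressed as an attention mask. The sketch below is a generic illustration, assuming the standard causal-mask formulation for autoregressive models; the article does not describe DiffusionGemma's attention implementation, and the 4-token block is only for display.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may look at position j."""
    return [[(j <= i) or not causal for j in range(n)] for i in range(n)]


def render(mask):
    """Draw the mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


causal_mask = attention_mask(4, causal=True)
full_mask = attention_mask(4, causal=False)

print("autoregressive (causal):")
print(render(causal_mask))
print("whole block visible:")
print(render(full_mask))
```

In the causal mask, each row only has entries at or before its own position; in the full mask, every position can see every other one, including those that come later.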

That makes it useful for inserting text into existing paragraphs, filling gaps in code, or working with structured data like amino acid sequences and mathematical graphs. As an example, Google points to an Unsloth fine-tune where DiffusionGemma solves Sudoku. Autoregressive models struggle with that task because each entry depends on later entries.
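A tiny example shows why fill order matters for a puzzle like Sudoku. The sketch below is plain constraint propagation on a made-up 4x4 grid, not a language model and unrelated to the Unsloth fine-tune; it only illustrates that the first cell in reading order can be ambiguous until cells that come later have been filled.

```python
PUZZLE = [
    [0, 0, 3, 0],
    [3, 0, 0, 2],
    [2, 0, 0, 3],
    [0, 3, 0, 1],
]  # made-up puzzle; 0 marks an empty cell; 4x4 grid with 2x2 boxes


def candidates(grid, r, c):
    """Digits that do not clash with the row, column, or 2x2 box of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)
    used |= {grid[i][j] for i in range(br, br + 2) for j in range(bc, bc + 2)}
    return {1, 2, 3, 4} - used


def solve_any_order(puzzle):
    """Each pass fills every empty cell that has exactly one candidate left."""
    grid = [row[:] for row in puzzle]
    passes = 0
    while any(0 in row for row in grid):
        forced = []
        for r in range(4):
            for c in range(4):
                if grid[r][c] == 0:
                    options = candidates(grid, r, c)
                    if len(options) == 1:
                        forced.append((r, c, options.pop()))
        if not forced:
            raise ValueError("stuck: this toy solver needs a forced cell")
        for r, c, digit in forced:
            grid[r][c] = digit
        passes += 1
    return grid, passes


print("options for the first cell at the start:", sorted(candidates(PUZZLE, 0, 0)))
solved, passes = solve_any_order(PUZZLE)
print(f"solved in {passes} passes:")
for row in solved:
    print(row)
```

A strict left-to-right filler would have to guess the top-left cell, while an any-order filler commits the forced cells first and returns to the top-left cell once it is determined.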

### Open weights with broad tool support from day one

The model weights are on Hugging Face under an Apache 2.0 license. DiffusionGemma works out of the box with common inference libraries like Hugging Face Transformers, vLLM (with Red Hat integration support), and MLX. For fine-tuning, Google points to its own JAX toolkit Hackable Diffusion, along with Unsloth and the Nvidia NeMo Framework. Support for llama.cpp is planned.

Nvidia quantized the model for the RTX 5090 and 4090 and optimized it for Hopper and Blackwell server architectures, including DGX Spark and DGX Station for local deskside setups. The model is also available through the Gemini Enterprise Agent Platform Model Garden and Nvidia NIM.

Google also published a DiffusionGemma developer guide, and Maarten Grootendorst put together a visual guide explaining how the model works.

### Gemini Diffusion laid the groundwork

Google DeepMind had already shown an early experimental demo of a text-based diffusion model with Gemini Diffusion. At the time, DeepMind cited speeds of 1,479 tokens per second. In benchmarks, Gemini Diffusion performed roughly on par with Gemini 2.0 Flash-Lite.

The startup Inception is pursuing the same parallel diffusion approach. Its Mercury 2 shipped in early 2026 as what the company calls the first diffusion-based reasoning model.

