Google DeepMind Releases DiffusionGemma 26B: a Text Diffusion Model That Generates Roughly 4x Faster Than Comparable Autoregressive Gemma 4 Models

What Is DiffusionGemma, and Why It's Not Just Another Open Model

Google DeepMind released DiffusionGemma 26B A4B IT on June 10, 2026 — and it breaks the one rule that has governed large language models for the past five years. It is not autoregressive. It does not generate one token at a time.

Instead, DiffusionGemma uses text diffusion: it generates entire 256-token blocks in parallel, then iteratively refines them through a denoising process. When low-confidence tokens appear, a re-noise step resets and regenerates them. A standard autoregressive decoder has no equivalent mechanism: once it commits a token, that token stays in the output.
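The denoise-and-re-noise loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm: `toy_denoiser`, the confidence threshold, the 70% accuracy figure, and the use of a `<mask>` placeholder as "noise" are all assumptions made for illustration. The point is only the control flow: every masked position is proposed in one parallel pass, confident tokens are kept, and low-confidence tokens are reset and regenerated.

```python
import random

MASK = "<mask>"  # hypothetical stand-in for a "noised" position

def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model: proposes a token and a
    confidence for every masked position in a single parallel pass."""
    proposals = {}
    for i, tok in enumerate(block):
        if tok == MASK:
            correct = rng.random() < 0.7  # toy model is right 70% of the time
            token = target[i] if correct else "???"
            confidence = rng.uniform(0.8, 1.0) if correct else rng.uniform(0.0, 0.5)
            proposals[i] = (token, confidence)
    return proposals

def generate_block(target, threshold=0.6, max_steps=50, seed=0):
    rng = random.Random(seed)
    block = [MASK] * len(target)  # start from a fully noised block
    for step in range(1, max_steps + 1):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:  # nothing left to denoise
            return block, step - 1
        for i, (token, confidence) in proposals.items():
            # Keep confident tokens; low-confidence ones are re-noised
            # (reset to MASK) and regenerated on the next pass.
            block[i] = token if confidence >= threshold else MASK
    return block, max_steps

block, steps = generate_block("the quick brown fox jumps".split())
print(block, "after", steps, "denoising passes")
```

The number of passes depends on how many positions fall below the threshold, not on the length of the block, which is where the parallelism comes from in this toy setting.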

The result is more than 1,000 tokens per second on a single H100 (FP8), peaking above 1,100 tps at low batch sizes, roughly 4x faster than comparable Gemma 4 models. The model is a 25.2B-parameter Mixture-of-Experts architecture with only 3.8B active parameters, which keeps the compute per forward pass low. Unsloth reports an 18GB memory requirement, which fits within the 24GB of an RTX 4090.
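A quick back-of-the-envelope calculation helps put the memory figure in context. The sketch below uses only the parameter counts and the 18GB figure quoted in this article; it counts weights alone (no activations or cache) and takes 1 GB as 1e9 bytes. The bits-per-weight implied by 18GB is an inference from arithmetic, not a number from Google or Unsloth.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, from the article
ACTIVE_PARAMS = 3.8e9   # active parameters, from the article
RAM_BUDGET_GB = 18      # reported memory requirement, from the article

def weight_gb(params, bits_per_param):
    """Size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_gb(TOTAL_PARAMS, bits):5.1f} GB")

# Average bits per weight if all 25.2B parameters had to fit in 18 GB.
implied_bits = RAM_BUDGET_GB * 1e9 * 8 / TOTAL_PARAMS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"implied bits per weight at 18 GB: {implied_bits:.2f}")
print(f"active fraction per forward pass: {active_fraction:.1%}")
```

In a Mixture-of-Experts model all expert weights normally have to be resident in memory even though only a fraction is used per forward pass, so the active-parameter count reduces compute rather than footprint. Under that assumption, the 18GB figure would correspond to a quantized build below 8 bits per weight.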

The announcement came via Google DeepMind on X and was covered in detail by Ars Technica and MarkTechPost.

Why Text Diffusion Matters

The Autoregressive Bottleneck

Every major LLM you've used — GPT-4, Claude, Gemini, Llama — generates text left-to-right, one token at a time. Each token depends on every token before it. This is a serial process by definition: you cannot generate token N+1 until token N exists.
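The serial dependency is easy to see in a minimal sketch. Here `toy_next_token` is a made-up stand-in for a model forward pass (real models predict a probability distribution over a vocabulary); what matters is that each call needs the full prefix, so generating N tokens takes N sequential calls.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a model forward pass: the next token
    is a function of the entire prefix generated so far."""
    return (sum(prefix) + len(prefix)) % 50

def autoregressive_generate(prompt, n_new):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        # Token N+1 cannot be computed until token N has been appended.
        tokens.append(toy_next_token(tokens))
        forward_passes += 1
    return tokens, forward_passes

tokens, passes = autoregressive_generate([1, 2, 3], n_new=5)
print(tokens, passes)  # [1, 2, 3, 9, 19, 39, 29, 9] 5
```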

It works. It produces coherent text. But it's slow, and it cannot self-correct mid-stream. If the model writes a malformed Markdown table on token 47, it cannot go back and fix it — the mistake is permanent.

The Diffusion Alternative

Diffusion has dominated image and video generation for years (Stable Diffusion, Imagen, Sora). The idea — start with noise and iteratively clean it up — has proven wildly effective for pixels. But text is discrete, not continuous. You can't "partially" generate the word "banana." This discreteness problem kept diffusion out of text generation for years.

DiffusionGemma solves it: it generates block-level continuous representations, denoises them into coherent text, and uses a re-noise mechanism to revisit low-confidence spans. The architecture is encoder-decoder with bidirectional attention, built on the Gemma 4 26B A4B backbone. It's an experimental open model, not a production Gemini replacement, but it proves the paradigm is viable.

Key Specs

Parameters: 25.2B total, 3.8B active (Mixture-of-Experts)
Architecture: Encoder-decoder, bidirectional attention
Context window: 256K tokens
Modalities: Text, image, and video input; text output
Languages: 35+ supported
Reasoning: Configurable "thinking" mode
Function calling: Native support
License: NVIDIA Open Model Agreement (per NVIDIA NIM model card)
Speed: 1000+ tokens/sec on single H100 (FP8), peaks above 1100 tps

Where It Shines, and Where It Doesn't

Strengths

Real-time Markdown generation. The re-noise self-correction means DiffusionGemma is better at formatting complex Markdown in real time than autoregressive models, per Google's announcement. Tables, code blocks, nested lists — the model can fix structural mistakes without restarting the generation.

Local, low-latency inference. With an 18GB memory requirement (per Unsloth), the model runs on consumer hardware. The design target is single-user GPU workloads such as coding assistants, real-time translation, and interactive writing tools.

Speed-critical workflows. When latency matters more than peak reasoning quality — streaming chat UIs, agent tool-call loops, on-device summarization — the 4x speed advantage changes what's practical.
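To make the speed difference concrete, here is a rough calculation of how long a single response takes at different throughputs. The 1,000 tps figure and the 4x ratio come from this article, and 40 tps is the midpoint of the 30-50 tps range the article cites for local generation; the 500-token response length is an assumption chosen for illustration. Throughput is not the same as time to first token, so treat these as total generation times only.

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Total time to produce n_tokens at a steady throughput."""
    return n_tokens / tokens_per_second

RESPONSE_TOKENS = 500  # assumed response length, for illustration only

rates = {
    "local autoregressive (40 tps)": 40,       # midpoint of the 30-50 tps range
    "4x-slower baseline (250 tps)": 250,       # 1000 tps / 4
    "DiffusionGemma on H100 (1000 tps)": 1000,
}

for label, tps in rates.items():
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, tps):.1f} s")
```

At these rates a 500-token reply drops from 12.5 seconds to 2 seconds to half a second, which is the difference between a user waiting on the model and the model keeping pace with an interactive loop.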

The Tradeoff

As noted on r/Unsloth: "love the speed, but sad it's a quality chunk under the 12b." DiffusionGemma's output quality lands below dense 12B models. Google does not claim otherwise — this is an experimental model optimizing for throughput, not a frontier reasoning engine.

For tasks where correctness is the only metric, stick with autoregressive models. For tasks where responsiveness defines the experience, DiffusionGemma is the first credible alternative.

Who Should Care

Local AI developers: If you run models on your own GPU and feel the pain of 30-50 tps generation, DiffusionGemma in FP8 is a step change.
Real-time application builders: Chat UIs, coding copilots, live transcription — anything where sub-100ms latency matters.
Researchers exploring post-autoregressive architectures: This is the first open-weights text diffusion model with production-usable throughput. The re-noise self-correction loop may point toward faster agent reasoning cycles.
People who format a lot of Markdown: The self-correction is genuinely useful for structured output.

Integrations

NVIDIA NIM: Hosted deployment at build.nvidia.com
Hugging Face: Model weights available for direct download
Unsloth: Fine-tuning guide at unsloth.ai/docs/models/diffusiongemma

The Bigger Picture

DiffusionGemma won't replace GPT-4 or Gemini tomorrow. That's not the point. The point is that for five years, autoregressive decoding has been the only game in town for text. DiffusionGemma proves there's another path — and it's 4x faster, runs locally, and fixes its own mistakes along the way.

The first open-weights diffusion model that actually works at scale is worth paying attention to, even if the quality isn't frontier. The same was true of the first diffusion image models. Give it a year.
