DiffusionGemma attacks the slowest LLM habit: one token at a time | Radar

Google DeepMind has introduced DiffusionGemma, an experimental open text generation model. According to the model page, it can generate output 4x to 5x faster on NVIDIA GPUs and exceed 1,000 tokens per second on a single H100. The model builds on Gemma 4 and Gemini Diffusion research, and instead of the usual token-by-token generation it produces larger blocks of text in parallel.

DiffusionGemma turns decoding from a queue into parallel text revision

Google describes DiffusionGemma as a non-sequential transformer. Rather than appending the next token after the last one, the model generates full paragraphs and iteratively refines them. The page says it can generate 256 tokens in parallel in each forward pass, allowing every token to attend to other parts of the emerging text.
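To make the contrast concrete, here is a toy sketch of the two decoding shapes. It is not DiffusionGemma's actual algorithm. Only the block size of 256 comes from the model page. The number of refinement passes per block (`refine_steps`) is an illustrative assumption, and `toy_refine` is a hypothetical stand-in that reveals a known target text instead of calling a real denoising model.

```python
import math
import random

MASK = "_"


def autoregressive_passes(num_tokens: int) -> int:
    """Sequential decoding: one forward pass per generated token."""
    return num_tokens


def block_parallel_passes(num_tokens: int, block_size: int = 256,
                          refine_steps: int = 8) -> int:
    """Block decoding: each block costs `refine_steps` forward passes.

    block_size=256 is from the model page; refine_steps=8 is an assumption.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refine_steps


def toy_refine(target: list[str], steps: int, seed: int = 0) -> list[list[str]]:
    """Start from a fully masked draft and fill in positions over `steps` passes.

    A real model would predict and revise tokens; this toy only reveals
    positions of a known target, in random order, to show the shape of the loop.
    """
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)
    history = [draft.copy()]
    for step in range(steps):
        remaining_steps = steps - step
        # Spread the still-hidden positions evenly over the passes left.
        reveal = math.ceil(len(hidden) / remaining_steps)
        for _ in range(reveal):
            pos = hidden.pop()
            draft[pos] = target[pos]
        history.append(draft.copy())
    return history


if __name__ == "__main__":
    print(autoregressive_passes(1024))   # 1024 passes
    print(block_parallel_passes(1024))   # 4 blocks * 8 passes = 32
    for draft in toy_refine("the whole paragraph appears at once".split(), steps=3):
        print(" ".join(draft))
```

The pass counts are not a speed prediction: a pass over 256 positions costs more than a pass that adds one token, so the real gain depends on how well the hardware absorbs the wider pass and on how many refinement steps the model needs.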

The model uses a Mixture of Experts architecture with 26B total parameters and 3.8B active parameters during inference. Google says a quantized version fits within 24 GB of VRAM on an NVIDIA RTX 5090 or 4090. Access is available through Hugging Face, Kaggle and Vertex AI Model Garden.
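The parameter and memory figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes weights dominate memory and ignores activations, caches and runtime overhead. It also takes 1 GB as 10^9 bytes. The bit widths are illustrative assumptions, since the article does not say which quantization Google used.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # active parameters during inference, from the article
VRAM_BUDGET_GB = 24    # VRAM figure quoted in the article


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_per_weight(num_params: float, budget_gb: float) -> float:
    """Largest average bit width whose weights still fit the budget."""
    return budget_gb * 1e9 * 8 / num_params


if __name__ == "__main__":
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
    for bits in (16, 8, 4):  # assumed precisions, not from the article
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: {gb:.0f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")
    print(f"bit-width ceiling: {max_bits_per_weight(TOTAL_PARAMS, VRAM_BUDGET_GB):.2f}")
```

Under these assumptions all 26B weights must be resident even though only 3.8B are active per token, which is why the 24 GB claim depends on quantization rather than on the Mixture of Experts routing.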

Local AI is targeting latency, not just privacy

Local models are often sold on privacy, cost and control over data. DiffusionGemma adds a different product thesis: if generation is fast enough, interactive workflows become viable where slow autoregressive models feel clumsy. Inline editing, code infilling and structured text repair make more sense when the model is not waiting on a long token chain.

For developers, the hardware economics matter too. A model that activates only 3.8B out of 26B parameters promises a compromise between capacity and speed. That does not automatically mean better quality, but it is a more interesting direction than another larger checkpoint.

Speed alone will not solve quality or tooling

The word experimental deserves attention. Diffusion for text has different failure modes than autoregressive decoding. Parallel revision may help global consistency, but developers will need real tests for factuality, code, multilingual behavior and long instructions.

Also, 1,000 tokens per second on an H100 is not the same as pleasant performance on an ordinary laptop. Local adoption will depend on the software stack, quantization, runtime support and behavior on available GPUs, not just the highest number on a marketing page.
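One way to replace the headline number with a local one is to time generation on the hardware you actually have. The harness below is a minimal sketch. The `generate` callable is a hypothetical placeholder for whatever runtime you use, and `fake_generate` only simulates work with a sleep so the example runs without a model.

```python
import time
from statistics import median
from typing import Callable, Sequence


def measure_tokens_per_second(generate: Callable[[str], Sequence[str]],
                              prompt: str, runs: int = 5,
                              warmup: int = 1) -> float:
    """Return the median tokens-per-second rate over several timed runs.

    `generate` is any callable that takes a prompt and returns the generated
    tokens; plug in your own runtime here.
    """
    for _ in range(warmup):
        generate(prompt)  # untimed runs to absorb one-off startup costs
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return median(rates)


def fake_generate(prompt: str) -> list[str]:
    """Hypothetical stand-in for a model call: sleeps, then returns 64 tokens."""
    time.sleep(0.01)
    return ["tok"] * 64


if __name__ == "__main__":
    rate = measure_tokens_per_second(fake_generate, "Rewrite this paragraph.")
    print(f"{rate:.0f} tokens/s")
```

The median is used instead of the mean so that a single slow run, for example one hit by thermal throttling or a background process, does not distort the result.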

Integration into everyday runtimes will decide the impact

The next signals are straightforward: support in Hugging Face tooling, fast runs in local runtimes, comparable evals against Gemma and Qwen models, and editor demos where parallel generation actually changes UX. If this remains a model card and a few benchmarks, the impact will be limited.

If diffusion decoding moves into common developer tools, it could change expectations for local assistants. Not because they become omniscient. Because they may finally stop writing like someone pressing one key at a time.

Lilith's verdict

DiffusionGemma is a runner refusing to queue for every token. If it keeps its balance outside the H100 benchmark stadium, local assistants may feel less like a typewriter and more like an editor looking at the whole paragraph.