I tried Google's new DiffusionGemma, and watching it generate text like an image is unlike any local LLM

Most local LLMs feel predictable now. You download a model, point a runtime at it, ask a question, and then watch text move across the screen one token at a time. The model might be better or worse than the one you used yesterday, but the basic experience is usually the same.

DiffusionGemma is different, at least when you run it in its visual mode. Google’s new experimental Gemma model doesn’t just type its answer from left to right. Instead, it works on a block of text at once, gradually replacing and refining tokens until the answer settles into place. The effect is similar to watching an image generator denoise a picture, which is what the process of “diffusion” refers to. It’s a very different experience compared to your typical LLM generating token by token.

I tried it on an M4 Pro MacBook Pro using the 4-bit GGUF through the custom llama.cpp fork detailed by Unsloth. It didn’t feel faster to me than running Google’s regular Gemma 4 26B-A4B model, and it also hammered my Mac in a way LLMs typically don’t, causing a full system-wide slowdown. Still, it’s a strange experience, and a uniquely exciting one given how different it is from a typical autoregressive language model.

DiffusionGemma changes how text generation looks

Visual mode shows you exactly what’s happening

DiffusionGemma feels odd because the output doesn’t arrive like normal text. With the visual mode enabled, you can watch a 256-token canvas get rewritten as the model works, with placeholder-looking text appearing first before pieces of it change and the answer gradually becomes more coherent. It isn’t just a stream of words appearing at the end of the previous word, and that alone makes it feel like a different category of local model.

It sounds gimmicky, and in a sense it is. You don’t need to watch generation happen in order for a model to be useful, and plenty of local LLM interfaces are better precisely because they hide the messy parts. But in this case, the visualization does a good job of explaining what makes DiffusionGemma different. You can read about text diffusion as much as you want, but seeing the text repeatedly change in place makes the concept much easier to understand.

A normal autoregressive model has to commit to the next token, then the one after that, then the one after that. It can plan in a loose sense, and good models clearly do, but the token it writes now cannot directly condition on the exact token it will write 50 tokens later because that token doesn’t exist yet. DiffusionGemma instead works over a block, with bidirectional attention inside that canvas. It can use later parts of the block to refine earlier parts, which is why the output can look like it’s coming into focus rather than being typed.
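To make the difference concrete, here is a minimal sketch in plain Python that draws the two attention patterns as grids. It is an illustration, not DiffusionGemma’s actual implementation: the function names are mine, the canvas is shrunk from 256 positions to 3, and the idea that a canvas can also see every earlier, already committed block is my assumption.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Autoregressive case: a position only sees itself and earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block):
    """Block case: a position sees its whole block, including later positions.

    Assumption for this sketch: earlier blocks are already committed and
    fully visible, while later blocks are not visible at all.
    """
    return [[(j // block) <= (i // block) for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text, with 'x' for visible and '.' for hidden."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(show(causal_mask(6)))
    print("block-bidirectional (block=3):")
    print(show(block_bidirectional_mask(6, block=3)))
```

In the causal grid, the first row can only see itself. In the block grid, the first row sees the whole first block, which is the property that lets later tokens in a canvas shape earlier ones.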

That’s the conceptual benefit of diffusion-based language models, even before getting into speed. A 256-token canvas gives the model a temporary draft space where the beginning and end of a block can influence each other before the block is committed. That concept is why diffusion is so interesting for things like inline editing, code infilling, structured text, and other cases where the best answer isn’t always easiest to produce from left to right.

That’s also why DiffusionGemma feels so different compared to the local models people normally use. We’re all used to seeing Qwen, Gemma, Llama, or whatever else stream text in a way that makes the model feel like it’s writing. DiffusionGemma in visual mode feels more like it’s editing a draft in front of you, except you can see every weird intermediate state on the way there.
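If you want a feel for those intermediate states without running the model, the toy loop below mimics the general shape of the process. It is a sketch under heavy assumptions, not how DiffusionGemma works internally: the "model" already knows the final text, the confidence scores are random numbers, and it only fills in placeholders, whereas the real visual mode also rewrites tokens that are already on screen. The names `toy_denoise` and `MASK` are hypothetical.

```python
import random

MASK = "_"  # placeholder for a position that has not been filled in yet


def toy_denoise(target, steps, seed=0):
    """Reveal `target` over at most `steps` passes and return every state.

    A real denoiser predicts tokens and confidences for the whole canvas.
    Here the prediction is simply the known target and the confidence of
    each position is a random number, so only the overall shape is realistic.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = ["".join(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Rank the still-masked positions by a made-up confidence score.
        ranked = sorted(masked, key=lambda i: rng.random(), reverse=True)
        # Commit an even share of what is left on each remaining pass.
        remaining_steps = steps - step
        quota = -(-len(masked) // remaining_steps)  # ceiling division
        for i in ranked[:quota]:
            canvas[i] = target[i]
        history.append("".join(canvas))
    return history


if __name__ == "__main__":
    for state in toy_denoise("hello world", steps=4):
        print(state)
```

Each printed line is one pass over the whole canvas, which is the part that differs from a model typing one token after another.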

Google’s speed claims need context

Especially if you’re running it on a Mac

Google’s pitch for DiffusionGemma is speed. In its launch post, Google says the model can deliver up to 4x faster text generation on dedicated GPUs, with 1,000+ tokens per second on a single Nvidia H100 and 700+ tokens per second on an RTX 5090. It also says the quantized model can fit within 18GB of VRAM on high-end consumer GPUs.

My M4 Pro run did not look anything like that. I didn’t get a normal tokens-per-second readout, but the footer I captured reported 137.9 seconds total, 123 denoising steps, and 9 blocks, which works out to 1.121 seconds per step. Since each block is a 256-token canvas, that also works out to 2,304 canvas positions across 123 steps, or about 18.7 token positions per denoising step.

That per-step figure is the context I think people have been missing. Google has talked about parallel denoising and generating 15 to 20 tokens per forward pass, so a headline figure like 700 tokens per second shouldn’t be read the same way as 700 left-to-right autoregressive tokens appearing cleanly on screen. I don’t think Google is simply counting discarded guesses as finished output, but DiffusionGemma reaches its output differently: it refines many token positions inside a canvas before committing the block. The speed can be real, but the experience isn’t the same as watching a normal model stream 700 final tokens every second.
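For anyone who wants to check the arithmetic from my footer, here it is as a few lines of Python. The inputs are the numbers I captured. The last figure, canvas positions per second, is my own derived number and assumes every canvas position ended up as an output token, which the footer doesn’t confirm, so treat it as a ceiling rather than a measured throughput.

```python
# Figures reported by the run footer.
TOTAL_SECONDS = 137.9
DENOISING_STEPS = 123
BLOCKS = 9
CANVAS_TOKENS = 256  # positions per block


def footer_stats(total_seconds, steps, blocks, canvas_tokens):
    """Derive per-step and per-second figures from the footer values."""
    positions = blocks * canvas_tokens
    return {
        "seconds_per_step": total_seconds / steps,
        "canvas_positions": positions,
        "positions_per_step": positions / steps,
        # Assumption: treats every canvas position as a finished token.
        "positions_per_second": positions / total_seconds,
    }


if __name__ == "__main__":
    stats = footer_stats(TOTAL_SECONDS, DENOISING_STEPS, BLOCKS, CANVAS_TOKENS)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

The per-step result lands inside the 15 to 20 range Google describes, even though the wall-clock pace on this Mac was nowhere near the headline GPU numbers.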

The hardware matters too. My Mac slowed down system-wide while it was running, and it didn’t feel faster than running Google’s regular Gemma 4 26B-A4B locally. Google warns that Apple Silicon Macs may not see the same acceleration because unified-memory systems are often memory-bandwidth-bound during inference, while DiffusionGemma’s speedup relies on giving a dedicated accelerator a larger compute-heavy workload.

That doesn’t make the speed claim wrong, it just means the interesting part of my run was never going to be raw throughput. The interesting part was seeing the model use a visibly different generation process, and seeing how much that changed the feel of interacting with a local LLM.

Running it locally is still early and a bit awkward

This isn’t normal llama.cpp support yet

The route I used to get this up and running was the Unsloth GGUF path, which depends on the DiffusionGemma branch from an open llama.cpp pull request. Unsloth’s instructions build a dedicated llama-diffusion-cli runner, because the standard llama-cli or llama-server path can’t generate from the model yet. You can see what the generation looks like in the above video.

That distinction matters if you’re used to Ollama or llama.cpp being the easy local LLM defaults. This isn’t the kind of model you casually pull into your existing setup and treat like another GGUF. It needs the right branch, the right runner, and the --diffusion-visual flag if you want the part that makes it visually interesting. The command to run it with visual output, once compiled, is:

./llama-diffusion-cli -m ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 4096 --diffusion-visual
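If you’d rather toggle the visual mode from a script, a small wrapper like the one below can assemble the same command. This is a sketch, not part of Unsloth’s instructions: `build_command` and its parameter names are hypothetical, it only uses the flags from the command above, and it assumes the binary sits in the current directory.

```python
import shlex


def build_command(model_path, visual=True, gpu_layers=99, max_tokens=4096,
                  binary="./llama-diffusion-cli"):
    """Assemble the argument list for the runner, using only the flags above."""
    args = [binary, "-m", model_path,
            "-ngl", str(gpu_layers),
            "-cnv",
            "-n", str(max_tokens)]
    if visual:
        # The flag that turns on the in-place canvas view.
        args.append("--diffusion-visual")
    return args


if __name__ == "__main__":
    cmd = build_command("./diffusiongemma-26B-A4B-it-Q4_K_M.gguf")
    print(shlex.join(cmd))
    # To actually launch it: import subprocess; subprocess.run(cmd, check=True)
```

Passing `visual=False` drops the last flag and leaves the rest of the command unchanged.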

The quantized files are at least realistic for consumer hardware. Unsloth lists a 16GB Q4_K_M file as the smallest option, with larger 18GB, 21GB, 25GB, and 47GB variants above it. That puts the model in the same general world as other large local models you might run on a GPU with a decent amount of VRAM.
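As a rough way to reason about which file to download, the sketch below picks the largest listed size that fits a memory budget. The file sizes are the ones Unsloth lists. The 4GB headroom default is purely my assumption for context and runtime overhead, not a measured requirement, and `largest_fitting` is a hypothetical helper.

```python
# File sizes in GB, as listed by Unsloth.
FILE_SIZES_GB = [16, 18, 21, 25, 47]


def largest_fitting(sizes_gb, memory_gb, headroom_gb=4.0):
    """Return the largest file size that fits, or None if nothing does.

    headroom_gb is an assumed allowance for everything besides the weights.
    """
    budget = memory_gb - headroom_gb
    fitting = [size for size in sizes_gb if size <= budget]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for memory in (16, 24, 32, 64):
        print(memory, "GB ->", largest_fitting(FILE_SIZES_GB, memory))
```

File size is only a lower bound on what the model needs at runtime, so a result here means "worth trying", not "guaranteed to run well".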

It’s still an experimental setup, though. The model support, runner, and visual output are all part of the point right now, not rough edges around an otherwise boring daily-driver model. If you’ve been hearing about diffusion models for a while and want to check one out yourself, that’s the appeal.

DiffusionGemma is not a straight upgrade over Gemma 4

Google says quality is the trade-off

The name makes DiffusionGemma sound like another member of the Gemma family, which it is, but the model has a very different goal. Google describes it as an experimental open model based on the Gemma 4 26B A4B Mixture of Experts architecture, with roughly 26B total parameters and about 4B active parameters. The unusual part is the diffusion head and block-based generation, not the basic idea of a local MoE model.

Google is very clear that its standard autoregressive Gemma 4 models remain the recommendation for maximum output quality. DiffusionGemma prioritizes speed and parallel layout generation, and the published benchmark table generally shows it trailing the standard Gemma 4 26B A4B model across reasoning, coding, vision, and long-context tests.

The concrete test I ran was at least functional. I asked it to make a Flappy Bird-style game in Python, rendered in a browser and served with Flask, and the generated project worked when I tested it. The gravity was far too strong, so it wasn’t exactly pleasant to play, but it did produce the Flask app, HTML, CSS, and JavaScript needed to get a working browser game on screen. You can see it running in the video above, as I copied and pasted the code from the output into the designated files. You can also read through the full output at this Gist link.

It doesn’t really matter what I asked the model to generate. Flappy Bird was just one of countless prompts I could have used, and the takeaway would have been the same: the model did something normal while the generation process looked anything but. DiffusionGemma isn’t interesting because it suddenly makes local coding better; it’s interesting because it exposes a different way of building text, one where the output is drafted, refined, and committed in blocks instead of streamed one token at a time. The experience I had here lines up with Google’s examples around inline editing, non-linear text structures, code infilling, and other workflows where left-to-right generation isn’t always the most natural shape. I wouldn’t claim DiffusionGemma is ready to transform those workflows from my brief run, but the concept makes sense once you see it happening.

DiffusionGemma is experimental, early, and not something I’d compare to a conventional local LLM. Watching an answer denoise into place is strange, slightly distracting, and genuinely useful for understanding what Google is trying to do, while also making it easier than ever for anyone to understand what a diffusion model actually looks like in practice.
