Gemma 4 is a SOLID family of models - but harness and runner selection matter more than ever. Here's everything I learned from testing:
• The 31B model is amazing for proper thinking and chat. Agentic use is a mixed bag - see below
• A4B and 31B are good for visual understanding. Passes my personal MirandaBench (pass in a pattern/image, ask to separate into historical elements and reproduce with a nanobanana prompt - see video)
Harness matters a crazy amount:
• Codex - surprisingly - has been the most solid. codex --oss has been a massive boon.
• Claude Code barely works. The system prompt is too thick, and Gemma does not know what to do with interleaved thinking the way anthropic does it.
• OpenCode is worse - worse prompt (for this family), and overall worse toolcalls.
• Pi is pretty good - but adding in extensions will often confuse the model.
Runners matter too:
• LMStudio on Mac, llama.cpp on Windows are tested and working, but still have rought edges. I'd give these models a week or two to stabilize.
Quants:
• Q4 is.. okay. I've needed Q8 or above for any serious data work.
The more I test this model the more I'm sure that this is a solid agentic workhorse, but it's missing the harness and runner combo that would enable that. This is where I'm hoping the OSS community comes to the rescue.
As always, YMMV!