Gemma 4 Quant Showdown: All Sizes, Every Format
This is an experiment — raw data and observations, not a polished write-up.
How do all of Gemma 4's quantization formats compare across all four model sizes on Apple Silicon? To find out, I ran 33 variants — every instruction-tuned quant in the mlx-community namespace — on a single M3 Ultra.
Code: pokgak/mlx-bench
The Question
Gemma 4 ships with an unusually wide range of quantization options: standard integer quants (4–8bit, bf16), MX-spec formats (mxfp4, mxfp8), NVIDIA-style FP4 (nvfp4), and calibration-optimized OptiQ. Across four model sizes — two small MoE models (e2b, e4b), a large sparse MoE (26b-a4b), and a dense 31b — does the format choice matter? And does the architecture change which format wins?
Hardware & Setup
- Machine: Mac Studio (2025)
- Chip: Apple M3 Ultra
- Memory: 512 GB unified, 819 GB/s theoretical bandwidth
- Framework: mlx-lm 0.31.3 (git HEAD), MLX 0.31.1
- Benchmark: 3 prompt lengths (128 / 512 / 1024 tokens), 256 generation tokens, 1 warmup + 3 timed runs, median reported (see the sketch after this list)
- Metric: tokens/sec (generation), TTFT (ms), peak memory (GB)
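For reference, this is roughly what that protocol looks like with mlx-lm's Python API. It's a minimal sketch, not the actual pokgak/mlx-bench harness, and the model repo id is just an example:

```python
# Minimal sketch of the benchmark loop above using mlx-lm's Python API.
# Not the actual pokgak/mlx-bench harness; the repo id below is illustrative.
import statistics
import time

from mlx_lm import load, generate

MODEL = "mlx-community/gemma-4-e2b-it-4bit"  # assumption: example repo id
PROMPT_LENGTHS = [128, 512, 1024]
GEN_TOKENS = 256
WARMUP, RUNS = 1, 3

model, tokenizer = load(MODEL)

for n_prompt in PROMPT_LENGTHS:
    prompt = "hello " * n_prompt  # crude way to get roughly n_prompt tokens
    samples = []
    for i in range(WARMUP + RUNS):
        start = time.perf_counter()
        text = generate(model, tokenizer, prompt=prompt, max_tokens=GEN_TOKENS)
        elapsed = time.perf_counter() - start
        if i >= WARMUP:  # drop the warmup run
            # Wall-clock tok/s (includes prefill); decode-only tok/s is a bit higher.
            samples.append(len(tokenizer.encode(text)) / elapsed)
    print(f"{n_prompt:>5}-token prompt: "
          f"{statistics.median(samples):.1f} tok/s (median of {RUNS} runs)")
```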
Model Families
Each Gemma 4 size targets a different point on the speed/quality curve:
| Family | Architecture | Active params | What it's for |
|---|---|---|---|
| e2b | MoE | ~2B | Max speed, edge deployment |
| e4b | MoE | ~4B | Speed + slightly more capability |
| 26b-a4b | MoE | ~4B active / 26B total | Best quality-per-compute — large knowledge, cheap inference |
| 31b | Dense | 31B | Raw quality, all params active every token |
Quantization Formats
| Format | Goal | Trade-off |
|---|---|---|
| 4bit | Speed + memory efficiency | Aggressive compression, some quality loss |
| 5bit | Speed/quality balance | Rarely the sweet spot |
| 6bit | Quality-leaning balance | Often best tok/s per quality point |
| 8bit | Near-lossless compression | ~same quality as bf16, half the memory |
| bf16 | Reference quality (training dtype) | Slowest, largest — the baseline |
| mxfp4 | Hardware-aligned 4-bit (MX spec) | Block-level FP4 scaling; targets future accelerators |
| mxfp8 | Hardware-aligned 8-bit (MX spec) | Better than INT8 for non-uniform distributions |
| nvfp4 | NVIDIA Blackwell FP4 format | Different encoding from mxfp4, same idea |
| OptiQ-4bit | Calibration-optimized 4bit | Minimizes quant error via sample data; trades speed for quality-at-4bit |
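Rough bits-per-weight arithmetic explains most of the memory column in the results below. This is a back-of-envelope sketch assuming MLX-style affine quantization with group size 64 and an fp16 scale + bias per group (about 0.5 extra bits per weight); the actual mlx-community repos may use different group sizes or leave some layers unquantized.

```python
# Back-of-envelope weight-memory estimate per quant format. Assumes ~0.5 extra
# bits/weight of scale/bias overhead for the integer quants; a sketch only.
def est_weight_gb(total_params: float, bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9  # bytes -> GB

PARAMS_26B = 26e9
for name, bits in [("4bit", 4.5), ("6bit", 6.5), ("8bit", 8.5), ("bf16", 16)]:
    print(f"26b-a4b {name}: ~{est_weight_gb(PARAMS_26B, bits):.1f} GB of weights")
# ~14.6 / ~21.1 / ~27.6 / ~52.0 GB, in the same ballpark as the measured peaks
# below (13.6-14.0 / 19.3-19.7 / 25.0-25.4 / 47.1-47.4 GB). The estimate runs a
# little high because peak memory isn't weights-only and the repos may quantize
# or exclude some layers differently.
```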
Results
gemma-4-e2b (MoE, ~2B active)
| Quant | tok/s (128→1024 tok prompt) | Peak mem | Notes |
|---|---|---|---|
| 4bit | 134–140 | 2.6–3.2 GB | Fastest in entire benchmark |
| 5bit | 125–130 | 3.1–3.7 GB | |
| mxfp4 | 115–119 | 3.2–3.7 GB | Same speed as 6bit, hardware-aligned |
| 6bit | 115–119 | 3.6–4.2 GB | |
| mxfp8 | 106–110 | 4.6–5.1 GB | Tracks 8bit closely |
| 8bit | 108–112 | 4.7–5.3 GB | |
| bf16 | 79–81 | 8.8–9.1 GB | 43% slower than 4bit |
| nvfp4 | — | — | Failed: model files missing on HF |
gemma-4-e4b (MoE, ~4B active)
| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| 4bit | 94–100 | 4.1–4.6 GB | |
| OptiQ-4bit | 86–92 | 6.0–6.5 GB | Slower and heavier than plain 4bit |
| 5bit | 86–90 | 5.0–5.5 GB | |
| 6bit | 78–81 | 5.8–6.3 GB | |
| mxfp4 | 76–79 | 5.5–6.0 GB | |
| nvfp4 | 76–79 | 5.6–6.0 GB | Nearly identical to mxfp4 |
| mxfp8 | 70–73 | 7.3–7.8 GB | |
| 8bit | 71–74 | 7.5–8.0 GB | |
| bf16 | 48–50 | 14.1–14.4 GB | 2× slower than 4bit |
gemma-4-26b-a4b (MoE, 26B total / ~4B active)
| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| 4bit | 92–97 | 13.6–14.0 GB | Matches e4b-4bit speed — MoE efficiency |
| 5bit | 84–88 | 16.4–16.8 GB | |
| 6bit | 79–83 | 19.3–19.7 GB | |
| 8bit | 74–77 | 25.0–25.4 GB | |
| bf16 | 57–59 | 47.1–47.4 GB | 38% slower than 4bit at 3.5× the memory |
| mxfp4 | 92–97 | 12.8–13.3 GB | Ties 4bit speed, saves ~1 GB — best mxfp4 result in benchmark |
| mxfp8 | 73–76 | 24.3–24.7 GB | Tracks 8bit, saves ~0.7 GB |
| nvfp4 | 89–94 | 13.6–13.9 GB | Slightly behind mxfp4 at same memory |
gemma-4-31b (Dense, 31B)
| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| mxfp4 | 28–31 | 15.3–15.8 GB | Beats 4bit speed, saves ~1 GB — consistent mxfp4 win on larger models |
| 4bit | 27–30 | 16.2–16.7 GB | Dense tax: 3× slower than 26b-a4b-4bit |
| 5bit | 23–25 | 19.8–20.2 GB | |
| 6bit | 21–22 | 23.3–23.8 GB | |
| mxfp8 | 17–18 | 29.6–30.0 GB | Matches 8bit speed, saves ~0.8 GB |
| 8bit | 17–18 | 30.5–30.9 GB | Slower than e4b-bf16 — dense tax at full scale |
| nvfp4 | 27–30 | 16.2–16.6 GB | Matches 4bit speed at same memory — no advantage |
| bf16 | 9.8–10.2 | 57.2–57.5 GB | Slowest in the entire benchmark — 14× slower than e2b-4bit |
Cross-Family Comparison
The fastest configuration per family, with its runner-up:
| Family | Best format | tok/s | Peak mem | Runner-up |
|---|---|---|---|---|
| e2b | 4bit | 134–140 | 2.6–3.2 GB | mxfp4 at 115–119 tok/s, +0.5 GB |
| e4b | 4bit | 94–100 | 4.1–4.6 GB | 5bit at 86–90 tok/s |
| 26b-a4b | mxfp4 | 92–97 | 12.8–13.3 GB | 4bit tied at 92–97 tok/s, +0.8 GB |
| 31b | mxfp4 | 28–31 | 15.3–15.8 GB | 4bit at 27–30 tok/s, +1 GB |
The 26b-a4b MoE model matches e4b on speed (92–97 vs 94–100 tok/s) while packing 26B parameters — you get dramatically more model capacity at essentially the same inference cost. That's the best performance-per-compute in the whole benchmark.
Observations
MoE speed is real. 26b-a4b-it-4bit hits 92–97 tok/s — nearly matching e4b-it-4bit (94–100 tok/s). You get 26B parameters of knowledge for roughly the same inference cost as a 4B dense model. That's the MoE promise delivered.
mxfp4 wins on larger models. For e2b it's roughly equivalent to 6bit (115–119 tok/s each). But on 26b-a4b and 31b, mxfp4 pulls ahead: it ties or beats 4bit on speed while using ~1 GB less memory. The compression efficiency of block-scaled FP4 pays off as model size grows.
nvfp4 ≈ mxfp4 on 26b-a4b, falls behind on 31b. At 26b-a4b they're essentially tied (92–97 vs 89–94 tok/s). On 31b, nvfp4 (27–30 tok/s) trails mxfp4 (28–31 tok/s) at the same memory footprint. Neither format has hardware acceleration on Apple Silicon — the difference is purely in encoding efficiency.
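To make "purely in encoding" concrete, here's a toy illustration of block-scaled FP4. It assumes the OCP MX convention of E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) sharing one power-of-two scale per 32-element block, while nvfp4 uses the same E2M1 grid with smaller 16-element blocks and FP8 scales. The scale choice below is a simplification, not the MLX kernel.

```python
# Toy block-scaled FP4 quantizer: one shared power-of-two scale per block,
# each element rounded to the nearest E2M1 value. A sketch, not the MLX kernel.
import math

E2M1_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]           # FP4 magnitudes
GRID = sorted({s * v for v in E2M1_POS for s in (-1.0, 1.0)})  # full signed grid

def quantize_block(block):
    amax = max(abs(x) for x in block) or 1.0
    # Smallest power-of-two scale such that amax/scale fits inside the grid.
    scale = 2.0 ** math.ceil(math.log2(amax / max(E2M1_POS)))
    codes = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes

scale, codes = quantize_block([0.03, -0.11, 0.25, 0.07])
print(scale, codes)                # shared scale + grid values (the 4-bit codes)
print([scale * c for c in codes])  # dequantized approximations
```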
OptiQ-4bit is not worth it here. On e4b it's slower (86–92 vs 94–100 tok/s) and uses ~50% more memory (6.0 GB vs 4.1 GB) than plain 4bit. The calibration-based quality improvement may matter on evals, but the speed and memory costs are steep.
The bf16 penalty is steep at every size and worst at the top end. e2b-bf16 is 43% slower than e2b-4bit; e4b-bf16 is 51% slower; 26b-a4b-bf16 is 38% slower at 3.5× the memory. 31b-bf16 bottoms out at 9.8–10.2 tok/s — 14× slower than e2b-4bit, the widest gap in the entire benchmark.
TTFT is almost flat across quants. All e2b variants show ~32ms TTFT regardless of quantization. Prefill is fast enough that the quant format barely touches it — the difference is entirely in decode throughput.
The dense tax is brutal. 31b-it-4bit hits only 27–30 tok/s at 16.2 GB — roughly 3× slower than 26b-a4b-it-4bit (92–97 tok/s, 13.6 GB) despite similar total parameter counts. Every weight in the 31b gets loaded every token; the 26b-a4b routes each token through only ~4B active params out of 26B. At every memory budget, the MoE variants win on throughput.
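A quick bandwidth back-of-envelope shows where the gap comes from: decode throughput is roughly bounded by how many weight bytes each token has to stream through memory. This is a rough sketch that ignores KV cache, attention, and routing cost, not a model of MLX internals.

```python
# Decode ceiling from memory bandwidth: tok/s <= bandwidth / bytes-per-token.
# Rough sketch; ignores KV cache, attention compute, and MoE routing overhead.
BANDWIDTH_GBS = 819  # M3 Ultra theoretical

def ceiling_tok_s(active_params_billions: float, bits_per_weight: float) -> float:
    gb_per_token = active_params_billions * bits_per_weight / 8  # GB streamed per token
    return BANDWIDTH_GBS / gb_per_token

print(f"31b dense, 4bit:   ~{ceiling_tok_s(31, 4.5):.0f} tok/s ceiling (measured 27-30)")
print(f"26b-a4b MoE, 4bit: ~{ceiling_tok_s(4, 4.5):.0f} tok/s ceiling (measured 92-97)")
# The dense 31b sits at roughly 60% of its ~47 tok/s bandwidth ceiling; the MoE
# model is nowhere near its much higher ceiling, so other costs dominate there,
# but the ~3x throughput gap falls straight out of bytes touched per token.
```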
Quality: BFCL Tool-Calling Evaluation
Speed tells half the story. To measure tool-calling quality, I ran the Berkeley Function Calling Leaderboard v4 on the top 5 configurations — picking the fastest quant per family plus the 26b-a4b-8bit as a quality reference point.
- Models evaluated: e2b-it-4bit, e4b-it-4bit, 26b-a4b-it-mxfp4, 26b-a4b-it-8bit, 31b-it-4bit
- Categories: simple (single function), multiple (pick from candidates), parallel (emit multiple calls)
- Samples: 50 per category, Python subset
Results
| Model | Simple | Multiple | Parallel | Avg |
|---|---|---|---|---|
| e2b-it-4bit | 68% | 36% | 18% | 40.7% |
| e4b-it-4bit | 66% | 36% | 18% | 40.0% |
| 26b-a4b-it-mxfp4 | 70% | 36% | 14% | 40.0% |
| 26b-a4b-it-8bit | 66% | 36% | 16% | 39.3% |
| 31b-it-4bit | 68% | 38% | 18% | 41.3% |
Name accuracy (correct function selected, args may differ) was near-perfect across the board: 88–100% on simple, 92–98% on multiple, 92–98% on parallel.
Observations
Quality scaling is surprisingly flat. Going from e2b (~2B active) to 31b (31B dense) leaves simple accuracy unchanged at 68%, and the spread between the best and worst model is only 2–4 points on every category.
Name accuracy is the easy part. Every model knows which function to call — 88–100% name accuracy across all categories. The ceiling is argument precision: getting every required argument correct in one shot.
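For clarity, here's what that distinction looks like in scoring terms. This is a simplified sketch, not the actual BFCL v4 matcher; the function name and arguments are made up.

```python
# Simplified sketch of name accuracy vs exact match. A "call" here is just a
# dict with a function name and a kwargs dict; not the real BFCL scorer.
def name_accuracy(pred: dict, gold: dict) -> bool:
    # Credit for picking the right function, regardless of arguments.
    return pred["name"] == gold["name"]

def exact_match(pred: dict, gold: dict) -> bool:
    # Credit only if the function AND every argument value are correct.
    return pred["name"] == gold["name"] and pred["args"] == gold["args"]

gold = {"name": "get_weather", "args": {"city": "Berlin", "unit": "celsius"}}
pred = {"name": "get_weather", "args": {"city": "Berlin"}}  # missing "unit"

print(name_accuracy(pred, gold))  # True  -> counts toward the 88-100% figures
print(exact_match(pred, gold))    # False -> this is where models lose points
```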
31b-4bit is the best overall. Only model to break 36% on multiple (38%), and the only one with 96% name accuracy on both multiple and parallel. The extra capacity pays off in the details, not the headline number.
26b-a4b-mxfp4 is the best speed/quality trade-off. It leads on simple (70%), runs fastest (1.75s/sample vs 4–9s for 31b), and uses 13 GB vs 17 GB. If you're running tool-calling agents continuously on Apple Silicon, this is the pick.
Parallel tool calls are hard for all models. 14–18% exact match despite 92–98% name accuracy. Getting all arguments right across multiple simultaneous calls in one response is the real ceiling — no model comes close to solving it.
Failures
| Model | Error |
|---|---|
| gemma-4-e2b-it-nvfp4 | FileNotFoundError — model files not yet available on HuggingFace |