Aiman Ismail

Gemma 4 Quant Showdown: All Sizes, Every Format

This is an experiment — raw data and observations, not a polished write-up.

Tags: mlx, apple-silicon, inference, benchmarking, gemma, quantization, m3-ultra

How do all of Gemma 4's quantization formats compare across all four model sizes on Apple Silicon? Running 33 variants — every instruction-tuned quant in the mlx-community namespace — on a single M3 Ultra.

Code: pokgak/mlx-bench

The Question

Gemma 4 ships with an unusually wide range of quantization options: standard integer quants (4–8 bit) alongside the bf16 reference, MX-spec formats (mxfp4, mxfp8), NVIDIA-style FP4 (nvfp4), and calibration-optimized OptiQ. Across four model sizes — two small MoE models (e2b, e4b), a large sparse MoE (26b-a4b), and a dense 31b — does the format choice matter? And does the architecture change which format wins?

Hardware & Setup

  • Machine: Mac Studio (2025)
  • Chip: Apple M3 Ultra
  • Memory: 512 GB unified, 819 GB/s theoretical bandwidth
  • Framework: mlx-lm 0.31.3 (git HEAD), MLX 0.31.1
  • Benchmark: 3 prompt lengths (128 / 512 / 1024 tokens), 256 generation tokens, 1 warmup + 3 runs, median reported
  • Metrics: tokens/sec (generation), TTFT (ms), peak memory (GB) — see the measurement sketch below
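For reference, this is roughly what the measurement loop looks like. It's a minimal sketch, not the actual pokgak/mlx-bench code: it assumes mlx-lm's `load`/`stream_generate` Python API (signatures can shift between releases), and the repo name in the usage comment is inferred from the mlx-community naming used here.

```python
# Minimal sketch of the measurement loop. Not the actual pokgak/mlx-bench code.
import statistics
import time

from mlx_lm import load, stream_generate

GEN_TOKENS = 256
RUNS = 3  # measured runs; one extra warmup run is discarded


def bench(repo: str, prompt: str) -> dict:
    """Median decode throughput and TTFT for one model/prompt pair."""
    model, tokenizer = load(repo)
    decode_tps, ttft_ms = [], []
    for run in range(RUNS + 1):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        # stream_generate yields one response per generated token
        for _ in stream_generate(model, tokenizer, prompt, max_tokens=GEN_TOKENS):
            if first_token_at is None:
                first_token_at = time.perf_counter()  # end of prefill
            n_tokens += 1
        end = time.perf_counter()
        if run == 0:
            continue  # warmup run, discard
        ttft_ms.append((first_token_at - start) * 1000)
        decode_tps.append((n_tokens - 1) / (end - first_token_at))
    # Peak memory is read separately from MLX's memory counters
    # (mx.get_peak_memory() in recent MLX versions).
    return {
        "tok_s": statistics.median(decode_tps),
        "ttft_ms": statistics.median(ttft_ms),
    }


# e.g. bench("mlx-community/gemma-4-e2b-it-4bit", prompt_with_128_tokens)
```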

Model Families

Each Gemma 4 size targets a different point on the speed/quality curve:

| Family | Architecture | Active params | What it's for |
|---|---|---|---|
| e2b | MoE | ~2B | Max speed, edge deployment |
| e4b | MoE | ~4B | Speed + slightly more capability |
| 26b-a4b | MoE | ~4B active / 26B total | Best quality-per-compute — large knowledge, cheap inference |
| 31b | Dense | 31B | Raw quality, all params active every token |

Quantization Formats

| Format | Goal | Trade-off |
|---|---|---|
| 4bit | Speed + memory efficiency | Aggressive compression, some quality loss |
| 5bit | Speed/quality balance | Rarely the sweet spot |
| 6bit | Quality-leaning balance | Often best tok/s per quality point |
| 8bit | Near-lossless compression | ~same quality as bf16, half the memory |
| bf16 | Reference quality (training dtype) | Slowest, largest — the baseline |
| mxfp4 | Hardware-aligned 4-bit (MX spec) | Block-level FP4 scaling; targets future accelerators |
| mxfp8 | Hardware-aligned 8-bit (MX spec) | Better than INT8 for non-uniform distributions |
| nvfp4 | NVIDIA Blackwell FP4 format | Different encoding from mxfp4, same idea |
| OptiQ-4bit | Calibration-optimized 4bit | Minimizes quant error via sample data; trades speed for quality-at-4bit |
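The integer quants in that table are MLX's standard group quantization. As a rough illustration only: the benchmark used the pre-quantized mlx-community uploads as-is, the upstream repo name below is hypothetical, and parameter names may differ across mlx-lm versions. A 4-bit variant could be produced locally along these lines:

```python
# Hedged sketch: producing a standard group-quantized 4-bit variant with mlx-lm.
# The MX/NV/OptiQ formats come from their own conversion pipelines.
from mlx_lm.convert import convert

convert(
    hf_path="google/gemma-4-e2b-it",   # hypothetical upstream repo name
    mlx_path="gemma-4-e2b-it-4bit",
    quantize=True,
    q_bits=4,          # the 4/5/6/8-bit rows in the table above
    q_group_size=64,   # weights share a scale per group; smaller groups cost more memory
)
```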

Results

gemma-4-e2b (MoE, ~2B active)

| Quant | tok/s (128→1024 tok prompt) | Peak mem | Notes |
|---|---|---|---|
| 4bit | 134–140 | 2.6–3.2 GB | Fastest in the entire benchmark |
| 5bit | 125–130 | 3.1–3.7 GB | |
| mxfp4 | 115–119 | 3.2–3.7 GB | Same speed as 6bit, hardware-aligned |
| 6bit | 115–119 | 3.6–4.2 GB | |
| mxfp8 | 106–110 | 4.6–5.1 GB | Tracks 8bit closely |
| 8bit | 108–112 | 4.7–5.3 GB | |
| bf16 | 79–81 | 8.8–9.1 GB | 43% slower than 4bit |
| nvfp4 | n/a | n/a | Failed: model files missing on HF |

gemma-4-e4b (MoE, ~4B active)

| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| 4bit | 94–100 | 4.1–4.6 GB | |
| OptiQ-4bit | 86–92 | 6.0–6.5 GB | Slower and heavier than plain 4bit |
| 5bit | 86–90 | 5.0–5.5 GB | |
| 6bit | 78–81 | 5.8–6.3 GB | |
| mxfp4 | 76–79 | 5.5–6.0 GB | |
| nvfp4 | 76–79 | 5.6–6.0 GB | Nearly identical to mxfp4 |
| mxfp8 | 70–73 | 7.3–7.8 GB | |
| 8bit | 71–74 | 7.5–8.0 GB | |
| bf16 | 48–50 | 14.1–14.4 GB | 2× slower than 4bit |

gemma-4-26b-a4b (MoE, 26B total / ~4B active)

| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| 4bit | 92–97 | 13.6–14.0 GB | Matches e4b-4bit speed — MoE efficiency |
| 5bit | 84–88 | 16.4–16.8 GB | |
| 6bit | 79–83 | 19.3–19.7 GB | |
| 8bit | 74–77 | 25.0–25.4 GB | |
| bf16 | 57–59 | 47.1–47.4 GB | 38% slower than 4bit at 3.5× the memory |
| mxfp4 | 92–97 | 12.8–13.3 GB | Ties 4bit speed, saves ~1 GB — best mxfp4 result in the benchmark |
| mxfp8 | 73–76 | 24.3–24.7 GB | Tracks 8bit, saves ~0.7 GB |
| nvfp4 | 89–94 | 13.6–13.9 GB | Slightly behind mxfp4, at 4bit's memory footprint |

gemma-4-31b (Dense, 31B)

| Quant | tok/s | Peak mem | Notes |
|---|---|---|---|
| mxfp4 | 28–31 | 15.3–15.8 GB | Beats 4bit speed, saves ~1 GB — consistent mxfp4 win on larger models |
| 4bit | 27–30 | 16.2–16.7 GB | Dense tax: 3× slower than 26b-a4b-4bit |
| 5bit | 23–25 | 19.8–20.2 GB | |
| 6bit | 21–22 | 23.3–23.8 GB | |
| mxfp8 | 17–18 | 29.6–30.0 GB | Matches 8bit speed, saves ~0.8 GB |
| 8bit | 17–18 | 30.5–30.9 GB | Slower than e4b-bf16 — dense tax at full scale |
| nvfp4 | 27–30 | 16.2–16.6 GB | Matches 4bit speed at the same memory — no advantage |
| bf16 | 9.8–10.2 | 57.2–57.5 GB | Slowest in the entire benchmark — 14× slower than e2b-4bit |

Cross-Family Comparison

The fastest format in each family, with its peak memory and a runner-up:

| Family | Best format | tok/s | Peak mem | Runner-up |
|---|---|---|---|---|
| e2b | 4bit | 134–140 | 2.6–3.2 GB | mxfp4 at 115–119 tok/s, +0.6 GB |
| e4b | 4bit | 94–100 | 4.1–4.6 GB | 5bit at 86–90 tok/s |
| 26b-a4b | mxfp4 | 92–97 | 12.8–13.3 GB | 4bit tied at 92–97 tok/s, +0.8 GB |
| 31b | mxfp4 | 28–31 | 15.3–15.8 GB | 4bit at 27–30 tok/s, +1 GB |

The 26b-a4b MoE model matches e4b on speed (92–97 vs 94–100 tok/s) while packing 26B parameters — you get dramatically more model capacity at essentially the same inference cost. That's the best performance-per-compute in the whole benchmark.


Observations

MoE speed is real. 26b-a4b-it-4bit hits 92–97 tok/s — nearly matching e4b-it-4bit (94–100 tok/s). You get 26B parameters of knowledge for roughly the same inference cost as a 4B dense model. That's the MoE promise delivered.

mxfp4 wins on larger models. For e2b it's roughly equivalent to 6bit (115–119 tok/s each). But on 26b-a4b and 31b, mxfp4 pulls ahead: it ties or beats 4bit on speed while using ~1 GB less memory. The compression efficiency of block-scaled FP4 pays off as model size grows.
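For intuition on why the block-scaled format compresses well, here is a simplified sketch of the MX idea: every block of 32 weights shares one power-of-two scale, and each element is snapped to a 4-bit E2M1 value. This is an illustration only, not the exact OCP MX rounding rules and not what MLX's kernels do internally.

```python
# Simplified illustration of block-scaled FP4 (the MX idea). Not the exact OCP
# MX spec, and not MLX's kernel implementation.
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1, mirrored for the sign bit.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])


def mxfp4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize and dequantize one 32-element block."""
    max_abs = float(np.abs(block).max())
    if max_abs == 0.0:
        return np.zeros_like(block)
    # One shared power-of-two scale per block, chosen so the largest
    # element lands within E2M1's range (max magnitude 6).
    scale = 2.0 ** np.ceil(np.log2(max_abs / 6.0))
    idx = np.abs(block[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale


w = np.random.randn(32).astype(np.float32)
print(np.abs(w - mxfp4_roundtrip(w)).max())  # worst-case error in this block
```

The per-block scale adapts to the local dynamic range with only a few bits of overhead per 32 weights, which is consistent with the slightly smaller peak memory mxfp4 shows on 26b-a4b and 31b above.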

nvfp4 ≈ mxfp4 on 26b-a4b, falls behind on 31b. At 26b-a4b they're essentially tied (92–97 vs 89–94 tok/s). On 31b, nvfp4 (27–30 tok/s) trails mxfp4 (28–31 tok/s) while using slightly more memory (16.2–16.6 vs 15.3–15.8 GB). Neither format has hardware acceleration on Apple Silicon; the difference is purely in encoding efficiency.

OptiQ-4bit is not worth it here. On e4b it's slower (86–92 vs 94–100 tok/s) and uses roughly 50% more memory (6.0 GB vs 4.1 GB) than plain 4bit. The calibration-based quality improvement may matter on evals, but the speed and memory costs are steep.

bf16 penalty scales with model size. e2b-bf16 is 43% slower than e2b-4bit. e4b-bf16 is 51% slower. 26b-a4b-bf16 is 38% slower at 3.5× the memory. 31b-bf16 bottoms out at 9.8–10.2 tok/s — 14× slower than e2b-4bit, the widest gap in the entire benchmark.

TTFT is almost flat across quants. All e2b variants show ~32ms TTFT regardless of quantization. Prefill is fast enough that the quant format barely touches it — the difference is entirely in decode throughput.

The dense tax is brutal. 31b-it-4bit hits only 27–30 tok/s at 16.2 GB — roughly 3× slower than 26b-a4b-it-4bit (92–97 tok/s, 13.6 GB) despite similar total parameter counts. Every weight in the 31b gets loaded every token; the 26b-a4b routes each token through only ~4B active params out of 26B. At every memory budget, the MoE variants win on throughput.
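A back-of-envelope memory-bandwidth roofline makes that gap plausible. The sketch below assumes the nominal active-parameter counts, 4 bits per weight, and the 819 GB/s theoretical bandwidth from the setup section; it's an upper bound only, and real decode sits well below it because of attention, KV-cache traffic, and kernel overhead.

```python
# Decode is roughly bounded by how fast the active weights stream from memory.
# Upper bound only; real throughput sits well below it.
BANDWIDTH_GB_S = 819  # M3 Ultra theoretical memory bandwidth


def roofline_tok_s(active_params_billion: float, bits_per_weight: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return BANDWIDTH_GB_S * 1e9 / bytes_per_token


print(roofline_tok_s(31, 4))  # dense 31b @ 4bit:  ~53 tok/s ceiling (measured: 27-30)
print(roofline_tok_s(4, 4))   # 26b-a4b @ 4bit:   ~410 tok/s ceiling (measured: 92-97)
```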


Quality: BFCL Tool-Calling Evaluation

Speed tells half the story. To measure tool-calling quality, I ran the Berkeley Function Calling Leaderboard v4 on the top 5 configurations — picking the fastest quant per family plus the 26b-a4b-8bit as a quality reference point.

Models evaluated: e2b-it-4bit, e4b-it-4bit, 26b-a4b-it-mxfp4, 26b-a4b-it-8bit, 31b-it-4bit
Categories: simple (single function), multiple (pick from candidates), parallel (emit multiple calls)
Samples: 50 per category, Python subset
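For context on how the numbers below are counted, here is an illustrative scoring helper (not the actual BFCL harness code): a prediction counts toward name accuracy when the right function is selected, and toward the headline exact-match score only when every argument also matches. For the parallel category, exact match requires every emitted call to pass. The function signature and example call are hypothetical.

```python
# Illustrative only: not the BFCL harness code.
def score_call(expected: dict, predicted: dict) -> tuple[bool, bool]:
    """Return (name_correct, exact_match) for one predicted function call."""
    name_ok = predicted.get("name") == expected["name"]
    exact_ok = name_ok and predicted.get("arguments") == expected["arguments"]
    return name_ok, exact_ok


# A call with the right function but a missing argument counts for name
# accuracy, not exact match:
score_call(
    {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "C"}},
    {"name": "get_weather", "arguments": {"city": "Berlin"}},
)  # -> (True, False)
```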

Results

| Model | Simple | Multiple | Parallel | Avg |
|---|---|---|---|---|
| e2b-it-4bit | 68% | 36% | 18% | 40.7% |
| e4b-it-4bit | 66% | 36% | 18% | 40.0% |
| 26b-a4b-it-mxfp4 | 70% | 36% | 14% | 40.0% |
| 26b-a4b-it-8bit | 66% | 36% | 16% | 39.3% |
| 31b-it-4bit | 68% | 38% | 18% | 41.3% |

Name accuracy (correct function selected, args may differ) was near-perfect across the board: 88–100% on simple, 92–98% on multiple, 92–98% on parallel.

Observations

Quality scaling is surprisingly flat. Going from e2b (2B active) to 31b (31B dense) leaves simple accuracy unchanged at 68%, and the spread between best and worst across all five models is only a few points in every category.

Name accuracy is the easy part. Every model knows which function to call — 88–100% name accuracy across all categories. The ceiling is argument precision: getting every required argument correct in one shot.

31b-4bit is the best overall. Only model to break 36% on multiple (38%), and the only one with 96% name accuracy on both multiple and parallel. The extra capacity pays off in the details, not the headline number.

26b-a4b-mxfp4 is the best speed/quality trade-off. It leads on simple (70%), runs fastest (1.75s/sample vs 4–9s for 31b), and uses 13 GB vs 17 GB. If you're running tool-calling agents continuously on Apple Silicon, this is the pick.

Parallel tool calls are hard for all models. 14–18% exact match despite 92–98% name accuracy. Getting all arguments right across multiple simultaneous calls in one response is the real ceiling — no model comes close to solving it.


Failures

| Model | Error |
|---|---|
| gemma-4-e2b-it-nvfp4 | FileNotFoundError — model files not yet available on HuggingFace |