
A Quantization Primer: Formats, Architecture Sensitivity, and a Gemma 4 Case Study

A 70B parameter model in FP16 weighs 140GB. That won’t fit on any consumer GPU. Quantization is how you shrink it to 35GB (Q4) or even 18GB (Q2) — trading precision for the ability to actually run the model. But quantization isn’t just “make numbers smaller.” Different methods work differently on different architectures, and the wrong choice can turn a brilliant model into an incoherent mess.

This post covers the most widely used quantization formats in production today, explains why model architecture determines which quantizations work well, and uses Gemma 4 12B as a concrete case study with real file sizes from Bartowski’s GGUF quantizations.

Not covered in detail: HQQ (zero-shot, no calibration), AQLM (vector quantization for extreme 2-bit), QuIP# (incoherence-based), SmoothQuant (activation-weight co-quantization), Marlin (vLLM’s fast 4-bit kernel), and compressed-tensors (Neural Magic/vLLM native format). These are worth exploring if you’re pushing the boundaries of quantization research.


What Quantization Actually Does

Neural network weights are stored as numbers. The question is: how many bits per number?

| Precision | Bits | Bytes per weight | 12B model size | Description |
| --- | --- | --- | --- | --- |
| FP32 | 32 | 4.0 | 48 GB | Full precision. Training default. |
| BF16 | 16 | 2.0 | 24 GB | Brain Float 16. Same range as FP32, less precision. Training/inference standard. |
| FP16 | 16 | 2.0 | 24 GB | Half precision. Much narrower range than BF16, but more mantissa bits. |
| FP8 (E4M3) | 8 | 1.0 | 12 GB | 4-bit exponent, 3-bit mantissa. Native on Hopper/Ada GPUs. |
| INT8 | 8 | 1.0 | 12 GB | Integer quantization. Requires calibration. |
| INT4/Q4 | 4 | 0.5 | 6 GB | The sweet spot for consumer hardware. |
| Q2 | 2 | 0.25 | 3 GB | Extreme compression. Quality degrades noticeably. |

This table is a simplification. Real quantization isn’t uniform — different parts of the model get different bit-widths.
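
The size column is just parameter count times bits per weight. A minimal sketch, assuming decimal gigabytes and one uniform bit-width for every weight (the function name is illustrative):

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a uniform bit-width."""
    return num_params * bits_per_weight / 8 / 1e9

# Reproduce the 12B column of the table above.
for name, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8), ("INT4", 4), ("Q2", 2)]:
    print(f"{name:>5}: {model_size_gb(12e9, bits):5.1f} GB")
```

The same function gives the 140 GB figure for a 70B model at 16 bits.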


The Two Quantization Paradigms

Post-Training Quantization (PTQ)

Take a pre-trained model and convert its weights to lower precision after training. No retraining needed. This is what most quantizations you download are.

flowchart LR
    FP["Full precision\nmodel (BF16)"] -->|"Quantize\n(minutes to hours)"| Q["Quantized model\n(Q4, FP8, etc.)"]
    Q -->|"Run inference"| OUT["Output"]

Quantization-Aware Training (QAT)

Train the model knowing it will be quantized. The model learns to be robust to precision loss during training itself. Better quality at the same bit-width, but requires retraining.

flowchart LR
    DATA["Training data"] -->|"Train with\nsimulated quantization"| QAT["QAT model\n(FP16 with quant noise)"]
    QAT -->|"Convert"| Q["Quantized model"]
    Q -->|"Run inference"| OUT["Output"]

QAT models are rarer because retraining is expensive. Most quantized models you find on HuggingFace are PTQ.
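
To make "simulated quantization" concrete, here is a toy sketch of the usual trick: the forward pass sees a rounded weight, while the gradient is passed straight through to a full-precision latent weight. Everything here (the one-weight model, the 0.25 grid, the learning rate) is an illustrative assumption, not how any particular framework implements QAT.

```python
def fake_quant(w, step=0.25):
    """Round w to the nearest multiple of step (simulated quantization)."""
    return round(w / step) * step

def train_qat(xs, ys, steps=200, lr=0.05):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight, updated by gradient descent
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(xs, ys):
            pred = fake_quant(w) * x    # forward: uses the quantized weight
            grad += 2 * (pred - y) * x  # backward: straight-through, rounding treated as identity
        w -= lr * grad / len(xs)
    return w

xs = [1.0, 2.0, 3.0]
ys = [0.6 * x for x in xs]  # the ideal weight 0.6 is not on the 0.25 grid
w = train_qat(xs, ys)
print(w, fake_quant(w))
```

The latent weight settles near the rounding boundary between the two grid points that bracket 0.6, which is the sense in which training "knows about" the grid.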


Every Major Quantization Format

GGUF (llama.cpp) — The CPU/Edge Standard

GGUF is the format used by llama.cpp, LM Studio, Ollama, and most local inference tools. It uses a mix of k-quants and i-quants — different algorithms for converting weights to lower precision.

K-Quants (Q-series)

K-quants use block quantization: weights are grouped into blocks (typically 32 or 256 values), and each block gets its own scale factor and zero point. The “K” denotes mixed precision across different layer types.
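
A minimal sketch of block quantization with a per-block scale and minimum. It assumes a plain 32-value block with a 16-bit scale and a 16-bit minimum, which is simpler than the actual K-quant super-block layout; the helper names are illustrative.

```python
import math

def quantize_block(block, bits=4):
    """Asymmetric quantization of one block: returns (codes, scale, minimum)."""
    lo, hi = min(block), max(block)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant blocks
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]

block = [0.1 * math.sin(i) for i in range(32)]  # stand-in for 32 real weights
codes, scale, lo = quantize_block(block)
restored = dequantize_block(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# Storage cost: 32 four-bit codes plus an assumed 16-bit scale and 16-bit minimum.
bits_per_weight = (len(block) * 4 + 16 + 16) / len(block)
print(max_err, bits_per_weight)
```

The per-block metadata is why a "4-bit" format ends up costing more than 4 bits per weight.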

BPW (Bits Per Weight) is the average number of bits used per parameter across the entire model. Since K-quants apply different precision to different layers (attention might get 5-bit while MLPs get 4-bit), the BPW is an average. It’s the most useful size metric: file size ≈ (num_params × BPW) / 8.
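
As a worked example of that averaging, the sketch below uses a made-up split of a 12B model into three tensor groups. The group sizes and per-group bit-widths are hypothetical, chosen only to show the arithmetic.

```python
def average_bpw(groups):
    """groups: list of (num_params, bits_per_weight) pairs."""
    total_params = sum(n for n, _ in groups)
    total_bits = sum(n * b for n, b in groups)
    return total_bits / total_params

# Hypothetical split: attention, MLP, embeddings/output.
groups = [(3e9, 5.5), (8e9, 4.5), (1e9, 6.5)]
bpw = average_bpw(groups)
size_gb = sum(n for n, _ in groups) * bpw / 8 / 1e9  # file size ≈ (num_params × BPW) / 8
print(f"{bpw:.2f} BPW, about {size_gb:.2f} GB")
```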

| Format | Avg BPW | Strategy | Quality |
| --- | --- | --- | --- |
| Q8_0 | 8.5 | 8-bit per block | Near-lossless. Unnecessary unless you need maximum quality. |
| Q6_K | 6.6 | 6-bit with mixed precision | Very high quality, near-perfect. |
| Q5_K_M | 5.7 | 5-bit, attention/MLP get different precision | High quality. Great balance. |
| Q5_K_S | 5.5 | 5-bit, uniform | High quality. |
| Q4_K_M | 4.8 | 4-bit, sensitive layers get 5-bit | The default recommendation. Best quality/size tradeoff for most users. |
| Q4_K_S | 4.6 | 4-bit, uniform | Good quality with more space savings. |
| Q3_K_M | 3.9 | 3-bit, mixed | Noticeable quality loss. Use only if memory-constrained. |
| Q3_K_S | 3.5 | 3-bit, uniform | Low quality, not recommended for most uses. |
| Q2_K | 3.3 | 2-bit with higher-precision scales | Very low quality. Surprisingly usable for simple tasks. |

The “M” in Q4_K_M means “medium” — attention layers and the first/last layers get slightly higher precision than MLP layers. “S” means “small” — all layers get the same (lower) precision. “L” variants (Q4_K_L, Q6_K_L) use Q8_0 specifically for the embedding and output layers, which are particularly sensitive.

I-Quants (IQ-series)

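I-quants are llama.cpp’s newer family of codebook-based (non-linear) quantization types, designed to be used with an importance matrix (imatrix) that tells the quantizer which weights matter most. They achieve better quality than K-quants at the same bit-width, especially below 4-bit. The imatrix itself is not exclusive to I-quants: K-quants can be calibrated with it too.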

| Format | Avg BPW | When to use |
| --- | --- | --- |
| IQ4_XS | 4.3 | Similar to Q4_K_S but better quality at the cost of slower CPU inference |
| IQ4_NL | 4.5 | Non-linear quantization. Supports ARM online repacking. |
| IQ3_M | 3.4 | Comparable to Q3_K_M quality in a smaller file |
| IQ3_XS | 3.3 | Better than Q3_K_S at similar size |
| IQ3_XXS | 3.1 | Extreme 3-bit compression |
| IQ2_M | 2.7 | Uses state-of-the-art techniques. Surprisingly usable. |
| IQ2_S | 2.5 | Minimum viable quantization for coherent output. |

Rule of thumb: Use K-quants above 4 bits. Below 4 bits, I-quants are meaningfully better, especially on GPU (cuBLAS/rocBLAS). On CPU, I-quants are slower than K-quants at the same bit-width — speed vs quality tradeoff.

The Importance Matrix (imatrix)

The importance matrix is the key innovation that makes aggressive quantization viable. Here’s how it works:

  1. Calibrate: Run a diverse text dataset through the full-precision model
  2. Measure: For each weight, estimate how much changing it affects the model’s output, using activation statistics collected during calibration
  3. Prioritize: Rounding error on high-importance weights is penalized more heavily, so the quantizer preserves them most faithfully at the expense of unimportant ones

flowchart LR
    DATASET["Calibration dataset\n(wiki + code + multilingual)"] --> FP["Full precision model"]
    FP --> STATS["Per-weight importance\nscores"]
    STATS --> QUANT["Quantizer allocates\nbits by importance"]
    QUANT --> GGUF["GGUF file"]

All of Bartowski’s GGUF quantizations use imatrix calibration. The calibration dataset is publicly available and includes a mix of English text, code, and diverse content.

Why calibration data matters: If you calibrate on English Wikipedia, the importance matrix optimizes for English prose. Code-related weights might get deprioritized. A code model calibrated on prose will perform worse on code than one calibrated on a mixed dataset.
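
The calibrate, measure, prioritize steps can be sketched in a few lines. This is a simplified illustration, assuming importance is the mean squared activation per input column and that it is used to weight rounding error when picking a block scale; the function names and the candidate-scale search are assumptions, not llama.cpp’s actual code.

```python
def column_importance(activations):
    """Mean squared activation per input column over the calibration samples."""
    n = len(activations)
    return [sum(row[j] ** 2 for row in activations) / n
            for j in range(len(activations[0]))]

def weighted_error(weights, scale, importance, qmax=7):
    """Importance-weighted squared rounding error for a given scale."""
    err = 0.0
    for w, imp in zip(weights, importance):
        q = max(-qmax, min(qmax, round(w / scale)))
        err += imp * (w - q * scale) ** 2
    return err

def best_scale(weights, importance, qmax=7):
    """Try scales from 0.80x to 1.20x of the naive max-based scale."""
    base = max(abs(w) for w in weights) / qmax
    candidates = [base * (k / 100) for k in range(80, 121, 2)]
    return min(candidates, key=lambda s: weighted_error(weights, s, importance, qmax))

activations = [[4.0, 0.1, 0.2, 0.1], [3.0, 0.2, 0.1, 0.3]]  # toy calibration batch
weights = [0.9, -0.31, 0.12, 0.55]
importance = column_importance(activations)
base = max(abs(w) for w in weights) / 7
chosen = best_scale(weights, importance)
print(importance, base, chosen)
```

Swap in a different calibration batch and the importance vector changes, which is exactly why the choice of calibration data shows up in the final weights.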


GPTQ — GPU-Optimized Post-Training Quantization

GPTQ (post-training quantization for generative pre-trained transformers) uses a second-order method, based on the Hessian of each layer’s reconstruction error, to minimize quantization error. Instead of just rounding every weight to the nearest quantized value, it compensates: after a weight is rounded, the weights that have not been quantized yet are adjusted to absorb the error it introduced.

Naive rounding:     weight 0.73 → 0.75 (nearest Q4 value)
                    Error accumulates independently per weight

GPTQ:              weight 0.73 → 0.75 (round)
                    weight 0.81 → 0.79 (compensate for previous rounding)
                    Error is minimized across the layer as a whole

| Property | Details |
| --- | --- |
| Format | Safetensors with quantization config |
| Bit-widths | 2, 3, 4, 8-bit |
| Runtime | GPU only (exllama, AutoGPTQ, vLLM, TGI) |
| Calibration | Required (128+ samples from a text dataset, typically C4, Google’s Colossal Clean Crawled Corpus of web text) |
| Speed | Fast on GPU thanks to optimized CUDA kernels |
| Quality | Very good at 4-bit. Better than naive rounding, comparable to AWQ. |
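
The compensation idea can be shown on a two-weight toy problem. The sketch below is a simplified version of the Hessian-based update (closer to the Optimal Brain Quantization step that GPTQ builds on than to GPTQ’s blocked, Cholesky-based implementation). The inputs, weights, and 0.25 grid are made up for illustration.

```python
def invert(m):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(m)
    a = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        piv = a[c][c]
        a[c] = [v / piv for v in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[n:] for row in a]

def round_to_grid(w, step):
    return round(w / step) * step

def quantize_compensated(weights, inputs, step):
    """Quantize weights one at a time, pushing each rounding error onto the rest."""
    n = len(weights)
    h = [[sum(x[i] * x[j] for x in inputs) for j in range(n)] for i in range(n)]  # H = X^T X
    hinv = invert(h)
    w, q = list(weights), [0.0] * n
    for i in range(n):
        q[i] = round_to_grid(w[i], step)
        err = (w[i] - q[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]  # compensate in the not-yet-quantized weights
        col, row, d = [hinv[r][i] for r in range(n)], list(hinv[i]), hinv[i][i]
        for r in range(n):            # drop weight i from the inverse Hessian
            for c in range(n):
                hinv[r][c] -= col[r] * row[c] / d
    return q

def output_sse(weights, quantized, inputs):
    """Squared error of the layer output over the calibration inputs."""
    return sum(sum((qw - w) * xi for w, qw, xi in zip(weights, quantized, x)) ** 2
               for x in inputs)

inputs = [(1.0, 1.0), (1.0, 0.9)]  # two highly correlated input channels
weights = [0.85, 0.85]
naive = [round_to_grid(w, 0.25) for w in weights]
compensated = quantize_compensated(weights, inputs, 0.25)
print(naive, output_sse(weights, naive, inputs))
print(compensated, output_sse(weights, compensated, inputs))
```

Naive rounding pushes both weights down; the compensated version rounds the second weight up instead, so the errors largely cancel in the layer output.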

AWQ — Activation-Aware Weight Quantization

AWQ’s insight: not all weights are equally important. A small fraction of weights (~1%) are salient — they correspond to large activation values and disproportionately affect the output. AWQ identifies these salient weights and protects them:

flowchart LR
    subgraph "AWQ Process"
        ACT["Measure activation\nmagnitudes on\ncalibration data"] --> SAL["Identify salient\nchannels (top 1%)"]
        SAL --> SCALE["Apply per-channel\nscaling to protect\nsalient weights"]
        SCALE --> Q["Quantize everything\nto 4-bit"]
    end

Instead of keeping salient weights at higher precision (which would require mixed-precision kernels), AWQ scales the weight matrix so that salient weights fall in a range where quantization error is minimal. Clever math trick that achieves protection without format complexity.
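
Here is a toy version of that scaling trick. It assumes a single row of four weights, one shared 4-bit scale, and a hand-picked scaling factor of 2 for the salient channel; real AWQ searches for the per-channel scales from activation statistics and folds the inverse scale into the preceding operation.

```python
def quantize_row(weights, qmax=7):
    """Symmetric round-to-nearest with one scale for the whole row."""
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def awq_quantize_row(weights, channel_scales, qmax=7):
    """Scale channels up before quantizing, then undo the scaling afterwards."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    deq = quantize_row(scaled, qmax)
    return [d / s for d, s in zip(deq, channel_scales)]

def output_error(weights, approx, x):
    """Error in the dot product with activation vector x."""
    return sum((a - w) * xi for w, a, xi in zip(weights, approx, x))

weights = [0.17, -0.8, 0.35, 0.62]
x = [10.0, 1.0, 1.0, 1.0]  # channel 0 sees large activations, so it is salient
plain = quantize_row(weights)
protected = awq_quantize_row(weights, [2.0, 1.0, 1.0, 1.0])
print(output_error(weights, plain, x), output_error(weights, protected, x))
```

Because the largest weight in the row is unchanged, the shared scale stays the same, and the salient weight simply lands on a finer effective grid.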

| Property | Details |
| --- | --- |
| Format | Safetensors with AWQ config |
| Bit-widths | 4-bit (primary), 3-bit experimental |
| Runtime | GPU only (vLLM, TGI, AutoAWQ) |
| Speed | Generally faster than GPTQ at same quality |
| Quality | Excellent at 4-bit. Often slightly better than GPTQ, especially on smaller models. |

FP8 — Hardware-Native 8-Bit Float

FP8 isn’t a software quantization trick — it’s a hardware data type supported natively by NVIDIA Hopper (H100), Ada (RTX 4090), and newer GPUs. Two variants:

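| Format | Exponent | Mantissa | Range | Precision | Use Case |
| --- | --- | --- | --- | --- | --- |
| E4M3 | 4 bits | 3 bits | ±448 | 8 mantissa steps per power of two | Weights and activations |
| E5M2 | 5 bits | 2 bits | ±57,344 | 4 mantissa steps per power of two | Gradients (wider range needed) |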

Because it’s a hardware type, FP8 matrix operations run at up to 2× the peak throughput of FP16 on H100 Tensor Cores. The hardware computes in FP8 directly, so no error-compensation tricks are needed, only scaling factors that map each tensor into FP8’s range.
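
The maximum finite value of each variant follows from its bit layout. The sketch below assumes the OCP-style encodings: E5M2 reserves its top exponent code for infinity and NaN like IEEE formats, while E4M3 only gives up the single all-ones pattern for NaN. The function name and flag are illustrative.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude for a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        # Only top exponent with all-ones mantissa is NaN (E4M3).
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return 2.0 ** max_exp * max_frac

print(fp8_max(4, 3, ieee_like=False))  # E4M3
print(fp8_max(5, 2, ieee_like=True))   # E5M2
```

Some E4M3 variants use a different exponent bias and top out at 240 instead, so it is worth checking which one your runtime means.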

| Property | Details |
| --- | --- |
| Runtime | NVIDIA Hopper/Ada/Blackwell GPUs, vLLM, TensorRT-LLM |
| Quality | <1% degradation on most benchmarks |
| Speed | 2× throughput vs FP16 on supported hardware |
| Calibration | Static (pre-computed scales) or dynamic (per-tensor at runtime) |

NVFP4 & MXFP4 — 4-Bit Float (Blackwell)

NVIDIA’s Blackwell architecture (B100, B200, GB10/DGX Spark) introduces native FP4 support via microscaling formats:

  • NVFP4: NVIDIA’s proprietary 4-bit float. Each block of 16 weights shares an FP8 (E4M3) scaling factor, with an additional per-tensor FP32 scale.
  • MXFP4/MXFP8: Open Microscaling formats (OCP standard). Groups of 32 elements share an 8-bit power-of-two scale (E8M0).

These offer up to 4× throughput vs FP16 on Blackwell hardware. vLLM supports both.
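
A rough sketch of microscaling in the MXFP4 style: every element in a group is stored as a 4-bit E2M1 value, and the group shares one power-of-two scale. It follows the scale rule from the OCP MX specification as I understand it; the function name is illustrative, and the demo uses a short block rather than a full group of 32.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_quantize(block):
    """Return (dequantized values, shared power-of-two scale) for one group."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return [0.0] * len(block), 1.0
    # E2M1's largest exponent is 2 (4.0 = 2**2), so align the block maximum with it.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # saturate at the largest E2M1 value
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, x))
    return out, scale

values, scale = mxfp4_quantize([0.1, -0.7, 3.0, 11.0])
print(values, scale)
```

Note how the small values in the block get crushed once a large outlier sets the scale, which is why smaller groups tend to preserve more detail.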


EXL2 — Mixed-Precision for ExLlamaV2

EXL2 is the quantization format for ExLlamaV2. Its killer feature: arbitrary bits-per-weight (BPW) with per-layer optimization.

Instead of quantizing the entire model to Q4 or Q5, EXL2 lets you specify a target average BPW (e.g., 4.5) and then allocates bits optimally across layers. Sensitive layers get more bits, tolerant layers get fewer.
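
To give a feel for budget-driven allocation, here is a toy greedy allocator. It is not ExLlamaV2’s algorithm: the per-layer sensitivities are hypothetical, and the error model (error proportional to sensitivity times 4^-bits) is an assumption made so the example runs without a model.

```python
def allocate_bits(sizes, sensitivity, target_bpw, choices=(2, 3, 4, 5, 6, 8)):
    """Greedily upgrade whichever layer gives the most error reduction per extra bit."""
    n = len(sizes)
    level = [0] * n  # index into choices for each layer
    budget = target_bpw * sum(sizes)
    used = sum(s * choices[0] for s in sizes)
    while True:
        best, best_gain = None, 0.0
        for i in range(n):
            if level[i] + 1 >= len(choices):
                continue
            cur, nxt = choices[level[i]], choices[level[i] + 1]
            cost = sizes[i] * (nxt - cur)
            if used + cost > budget:
                continue
            gain = sensitivity[i] * (4.0 ** -cur - 4.0 ** -nxt) / cost
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            return [choices[l] for l in level]
        used += sizes[best] * (choices[level[best] + 1] - choices[level[best]])
        level[best] += 1

sizes = [100, 100, 100, 100]       # parameters per layer (toy numbers)
sensitivity = [10.0, 1.0, 1.0, 0.1]  # hypothetical measured sensitivity
bits = allocate_bits(sizes, sensitivity, target_bpw=4.0)
print(bits, sum(s * b for s, b in zip(sizes, bits)) / sum(sizes))
```

The average still lands on the 4.0 BPW target, but the sensitive layer ends up with three times the bits of the tolerant one.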

| Property | Details |
| --- | --- |
| Runtime | ExLlamaV2 only (GPU) |
| BPW range | 2.0 to 8.0, arbitrary granularity |
| Calibration | Required, uses perplexity measurement per layer |
| Speed | Very fast on NVIDIA GPUs, optimized CUDA kernels |
| Quality | Best-in-class at any given BPW target due to per-layer optimization |

BitsAndBytes — Integration-First Quantization

BitsAndBytes is the most widely used quantization library in the HuggingFace ecosystem, primarily for fine-tuning (QLoRA) rather than inference.

| Format | Bits | Description |
| --- | --- | --- |
| INT8 (LLM.int8()) | 8 | Mixed-precision decomposition. Outlier features stay in FP16. |
| NF4 (Normal Float 4) | 4 | 4-bit data type optimized for normally-distributed weights. Used by QLoRA. |
| FP4 | 4 | Standard 4-bit float. |

BitsAndBytes is slower for inference than GPTQ/AWQ because it doesn’t have dedicated inference kernels. Its strength is enabling 4-bit fine-tuning (QLoRA) where you train LoRA adapters on a frozen 4-bit base model.


Why Architecture Matters for Quantization

Here’s the part most guides skip: the same quantization format performs differently on different model architectures. This isn’t a minor effect — it can be the difference between “works great” and “useless.”

1. Embedding and Output Layers Are Uniquely Sensitive

The embedding layer maps discrete token IDs to continuous vectors. The output (lm_head) layer maps hidden states back to vocabulary logits. These layers are qualitatively different from transformer layers:

  • Vocabulary is huge (262,144 tokens in Gemma 4). Each row must distinguish one token from 262,143 others.
  • Small errors in output logits shift probability mass between tokens, directly corrupting generation.
  • These layers can’t be “averaged out” — there’s no redundancy.

This is why Bartowski provides _L variants (Q3_K_XL, Q4_K_L, Q6_K_L) that keep embedding/output weights at Q8_0 while quantizing everything else more aggressively. For Gemma 4 12B, the difference between Q3_K_L (6.65GB) and Q3_K_XL (6.90GB) is 250MB — the cost of protecting those two layers.

2. Attention vs MLP Sensitivity

In K-quant naming, the “M” (medium) variants give attention layers slightly higher precision than MLP layers. This reflects an empirical finding: attention weights are more sensitive to quantization than MLP weights.

Why? Attention weights (Q, K, V projections) directly compute what the model “looks at.” A quantization error in Q·K^T distorts which tokens get attended to — and this error propagates through every subsequent layer. MLP weights transform representations locally — errors in one MLP don’t cascade as severely.

3. Sliding Window vs Global Attention Layers

This is Gemma-specific but generalizable. Gemma 4 has two types of layers:

  • Sliding window layers (40 of 48) — attend to last 1024 tokens
  • Global layers (8 of 48) — attend to ALL tokens

Global layers are more quantization-sensitive because:

  • They handle long-range information retrieval (system prompt recall, early context)
  • Errors in global layers can’t be compensated by later layers — there are only 8 of them
  • They use MQA (single KV head) with larger head dimensions (512 vs 256), meaning each weight carries proportionally more information

4. MoE Models Need Special Care

In Mixture-of-Experts models, only a few experts activate per token: 2 of 8 in Mixtral, 8 of 256 routed experts in DeepSeek V3. During calibration:

  • Popular experts see lots of calibration data → accurate importance scores
  • Rare experts see little data → noisy importance scores → worse quantization

This means MoE models can have expert-dependent quality degradation: most of the model quantizes well, but the rarely-activated experts (which handle niche knowledge) degrade disproportionately.

5. K=V Sharing Changes Quantization Math

Gemma 4 12B uses attention_k_eq_v = true — K and V are the same tensor. This means quantizing K implicitly quantizes V with the same error. In models where K and V are separate, their quantization errors are independent and can partially cancel out in the attention output. With K=V sharing, they correlate perfectly, amplifying the error.

This doesn’t mean K=V sharing is bad — it halves the KV cache. But it means Gemma 4 may be slightly more sensitive to attention weight quantization than models with separate K and V projections.


Case Study: Gemma 4 12B Quantization

Available Quantizations

From Bartowski’s GGUF repo, all quantized with imatrix:

| Format | File Size | BPW | Compression | Quality Assessment |
| --- | --- | --- | --- | --- |
| BF16 | 23.83 GB | 16.0 | 1.0× | Full precision baseline |
| Q8_0 | 12.67 GB | 8.5 | 1.9× | Near-lossless. Generally unnecessary. |
| Q6_K_L | 10.48 GB | 6.6 | 2.3× | Embed/output at Q8_0. Near-perfect. ✅ Recommended |
| Q6_K | 10.24 GB | 6.6 | 2.3× | Near-perfect. ✅ Recommended |
| Q5_K_L | 9.02 GB | 5.7 | 2.6× | Embed/output at Q8_0. High quality. ✅ Recommended |
| Q5_K_M | 8.77 GB | 5.7 | 2.7× | High quality. ✅ Recommended |
| Q4_K_L | 7.91 GB | 4.8 | 3.0× | Embed/output at Q8_0. Good quality. ✅ Recommended |
| Q4_K_M | 7.66 GB | 4.8 | 3.1× | Default recommendation. Best balance. ✅ |
| Q4_K_S | 7.17 GB | 4.6 | 3.3× | Slightly lower quality, more savings. ✅ |
| IQ4_XS | 6.78 GB | 4.3 | 3.5× | Better quality than Q4_K_S at smaller size. ✅ |
| Q3_K_XL | 6.90 GB | 3.9 | 3.5× | Embed/output at Q8_0. Usable. |
| Q3_K_L | 6.65 GB | 3.9 | 3.6× | Usable. Good for low RAM. |
| Q3_K_M | 6.30 GB | 3.9 | 3.8× | Low quality. |
| IQ3_M | 5.97 GB | 3.4 | 4.0× | Better than Q3_K_M at smaller size. |
| Q2_K_L | 5.32 GB | 3.3 | 4.5× | Embed/output at Q8_0. Very low but usable. |
| Q2_K | 5.08 GB | 3.3 | 4.7× | Very low quality. Surprisingly usable for simple tasks. |
| IQ2_M | 4.94 GB | 2.7 | 4.8× | SOTA techniques. Surprisingly usable. |
| IQ2_S | 4.71 GB | 2.5 | 5.1× | Minimum viable coherence. |

Which Quantization for Gemma 4 12B?

The answer depends on your hardware:

| Hardware | VRAM/RAM | Recommended Quant | Reasoning |
| --- | --- | --- | --- |
| RTX 4090 / RTX 3090 (24GB) | 24 GB VRAM | Q6_K_L (10.48 GB) | Fits with room for 8K+ context KV cache |
| RTX 4080 (16GB) | 16 GB VRAM | Q4_K_M (7.66 GB) | Fits with room for moderate context |
| DGX Spark / GB10 (128GB unified) | 128 GB | BF16 (23.83 GB) | Plenty of room, no need to quantize |
| Apple M3 Max (96GB) | 96 GB unified | Q8_0 or BF16 | Unified memory, run full precision |
| Apple M4 (16GB) | 16 GB unified | Q4_K_M (7.66 GB) | Tight fit, Q4 is the sweet spot |
| CPU only (32GB RAM) | 32 GB | Q4_K_M (7.66 GB) | K-quants are faster than I-quants on CPU |
| CPU only (16GB RAM) | 16 GB | IQ3_M (5.97 GB) | I-quants give better quality at this size |

Gemma 4 Architecture Considerations

Gemma 4 12B has several architectural features that affect quantization decisions:

  1. Large vocabulary (262K tokens): The embedding layer is proportionally large. The _L variants that protect it at Q8_0 are worth the ~250MB premium.

  2. 48 layers with 5:1 local:global ratio: Only 8 global layers handle long-range attention. These 8 layers would ideally be kept at higher precision. The “M” variants help only indirectly: they raise precision for attention tensors in general rather than for global layers specifically.

  3. K=V sharing: As discussed above, quantization errors in attention weights are amplified because K and V share the same quantized tensor. This makes Gemma 4 slightly more sensitive to attention quantization than models with separate K/V.

  4. Multimodal (vision encoder): The vision encoder has separate weights. GGUF quantization typically handles this, but check that your runtime supports Gemma 4’s multimodal architecture.

Practical recommendation: For Gemma 4 12B on consumer hardware, start with Q4_K_M (7.66 GB). If you notice quality issues on reasoning or long-context tasks, step up to Q5_K_M (8.77 GB). If you have the VRAM, Q6_K_L (10.48 GB) is near-indistinguishable from full precision and protects the large embedding layer.


GPU vs CPU Quantization Considerations

| Factor | GPU (CUDA/ROCm) | CPU (AVX2/ARM) |
| --- | --- | --- |
| Best format | GPTQ, AWQ, or FP8 for dedicated GPU; GGUF for llama.cpp | GGUF (K-quants for speed, I-quants for quality) |
| Below 4-bit | I-quants strongly preferred (cuBLAS optimized) | K-quants faster; I-quants slower but better quality |
| FP8 | Native on Hopper/Ada, 2× throughput | Not applicable |
| Q4_0 vs Q4_K_M | Q4_K_M better quality | Q4_0 supports online repacking for ARM NEON (faster) |
| Key bottleneck | Memory bandwidth (reading weights from VRAM) | Memory bandwidth (reading weights from RAM) |

KV Cache Quantization: Separate from Weight Quantization

Weight quantization is a one-time conversion. KV cache quantization happens during inference — the KV vectors generated at each step are quantized on-the-fly.

| Method | Compression (nominal, from bit-width) | Quality Impact | Where Supported |
| --- | --- | --- | --- |
| FP16 KV (default) | 1× | Baseline | Everywhere |
| FP8 KV | 2× | <1% degradation | vLLM (Hopper+), TensorRT-LLM |
| INT4 KV (TurboQuant) | ~4× | Noticeable on needle-in-haystack | vLLM (experimental) |
| INT2 KV (TurboQuant) | ~8× | Model-dependent, under research | vLLM (experimental) |

KV cache quantization is independent of weight quantization. You can run Q4_K_M weights with FP16 KV cache, or BF16 weights with FP8 KV cache. They compress different things:

Total GPU memory = model weights (quantized once) 
                 + KV cache (quantized during inference) 
                 + activation memory (not quantized)

For Gemma 4 12B at 128K context (from our attention deep dive):

  • Weights: 7.66 GB (Q4_K_M)
  • KV cache: ~2.4 GB (FP16) → ~1.2 GB with FP8 KV
  • FP8 KV cache is especially valuable for Gemma 4 because its sliding window layers already minimize KV cache, so the remaining global-layer KV is the expensive part — and that’s exactly what FP8 compresses.
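
The budget above is simple enough to put in a helper. This sketch plugs in the figures quoted in this section (7.66 GB of weights, ~2.4 GB of FP16 KV cache). It assumes the KV cache shrinks linearly with bit-width and leaves activation memory as a parameter, since the post gives no figure for it.

```python
def total_memory_gb(weights_gb, kv_fp16_gb, kv_bits=16, activations_gb=0.0):
    """Weights + KV cache (scaled from its FP16 size) + activation memory."""
    return weights_gb + kv_fp16_gb * kv_bits / 16 + activations_gb

fp16_kv_total = total_memory_gb(7.66, 2.4)             # Q4_K_M weights, FP16 KV
fp8_kv_total = total_memory_gb(7.66, 2.4, kv_bits=8)   # Q4_K_M weights, FP8 KV
print(f"{fp16_kv_total:.2f} GB vs {fp8_kv_total:.2f} GB")
```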

The Full Quantization Decision Tree

flowchart TD
    START["What's your hardware?"] --> GPU{"NVIDIA GPU\n(which generation?)"}
    START --> CPU{"CPU / Apple\nSilicon?"}

    GPU -->|"Hopper/Ada\n(H100, 4090)"| FP8["Use FP8 (native)\nor AWQ 4-bit"]
    GPU -->|"Blackwell\n(B100, B200, GB10)"| NVFP4["Use NVFP4\nor FP8"]
    GPU -->|"Ampere/Older\n(3090, A100, A6000)"| GPU_FORMAT{"How much VRAM?"}

    GPU_FORMAT -->|"> model size"| AWQ_GPTQ["AWQ or GPTQ 4-bit\n(fast GPU kernels)"]
    GPU_FORMAT -->|"< model size"| GGUF_GPU["GGUF Q4_K_M\nwith GPU offload"]

    CPU -->|"Lots of RAM\n(64GB+)"| GGUF_HIGH["GGUF Q6_K or Q8_0\n(K-quants for speed)"]
    CPU -->|"Limited RAM\n(16-32GB)"| GGUF_LOW["GGUF Q4_K_M\nor IQ4_XS"]
    CPU -->|"Very limited\n(<16GB)"| GGUF_TINY["GGUF IQ3_M or IQ2_M\n(I-quants for quality)"]
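
For readers who prefer code to diagrams, the same tree can be written as a small function. The function name, argument names, and architecture strings are illustrative choices; the branches and recommendations mirror the flowchart.

```python
def recommend_format(device, gpu_arch=None, vram_gb=0.0, ram_gb=0.0, model_gb=0.0):
    """Walk the decision tree: device is "gpu" or "cpu" (including Apple Silicon)."""
    if device == "gpu":
        if gpu_arch in ("hopper", "ada"):
            return "FP8 or AWQ 4-bit"
        if gpu_arch == "blackwell":
            return "NVFP4 or FP8"
        # Ampere or older: no native FP8, so the choice depends on VRAM.
        if vram_gb > model_gb:
            return "AWQ or GPTQ 4-bit"
        return "GGUF Q4_K_M with GPU offload"
    if ram_gb >= 64:
        return "GGUF Q6_K or Q8_0"
    if ram_gb >= 16:
        return "GGUF Q4_K_M or IQ4_XS"
    return "GGUF IQ3_M or IQ2_M"

print(recommend_format("gpu", gpu_arch="ampere", vram_gb=24, model_gb=7.66))
print(recommend_format("cpu", ram_gb=8))
```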

Summary: Format Selection Guide

| Format | Best For | Runtime | GPU Required? | Quality |
| --- | --- | --- | --- | --- |
| GGUF (K-quants) | Local inference, CPU, mixed CPU+GPU | llama.cpp, Ollama, LM Studio | No | Good to excellent |
| GGUF (I-quants) | Aggressive compression (<4 bit) on GPU | llama.cpp with cuBLAS | Recommended | Better than K-quants at same BPW |
| GPTQ | GPU inference servers | vLLM, TGI, ExLlama | Yes | Very good |
| AWQ | GPU inference, slightly faster than GPTQ | vLLM, TGI, AutoAWQ | Yes | Very good to excellent |
| EXL2 | Maximum quality at any target BPW | ExLlamaV2 only | Yes | Best-in-class |
| FP8 | Production serving on Hopper+ | vLLM, TensorRT-LLM | Hopper/Ada+ | Near-lossless |
| NVFP4 | Blackwell hardware | vLLM, TensorRT-LLM | Blackwell | Good (4× throughput) |
| BitsAndBytes | Fine-tuning (QLoRA), not inference | Transformers | Yes | Good for training |

This post is licensed under CC BY 4.0 by the author.