Last updated: April 5, 2026 · Model Architecture · by Daniel Ashford

What is Quantization?

QUICK ANSWER

Compressing an LLM to use less memory by reducing numerical precision.

Definition

Quantization is a compression technique that reduces the numerical precision of model parameters, typically from 16-bit floating point (FP16 or BF16) to lower-precision formats such as 8-bit, 4-bit, or 2-bit integers. This dramatically reduces memory requirements and can speed up inference, since LLM inference is often limited by memory bandwidth rather than raw compute.
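To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization: floats are mapped onto the integer range [-127, 127] via a single scale factor, then mapped back (approximately) when needed. This is a simplified illustration, not the exact scheme used by any particular format like GGUF or GPTQ, which use finer-grained (per-group) scales.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map float weights to int8."""
    scale = np.abs(weights).max() / 127.0  # one scale factor for the whole tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 1.27], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# w_hat is close to w; the gap (at most scale/2 per weight) is the quantization error
```

Each int8 weight takes 1 byte instead of 2 for FP16, halving weight memory; 4-bit and 2-bit formats push the same idea further at the cost of larger rounding error.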

How It Works

A 70B parameter model in 16-bit precision requires approximately 140GB of memory for its weights alone. Quantized to 4-bit, it fits in approximately 35GB — runnable on far fewer GPUs. 8-bit quantization typically causes negligible quality loss on standard benchmarks, while aggressive 2-bit quantization can cause noticeable degradation. Popular quantization formats include GGUF, GPTQ, and AWQ.
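The arithmetic behind these figures is simply parameter count times bits per parameter. A small helper makes the pattern easy to reuse (it ignores real-world overhead such as activations, the KV cache, and the quantization scale factors themselves):

```python
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory: parameters x bits per parameter.
    Ignores overhead (activations, KV cache, quantization scales)."""
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9  # decimal GB

for bits in (16, 8, 4, 2):
    print(f"70B model at {bits}-bit: ~{model_memory_gb(70, bits):.0f} GB")
# 16-bit -> ~140 GB and 4-bit -> ~35 GB, matching the figures above
```

In practice you should budget some extra VRAM headroom on top of these numbers for activations and the KV cache, which grow with context length.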

Example

Running Llama 3.1 405B in 16-bit precision requires roughly 810GB for the weights alone — more than a full 8x A100 (80GB) node can hold. Quantized to 4-bit, the weights shrink to roughly 200GB, so the model can run on 3x A100 80GB GPUs with minimal quality loss.

Related Terms

Parameters
The numerical weights inside an LLM that encode its learned knowledge.
GPU (Graphics Processing Unit)
The specialized hardware that LLMs run on.
Self-Hosting
Running an LLM on your own hardware instead of using a cloud API.
VRAM
The GPU memory that determines which models can run on which hardware.

See How Models Compare

Understanding quantization is important when choosing the right AI model. See how 12 models compare on our leaderboard.
