What is Quantization?
Quantization reduces the numerical precision of model weights from 16-bit floating point to lower-bit representations such as 8-bit or 4-bit integers. This dramatically reduces VRAM usage and increases inference speed, with a quality cost that grows as the bit width drops.
A 7B parameter model at FP16 uses ~14GB of VRAM. At 4-bit quantization (Q4_K_M), the same model needs only ~4.5GB — a 3x reduction that makes it runnable on consumer GPUs like the RTX 3060 12GB.
Quick VRAM Formula
Estimate VRAM for any model size and quantization level: VRAM (GB) ≈ (parameters in billions × bits per weight) ÷ 8, plus a few GB of overhead for the KV cache and runtime buffers.
Example: 70B model at Q4 = (70B × 4) / 8 = 35GB + ~5GB overhead = ~40GB VRAM
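A minimal sketch of that formula in Python; the 5GB default overhead is a rough assumption that varies with context length and runtime.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 5.0) -> float:
    """Weights-only size plus a flat allowance for KV cache, activations,
    and runtime buffers (the 5 GB default is a rough assumption)."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Example from the text: a 70B model at 4-bit
print(estimate_vram_gb(70, 4))  # 35 GB of weights + 5 GB overhead = 40.0
```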
Methods Compared
| Method | Bits | VRAM (7B model) | Quality (vs FP16) | Relative speed |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | Baseline | 1x |
| Q8_0 | 8 | ~7.5 GB | ~99% | 1.3x |
| Q5_K_M | 5 | ~5.5 GB | ~97% | 1.5x |
| Q4_K_M | 4 | ~4.5 GB | ~94% | 1.7x |
| GPTQ | 4 | ~4.5 GB | ~93% | 1.8x (GPU) |
| AWQ | 4 | ~4.5 GB | ~95% | 2x (GPU) |
| Q3_K_M | 3 | ~3.5 GB | ~88% | 1.9x |
| Q2_K | 2 | ~3 GB | ~80% | 2x |
Format Deep Dive
GGUF (llama.cpp)
The most popular format for local inference. Supports CPU+GPU hybrid execution — offload layers to GPU while keeping the rest in RAM. Best for consumer hardware, CPU-only setups, mixed CPU/GPU.
Works with: llama.cpp, Ollama, LM Studio
Recommended: Q4_K_M for the best balance; Q5_K_M if you have VRAM to spare
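As a quick illustration, loading a Q4_K_M GGUF through the llama-cpp-python bindings and offloading part of the model to the GPU might look like the sketch below; the model path and layer count are placeholders.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,  # layers offloaded to the GPU; set 0 for CPU-only
    n_ctx=4096,       # context window; larger values grow the KV cache
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```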
GPTQ
GPU-only quantization using calibration data for optimal weight rounding. Requires the full model to fit in GPU VRAM. Best for dedicated GPU inference servers, batch processing.
Works with: ExLlamaV2, vLLM, text-generation-inference
Recommended: 4-bit with group_size=128 for best quality
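As a sketch, loading a pre-quantized 4-bit GPTQ checkpoint through Hugging Face transformers (which relies on the optimum and auto-gptq packages being installed) could look like this; the repository name is just an example of a community GPTQ upload.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # example GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config is read from the checkpoint; the whole model
# must fit in GPU VRAM.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```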
AWQ (Activation-Aware)
Preserves the most important weights at higher precision based on activation patterns. Generally higher quality than GPTQ at the same bit width. Best for production inference, quality-sensitive applications.
Works with: vLLM, TensorRT-LLM, text-generation-inference
Recommended: 4-bit AWQ for production deployments
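A minimal sketch of serving a 4-bit AWQ checkpoint with vLLM; the model name is a placeholder for any AWQ-quantized repository.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized model for GPU inference
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["Why pick AWQ over GPTQ for production serving?"], params):
    print(out.outputs[0].text)
```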
EXL2 (ExLlamaV2)
Variable bit-rate quantization — allocates more bits to important layers. Allows fine-grained control (e.g., 3.5 bits per weight). Best for maximizing quality at a specific VRAM budget.
Works with: ExLlamaV2, TabbyAPI
Recommended: 4.0-4.5 bpw for quality, 3.0-3.5 bpw for VRAM savings
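The snippet below is a conceptual sketch of the variable bit-rate idea, not the ExLlamaV2 API: per-layer bit widths (the split shown is hypothetical) average out to a fractional bits-per-weight figure that determines the final weight size.

```python
# Hypothetical per-layer split for a ~7B model: sensitive layers get
# more bits, the bulky MLP layers get fewer.
layer_bits = {"embeddings": 6.0, "attention": 5.0, "mlp": 3.5}
layer_params_b = {"embeddings": 0.5, "attention": 2.0, "mlp": 4.5}  # billions

total_bits_b = sum(layer_bits[k] * layer_params_b[k] for k in layer_bits)  # billions of bits
total_params_b = sum(layer_params_b.values())

avg_bpw = total_bits_b / total_params_b  # fractional bits per weight, ~4.1
weights_gb = total_bits_b / 8            # billions of bits -> roughly GB
print(f"average bpw: {avg_bpw:.2f}, weights: {weights_gb:.1f} GB")
```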
Which Should You Use?
For Most Users: Q4_K_M (GGUF)
Best balance of quality and VRAM savings. Works with llama.cpp, Ollama, and LM Studio. Supports CPU+GPU hybrid inference.
For Production: AWQ
Higher quality than GPTQ at same size. Fast GPU inference via vLLM or TensorRT-LLM. Best for serving at scale.
For Quality: Q5_K_M or Q8_0
Minimal quality loss vs FP16. Use Q5_K_M if VRAM is tight, Q8_0 if you have headroom. Great for creative writing and reasoning.
For Tight VRAM: Q3_K_M or EXL2
Squeeze larger models into limited VRAM. EXL2 offers variable bit-rate for optimal quality/size tradeoff. Quality noticeably lower.
Key Takeaways
1. Q4_K_M gives ~94% of FP16 quality at ~32% of the VRAM, making it the best default choice.
2. The GGUF format is the most versatile (CPU, GPU, hybrid); GPTQ and AWQ are GPU-only but faster.
3. Going below Q4 (3-bit, 2-bit) significantly degrades output quality.
4. The KV cache adds ~1-4GB of overhead depending on context length; factor this into your VRAM budget (see the sketch after this list).
5. Use our VRAM Calculator to check exact requirements for your model + GPU combo.
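A minimal sketch of the standard KV-cache size formula behind point 4; the layer, head, and dimension values below are those of a typical 7B model, and FP16 cache storage (2 bytes per element) is assumed.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for the separate key and value tensors, per layer, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

print(kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096))
# ≈ 2.1 GB, squarely in the ~1-4 GB range quoted above
```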