What is Quantization?
Quantization reduces the numerical precision of model weights from 16-bit floating point to lower-bit representations such as 8-bit or 4-bit integers. This dramatically reduces VRAM usage and increases inference speed, with a quality cost that grows as the bit width drops.
A 7B parameter model at FP16 uses ~14GB of VRAM. At 4-bit quantization (Q4_K_M), the same model needs only ~4.5GB — a 3x reduction that makes it runnable on consumer GPUs like the RTX 3060 12GB.
Quick VRAM Formula
Estimate VRAM for any model size and quantization level: VRAM (GB) ≈ (parameters in billions × bits per weight) ÷ 8, plus a few GB of overhead for the KV cache and runtime buffers.
Example: 70B model at Q4 = (70B × 4) / 8 = 35GB + ~5GB overhead = ~40GB VRAM
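A minimal sketch of that formula in Python; the 5GB default overhead is a rough assumption that varies with context length and runtime.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 5.0) -> float:
    """Weights-only size plus a flat allowance for KV cache, activations,
    and runtime buffers (the 5 GB default is a rough assumption)."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Example from the text: a 70B model at 4-bit
print(estimate_vram_gb(70, 4))  # 35 GB of weights + 5 GB overhead = 40.0
```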
Methods Compared
| Method | Bits | VRAM (7B model) | Quality (vs FP16) | Relative speed |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | Baseline | 1x |
| Q8_0 | 8 | ~7.5 GB | ~99% | 1.3x |
| Q5_K_M | 5 | ~5.5 GB | ~97% | 1.5x |
| Q4_K_M | 4 | ~4.5 GB | ~94% | 1.7x |
| GPTQ | 4 | ~4.5 GB | ~93% | 1.8x (GPU) |
| AWQ | 4 | ~4.5 GB | ~95% | 2x (GPU) |
| Q3_K_M | 3 | ~3.5 GB | ~88% | 1.9x |
| Q2_K | 2 | ~3 GB | ~80% | 2x |
Format Deep Dive
GGUF (llama.cpp)
The most popular format for local inference. Supports CPU+GPU hybrid execution — offload layers to GPU while keeping the rest in RAM. Best for consumer hardware, CPU-only setups, mixed CPU/GPU.
Works with: llama.cpp, Ollama, LM Studio
Recommended: Q4_K_M for the best balance; Q5_K_M if you have VRAM to spare
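As a quick illustration, loading a Q4_K_M GGUF through the llama-cpp-python bindings and offloading part of the model to the GPU might look like the sketch below; the model path and layer count are placeholders.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,  # layers offloaded to the GPU; set 0 for CPU-only
    n_ctx=4096,       # context window; larger values grow the KV cache
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```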
GPTQ
GPU-only quantization using calibration data for optimal weight rounding. Requires the full model to fit in GPU VRAM. Best for dedicated GPU inference servers, batch processing.
Works with: ExLlamaV2, vLLM, text-generation-inference
Recommended: 4-bit with group_size=128 for best quality
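As a sketch, loading a pre-quantized 4-bit GPTQ checkpoint through Hugging Face transformers (which relies on the optimum and auto-gptq packages being installed) could look like this; the repository name is just an example of a community GPTQ upload.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # example GPTQ repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config is read from the checkpoint; the whole model
# must fit in GPU VRAM.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```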
AWQ (Activation-Aware)
Preserves the most important weights at higher precision based on activation patterns. Generally higher quality than GPTQ at the same bit width. Best for production inference, quality-sensitive applications.
Works with: vLLM, TensorRT-LLM, text-generation-inference
Recommended: 4-bit AWQ for production deployments
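A minimal sketch of serving a 4-bit AWQ checkpoint with vLLM; the model name is a placeholder for any AWQ-quantized repository.

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized model for GPU inference
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["Why pick AWQ over GPTQ for production serving?"], params):
    print(out.outputs[0].text)
```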
EXL2 (ExLlamaV2)
Variable bit-rate quantization — allocates more bits to important layers. Allows fine-grained control (e.g., 3.5 bits per weight). Best for maximizing quality at a specific VRAM budget.
Works with: ExLlamaV2, TabbyAPI
Recommended: 4.0-4.5 bpw for quality, 3.0-3.5 bpw for VRAM savings
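The snippet below is a conceptual sketch of the variable bit-rate idea, not the ExLlamaV2 API: per-layer bit widths (the split shown is hypothetical) average out to a fractional bits-per-weight figure that determines the final weight size.

```python
# Hypothetical per-layer split for a ~7B model: sensitive layers get
# more bits, the bulky MLP layers get fewer.
layer_bits = {"embeddings": 6.0, "attention": 5.0, "mlp": 3.5}
layer_params_b = {"embeddings": 0.5, "attention": 2.0, "mlp": 4.5}  # billions

total_bits_b = sum(layer_bits[k] * layer_params_b[k] for k in layer_bits)  # billions of bits
total_params_b = sum(layer_params_b.values())

avg_bpw = total_bits_b / total_params_b  # fractional bits per weight, ~4.1
weights_gb = total_bits_b / 8            # billions of bits -> roughly GB
print(f"average bpw: {avg_bpw:.2f}, weights: {weights_gb:.1f} GB")
```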
Which Should You Use?
For Most Users: Q4_K_M (GGUF)
Best balance of quality and VRAM savings. Works with llama.cpp, Ollama, and LM Studio. Supports CPU+GPU hybrid inference.
For Production: AWQ
Higher quality than GPTQ at same size. Fast GPU inference via vLLM or TensorRT-LLM. Best for serving at scale.
For Quality: Q5_K_M or Q8_0
Minimal quality loss vs FP16. Use Q5_K_M if VRAM is tight, Q8_0 if you have headroom. Great for creative writing and reasoning.
For Tight VRAM: Q3_K_M or EXL2
Squeeze larger models into limited VRAM. EXL2 offers variable bit-rate for optimal quality/size tradeoff. Quality noticeably lower.
Key Takeaways
1. Q4_K_M gives ~94% of FP16 quality at ~32% of the VRAM, making it the best default choice.
2. The GGUF format is the most versatile (CPU, GPU, hybrid); GPTQ and AWQ are GPU-only but faster.
3. Going below Q4 (3-bit, 2-bit) significantly degrades output quality.
4. The KV cache adds ~1-4GB of overhead depending on context length; factor this into your VRAM budget (see the sketch after this list).
5. Use our VRAM Calculator to check exact requirements for your model + GPU combo.
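A minimal sketch of the standard KV-cache size formula behind point 4; the layer, head, and dimension values below are those of a typical 7B model, and FP16 cache storage (2 bytes per element) is assumed.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for the separate key and value tensors, per layer, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

print(kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096))
# ≈ 2.1 GB, squarely in the ~1-4 GB range quoted above
```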