The Key Bottleneck: VRAM
GPU VRAM is the primary constraint for running LLMs locally. The model weights, KV cache, and activation memory must all fit in VRAM for GPU inference. If they don't fit, you can offload layers to system RAM (much slower) or use quantization to shrink the model.
Quick rule of thumb for VRAM requirements:
- FP16 weights: roughly 2 GB per billion parameters
- Q4 weights: roughly 0.5-0.6 GB per billion parameters
- Add 1-4 GB for KV cache overhead at typical context lengths (2K-32K tokens); very long contexts (64K-128K) need substantially more
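A minimal sketch of that arithmetic in Python (the function and its defaults are illustrative, not from any particular library):

```python
def estimate_vram_gb(params_billion: float,
                     gb_per_billion: float = 2.0,   # FP16; use ~0.55 for Q4
                     kv_cache_gb: float = 2.0) -> float:
    """Back-of-the-envelope VRAM estimate: weights plus KV-cache headroom."""
    return params_billion * gb_per_billion + kv_cache_gb

# An 8B model at FP16 vs. Q4 (~4.5 bits/weight, i.e. ~0.55 GB per billion params)
print(estimate_vram_gb(8))                        # ~18 GB
print(estimate_vram_gb(8, gb_per_billion=0.55))   # ~6.4 GB
```

These estimates land close to the 7-8B row of the table below; the extra gigabyte or two comes from the KV-cache term.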
Requirements by Model Size
| Size | VRAM (FP16) | VRAM (Q4) | System RAM | Min GPU |
|---|---|---|---|---|
| 1-3B | 2-6 GB | 1-2 GB | 8 GB | GTX 1060 6GB+ |
| 7-8B | 14-16 GB | 4-5 GB | 16 GB | RTX 3060 12GB |
| 13-14B | 26-28 GB | 8-9 GB | 32 GB | RTX 4070 Ti 16GB |
| 32-34B | 64-68 GB | 18-20 GB | 32 GB | RTX 3090 24GB |
| 70B | 140 GB | 35-40 GB | 64 GB | 2x RTX 3090 or A100 80GB |
| 120-180B | 240-360 GB | 65-100 GB | 128 GB | 4x A100 80GB or 8x RTX 3090 |
| 405B | 810 GB | ~220 GB | 256 GB | 8x A100 80GB+ |
VRAM figures include ~10% overhead for KV cache at 2K context. Longer contexts require more VRAM.
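To see where that overhead comes from, the KV cache grows linearly with context length: 2 (keys and values) x layers x KV heads x head dimension x tokens x bytes per element. A quick calculation, assuming a Llama-3-8B-style configuration (32 layers, 8 KV heads with GQA, head dimension 128, FP16 cache):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Keys and values (x2), per layer, per KV head, per head dimension, per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for ctx in (2_048, 8_192, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_gib(32, 8, 128, ctx):.2f} GiB")
# ~0.25 GiB at 2K, ~1 GiB at 8K, ~4 GiB at 32K
```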
GPU Tiers for Local AI
| Tier | Price | Example GPUs | VRAM | Runs |
|---|---|---|---|---|
| Entry | $150-300 (used) | GTX 1660 Super, RTX 3060 | 6-12 GB | Up to 7B at Q4, 3B at FP16 |
| Mid-Range | $400-700 | RTX 4070 Ti Super, RTX 3080 | 12-16 GB | Up to 13B at Q4, 7B at FP16 |
| High-End | $750-1,600 | RTX 3090, RTX 4090 | 24 GB | Up to 34B at Q4, 13B at FP16 |
| Professional | $2,000-5,000 | RTX A6000, L40S | 48 GB | Up to 70B at Q4, 34B at FP16 |
| Enterprise | $10,000-30,000 | A100 80GB, H100 80GB | 80 GB | 70B at Q5+, 120B at Q4 |
Beyond the GPU
System RAM
At minimum, match your VRAM in system RAM. For CPU offloading (running model layers on the CPU), you need enough RAM to hold the offloaded layers. 32 GB is the practical minimum for 13B+ models; 64 GB for 70B.
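As a concrete illustration of hybrid CPU/GPU offloading, llama.cpp-style runtimes let you choose how many transformer layers live in VRAM while the rest stay in system RAM. A minimal sketch using the llama-cpp-python bindings (the model path and layer count are placeholders):

```python
from llama_cpp import Llama

# Keep 24 layers on the GPU; the remaining layers run on the CPU from system RAM.
# n_gpu_layers=-1 would offload every layer (if it fits in VRAM).
llm = Llama(
    model_path="./models/llama-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,
    n_ctx=4096,
)

out = llm("Q: Why does CPU offloading slow inference down? A:", max_tokens=48)
print(out["choices"][0]["text"])
```

Each offloaded layer trades VRAM for speed, since those layers are limited by system memory bandwidth rather than the GPU's.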
Storage
Model files range from about 3 GB (7B at Q4) to over 800 GB (405B at FP16). Use an NVMe SSD for fast model loading; a 1 TB NVMe is ideal for keeping multiple models available. Loading from an HDD adds 30-60 seconds per model.
CPU
CPU matters less for GPU inference but is critical for CPU-only or hybrid setups. More cores and faster memory mean faster CPU inference, which is largely memory-bandwidth bound. AVX-512 support (Intel 11th-gen Core, many Xeons, AMD Zen 4+) significantly speeds up CPU inference with llama.cpp.
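If you are unsure whether a machine exposes AVX-512, a quick Linux-only check is to look for the avx512f flag in /proc/cpuinfo (a sketch; on other platforms a package such as py-cpuinfo serves the same purpose):

```python
# Linux-only sketch: inspect CPU feature flags for AVX2 and AVX-512 (foundation).
flags: set[str] = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

print("AVX2:   ", "avx2" in flags)
print("AVX-512:", "avx512f" in flags)
```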
Power Supply
High-end GPUs draw significant power. RTX 4090 needs 450W alone. For dual GPU setups, plan for 1000W+ PSU. Factor electricity cost into your total cost of ownership vs cloud alternatives.
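A rough sizing example, with illustrative (not measured) component figures; the usual advice is to keep 25-35% headroom above peak draw:

```python
# Illustrative PSU sizing for a dual-GPU build; wattages are assumptions.
components_watts = {
    "2x RTX 3090 (~350 W each)": 700,
    "Desktop CPU under load": 150,
    "Motherboard, RAM, SSDs, fans": 100,
}
peak = sum(components_watts.values())   # 950 W
recommended = peak * 1.3                # ~1,235 W => a 1,200 W-class PSU
print(f"Peak draw ~{peak} W; recommended PSU ~{recommended:.0f} W")
```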
Multi-GPU Setups
For models that exceed a single GPU's VRAM, you can split the model across multiple GPUs: llama.cpp splits by layer, vLLM uses tensor parallelism, and most other inference frameworks support one of the two. A configuration sketch follows the example builds below.
2x RTX 3090 (48GB total)
Best value multi-GPU setup. Runs 70B at Q4 comfortably. Used RTX 3090s cost ~$750 each. Total ~$1,500 for 70B-class capability.
2x RTX 4090 (48GB total)
Faster than 3090s but more expensive. Better for throughput-sensitive workloads. ~$3,200 total.
Multi-GPU requires a motherboard with enough PCIe lanes (x8 minimum per GPU) and adequate spacing between slots for cooling.
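As a sketch of the software side, vLLM exposes this through a single tensor_parallel_size parameter (the model identifier below is a placeholder; in practice you would point at a quantized 70B checkpoint so it fits in 48 GB):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism across 2 GPUs: each GPU holds roughly half of every
# weight matrix, so the combined VRAM pool is what has to fit the model.
llm = LLM(
    model="path/or/hub-id-of-a-quantized-70b-model",  # placeholder
    tensor_parallel_size=2,
)

outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64, temperature=0.7))
print(outputs[0].outputs[0].text)
```

llama.cpp covers the same ground with its --split-mode and --tensor-split options, which also let you split unevenly across cards with different VRAM.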