The Key Bottleneck: VRAM
GPU VRAM is the primary constraint for running LLMs locally. The model weights, KV cache, and activation memory must all fit in VRAM for GPU inference. If they don't fit, you can offload layers to system RAM (much slower) or use quantization to shrink the model.
Quick rule of thumb for VRAM requirements:
- FP16 weights: roughly 2 GB per billion parameters
- Q4 weights: roughly 0.5-0.6 GB per billion parameters
- Add 1-4 GB for KV cache overhead at typical context lengths (2K-32K tokens); very long contexts (64K-128K) need substantially more
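A minimal sketch of that arithmetic in Python (the function and its defaults are illustrative, not from any particular library):

```python
def estimate_vram_gb(params_billion: float,
                     gb_per_billion: float = 2.0,   # FP16; use ~0.55 for Q4
                     kv_cache_gb: float = 2.0) -> float:
    """Back-of-the-envelope VRAM estimate: weights plus KV-cache headroom."""
    return params_billion * gb_per_billion + kv_cache_gb

# An 8B model at FP16 vs. Q4 (~4.5 bits/weight, i.e. ~0.55 GB per billion params)
print(estimate_vram_gb(8))                        # ~18 GB
print(estimate_vram_gb(8, gb_per_billion=0.55))   # ~6.4 GB
```

These estimates land close to the 7-8B row of the table below; the extra gigabyte or two comes from the KV-cache term.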
Requirements by Model Size
| Size | VRAM (FP16) | VRAM (Q4) | System RAM | Min GPU |
|---|---|---|---|---|
| 1-3B | 2-6 GB | 1-2 GB | 8 GB | GTX 1060 6GB+ |
| 7-8B | 14-16 GB | 4-5 GB | 16 GB | RTX 3060 12GB |
| 13-14B | 26-28 GB | 8-9 GB | 32 GB | RTX 4070 Ti 16GB |
| 32-34B | 64-68 GB | 18-20 GB | 32 GB | RTX 3090 24GB |
| 70B | 140 GB | 35-40 GB | 64 GB | 2x RTX 3090 or A100 80GB |
| 120-180B | 240-360 GB | 65-100 GB | 128 GB | 4x A100 80GB or 8x RTX 3090 |
| 405B | 810 GB | ~220 GB | 256 GB | 8x A100 80GB+ |
VRAM figures include ~10% overhead for KV cache at 2K context. Longer contexts require more VRAM.
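To see where that overhead comes from, the KV cache grows linearly with context length: 2 (keys and values) x layers x KV heads x head dimension x tokens x bytes per element. A quick calculation, assuming a Llama-3-8B-style configuration (32 layers, 8 KV heads with GQA, head dimension 128, FP16 cache):

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Keys and values (x2), per layer, per KV head, per head dimension, per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for ctx in (2_048, 8_192, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_gib(32, 8, 128, ctx):.2f} GiB")
# ~0.25 GiB at 2K, ~1 GiB at 8K, ~4 GiB at 32K
```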
GPU Tiers for Local AI
| Tier | Price | Example GPUs | VRAM | Runs |
|---|---|---|---|---|
| Entry | $150-300 (used) | GTX 1660 Super, RTX 3060 | 6-12 GB | Up to 7B at Q4, 3B at FP16 |
| Mid-Range | $400-700 | RTX 4070 Ti Super, RTX 3080 | 12-16 GB | Up to 13B at Q4, 7B at FP16 |
| High-End | $750-1,600 | RTX 3090, RTX 4090 | 24 GB | Up to 34B at Q4, 13B at FP16 |
| Professional | $2,000-5,000 | RTX A6000, L40S | 48 GB | Up to 70B at Q4, 34B at FP16 |
| Enterprise | $10,000-30,000 | A100 80GB, H100 80GB | 80 GB | 70B at Q5+, 120B at Q4 |
Beyond the GPU
System RAM
At minimum, match your VRAM in system RAM. For CPU offloading (running model layers on the CPU), you need enough RAM to hold the offloaded layers. 32 GB is the practical minimum for 13B+ models; 64 GB for 70B.
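As a concrete illustration of hybrid CPU/GPU offloading, llama.cpp-style runtimes let you choose how many transformer layers live in VRAM while the rest stay in system RAM. A minimal sketch using the llama-cpp-python bindings (the model path and layer count are placeholders):

```python
from llama_cpp import Llama

# Keep 24 layers on the GPU; the remaining layers run on the CPU from system RAM.
# n_gpu_layers=-1 would offload every layer (if it fits in VRAM).
llm = Llama(
    model_path="./models/llama-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,
    n_ctx=4096,
)

out = llm("Q: Why does CPU offloading slow inference down? A:", max_tokens=48)
print(out["choices"][0]["text"])
```

Each offloaded layer trades VRAM for speed, since those layers are limited by system memory bandwidth rather than the GPU's.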
Storage
Model files range from about 3 GB (7B at Q4) to over 800 GB (405B at FP16). Use an NVMe SSD for fast model loading; a 1 TB NVMe is ideal for keeping multiple models available. Loading from an HDD adds 30-60 seconds per model.
CPU
CPU matters less for GPU inference but is critical for CPU-only or hybrid setups. More cores and faster memory mean faster CPU inference, which is largely memory-bandwidth bound. AVX-512 support (Intel 11th-gen Core, many Xeons, AMD Zen 4+) significantly speeds up CPU inference with llama.cpp.
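If you are unsure whether a machine exposes AVX-512, a quick Linux-only check is to look for the avx512f flag in /proc/cpuinfo (a sketch; on other platforms a package such as py-cpuinfo serves the same purpose):

```python
# Linux-only sketch: inspect CPU feature flags for AVX2 and AVX-512 (foundation).
flags: set[str] = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

print("AVX2:   ", "avx2" in flags)
print("AVX-512:", "avx512f" in flags)
```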
Power Supply
High-end GPUs draw significant power. RTX 4090 needs 450W alone. For dual GPU setups, plan for 1000W+ PSU. Factor electricity cost into your total cost of ownership vs cloud alternatives.
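A rough sizing example, with illustrative (not measured) component figures; the usual advice is to keep 25-35% headroom above peak draw:

```python
# Illustrative PSU sizing for a dual-GPU build; wattages are assumptions.
components_watts = {
    "2x RTX 3090 (~350 W each)": 700,
    "Desktop CPU under load": 150,
    "Motherboard, RAM, SSDs, fans": 100,
}
peak = sum(components_watts.values())   # 950 W
recommended = peak * 1.3                # ~1,235 W => a 1,200 W-class PSU
print(f"Peak draw ~{peak} W; recommended PSU ~{recommended:.0f} W")
```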
Multi-GPU Setups
For models that exceed a single GPU's VRAM, you can split the model across multiple GPUs: llama.cpp splits by layer, vLLM uses tensor parallelism, and most other inference frameworks support one of the two. A configuration sketch follows the example builds below.
2x RTX 3090 (48GB total)
Best value multi-GPU setup. Runs 70B at Q4 comfortably. Used RTX 3090s cost ~$750 each. Total ~$1,500 for 70B-class capability.
2x RTX 4090 (48GB total)
Faster than 3090s but more expensive. Better for throughput-sensitive workloads. ~$3,200 total.
Multi-GPU requires a motherboard with enough PCIe lanes (x8 minimum per GPU) and adequate spacing between slots for cooling.
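As a sketch of the software side, vLLM exposes this through a single tensor_parallel_size parameter (the model identifier below is a placeholder; in practice you would point at a quantized 70B checkpoint so it fits in 48 GB):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism across 2 GPUs: each GPU holds roughly half of every
# weight matrix, so the combined VRAM pool is what has to fit the model.
llm = LLM(
    model="path/or/hub-id-of-a-quantized-70b-model",  # placeholder
    tensor_parallel_size=2,
)

outputs = llm.generate(["Explain the KV cache in one sentence."],
                       SamplingParams(max_tokens=64, temperature=0.7))
print(outputs[0].outputs[0].text)
```

llama.cpp covers the same ground with its --split-mode and --tensor-split options, which also let you split unevenly across cards with different VRAM.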