GGUF VRAM Calculator

Know exactly how much VRAM your model needs before you download. Pick a preset or enter your model's architecture parameters.

Model & Quantization
KV cache scales linearly with context length. Llama 3.1 max is 131,072 tokens.
Pass --cache-type-k q8_0 to llama.cpp to reduce KV memory.
Memory Requirements
Total VRAM Required
GB
Model weights
KV cache
Runtime overhead ~0.50 GB
Total
Select a model to see memory requirements.

GPU Compatibility

How the math works

VRAM has three components: quantized model weights, the KV attention cache, and a fixed runtime overhead of ~500 MB for CUDA context, activations, and scratch buffers.

model_gb = params_B × 10⁹ × bits_per_weight ÷ 8 ÷ 1024³
kv_gb = 2 × n_layers × n_kv_heads × head_dim × ctx_len × bytes_per_kv ÷ 1024³
total_gb = model_gb + kv_gb + 0.5

The factor of 2 accounts for both K and V tensors. n_kv_heads uses the actual GQA count, not query heads — Llama 3.1 8B has 32 query heads but only 8 KV heads, making the KV cache 4× smaller. The 0.5 GB overhead is a conservative median; real usage may be 200–800 MB depending on batch size and backend.
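The three formulas can be sketched in a few lines of Python. This is a minimal illustration of the math above, not the calculator's actual implementation; the Q4_K_M figure of ~4.8 bits/weight is an approximation.

```python
def vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
            head_dim, ctx_len, bytes_per_kv=2, overhead_gb=0.5):
    """Total VRAM estimate in GB: weights + KV cache + fixed overhead."""
    gib = 1024 ** 3
    model_gb = params_b * 1e9 * bits_per_weight / 8 / gib
    # The factor of 2 covers both the K and the V tensor per layer.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_kv / gib
    return model_gb + kv_gb + overhead_gb

# Llama 3.1 8B at Q4_K_M (~4.8 bits/weight), 8K context, fp16 KV cache:
# 32 layers, 8 KV heads, head_dim 128
print(round(vram_gb(8.03, 4.8, 32, 8, 128, 8192), 2))  # ≈ 5.99
```

At 8K context the fp16 KV cache for this architecture is exactly 1 GiB, so most of the budget is weights.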

Quantization Reference

All GGUF quantization types with effective bits per weight. "K" variants use k-quants (block-wise mixed precision) for better quality at the same file size.

Type Bits/weight Quality Notes

Frequently Asked Questions

What's the best quantization for a given VRAM budget?

Q4_K_M is the standard recommendation — best size-to-quality tradeoff. If you have a bit more headroom, Q5_K_M is noticeably sharper. For near-original quality, Q8_0 is effectively lossless. Avoid Q2 and Q3 variants unless severely constrained — perplexity degrades visibly.

Why does context length affect VRAM so much?

The KV cache grows linearly with context. A 70B model at 4K context uses ~2 GB for KV, but at 128K context that same model needs ~64 GB for the cache alone — before loading weights. This is why long-context workloads require quantized KV cache (--cache-type-k q8_0 in llama.cpp) or CPU offloading.
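The 32× jump from 4K to 128K can be checked directly. The sketch below assumes Llama 3.1 70B's layout (80 layers, 8 KV heads, head_dim 128); with GQA the fp16 cache comes out somewhat below the round figures quoted above, but the linear scaling is the same.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_kv=2):
    """KV cache size in GiB; the factor of 2 covers K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_kv / 1024**3

# Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, fp16 cache
ctx_4k = kv_cache_gb(80, 8, 128, 4_096)
ctx_128k = kv_cache_gb(80, 8, 128, 131_072)
print(f"{ctx_4k:.2f} GiB at 4K -> {ctx_128k:.1f} GiB at 128K")  # 32x growth
# Quantizing the cache to q8_0 (1 byte per value) halves both numbers.
```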

What if the model doesn't fit on my GPU?

llama.cpp supports partial GPU offload via -ngl (number of GPU layers). Load as many layers as fit in VRAM, run the rest on CPU RAM. The calculator shows a suggested layer count based on a 24 GB GPU when the model exceeds that threshold. Partial offload is slower but fully functional — many users run 70B models with 40 layers on GPU and the rest on CPU.
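One plausible way to derive a suggested -ngl value is to divide the VRAM left after the KV cache and overhead by the per-layer weight size. This is a hypothetical sketch, not the calculator's documented logic, and it assumes all layers are equally sized (in practice the embedding and output layers differ).

```python
def suggest_ngl(model_gb, n_layers, kv_gb, vram_budget_gb=24.0, overhead_gb=0.5):
    """Rough -ngl suggestion: how many layers fit after KV cache and overhead."""
    per_layer_gb = model_gb / n_layers          # assumes uniform layer size
    free_gb = vram_budget_gb - kv_gb - overhead_gb
    return max(0, min(n_layers, int(free_gb / per_layer_gb)))

# 70B at Q4_K_M is roughly 40 GB of weights across 80 layers;
# with ~2 GB of KV cache on a 24 GB GPU:
print(suggest_ngl(40.0, 80, 2.0))  # 43
```

That lands in the same ballpark as the "40 layers on GPU" setup mentioned above.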

How is Apple Silicon different?

Apple Silicon uses unified memory — VRAM and RAM are the same physical pool. A Mac with 96 GB can use all 96 GB for model weights plus KV cache, with no discrete VRAM limit. This makes Macs uniquely capable for large models. The bottleneck is memory bandwidth: roughly 200 GB/s on M2 Pro, 400 GB/s on M2/M3 Max, and 800 GB/s on Ultra chips.

What are KV heads and why do they matter?

Modern models use Grouped Query Attention (GQA), where many query heads share a single KV head. Llama 3.1 8B has 32 query heads but 8 KV heads — the KV cache is 4× smaller than with classic multi-head attention. This calculator uses the KV head count directly, so estimates are accurate for GQA architectures. To check a model's parameters: llama-gguf-split --info model.gguf.
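The 4× saving falls straight out of the per-token cache size, which scales with the number of KV heads. A small sketch using the Llama 3.1 8B numbers from above:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_kv=2):
    """Per-token KV cache bytes: 2 (K and V) x layers x KV heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv

# Llama 3.1 8B: 32 layers, head_dim 128, fp16 cache
mha = kv_bytes_per_token(32, 32, 128)   # classic MHA: one KV head per query head
gqa = kv_bytes_per_token(32, 8, 128)    # GQA: 8 shared KV heads
print(mha // gqa)  # 4
```

Using the query-head count by mistake would overestimate the KV cache by exactly this ratio.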

Are Mixture-of-Experts models accurate here?

Partially. MoE models like DeepSeek V3 load all expert weights into VRAM but only activate a subset per token, so throughput memory patterns differ. This calculator shows the full weight footprint (correct), but actual runtime overhead for MoE routing is not captured. Treat MoE results as a lower bound — add ~10–15% for routing buffers.
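The suggested adjustment is a simple margin on top of the dense estimate. A one-line sketch, with the 10–15% range taken from the paragraph above:

```python
def moe_vram_range(dense_total_gb, routing_margin=(0.10, 0.15)):
    """Treat the dense estimate as a lower bound; add 10-15% for MoE routing."""
    lo, hi = routing_margin
    return dense_total_gb * (1 + lo), dense_total_gb * (1 + hi)

low, high = moe_vram_range(400.0)  # e.g. a large MoE weight + cache footprint
print(f"{low:.0f}-{high:.0f} GB")  # 440-460 GB
```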

This calculator runs entirely in your browser — no data is sent anywhere. Architecture parameters for preset models are sourced from official model cards and cross-checked against GGUF file sizes on Hugging Face. For exact byte-level analysis of a specific file, use llama-gguf-split --info model.gguf from the llama.cpp toolkit.