GGUF VRAM Calculator

Know exactly how much VRAM your model needs before you download. Pick a preset or enter your model's architecture parameters.

Model & Quantization
KV cache scales linearly with context length. Llama 3.1 max is 131,072 tokens.
Pass --cache-type-k q8_0 to llama.cpp to reduce KV memory.
Memory Requirements
Total VRAM Required
GB
Model weights
KV cache
Runtime overhead ~0.50 GB
Total
Select a model to see memory requirements.

GPU Compatibility

How the math works

VRAM has three components: quantized model weights, the KV attention cache, and a fixed runtime overhead of ~500 MB for CUDA context, activations, and scratch buffers.

model_gb = params_B × 10⁹ × bits_per_weight ÷ 8 ÷ 1024³
kv_gb = 2 × n_layers × n_kv_heads × head_dim × ctx_len × bytes_per_kv ÷ 1024³
total_gb = model_gb + kv_gb + 0.5

The factor of 2 accounts for both K and V tensors. n_kv_heads uses the actual GQA count, not query heads — Llama 3.1 8B has 32 query heads but only 8 KV heads, making the KV cache 4× smaller. The 0.5 GB overhead is a conservative median; real usage may be 200–800 MB depending on batch size and backend.
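The three formulas can be sketched in a few lines of Python. This is a minimal illustration of the math above, not the calculator's actual implementation; the Q4_K_M figure of ~4.8 bits/weight is an approximation.

```python
def vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
            head_dim, ctx_len, bytes_per_kv=2, overhead_gb=0.5):
    """Total VRAM estimate in GB: weights + KV cache + fixed overhead."""
    gib = 1024 ** 3
    model_gb = params_b * 1e9 * bits_per_weight / 8 / gib
    # The factor of 2 covers both the K and the V tensor per layer.
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_kv / gib
    return model_gb + kv_gb + overhead_gb

# Llama 3.1 8B at Q4_K_M (~4.8 bits/weight), 8K context, fp16 KV cache:
# 32 layers, 8 KV heads, head_dim 128
print(round(vram_gb(8.03, 4.8, 32, 8, 128, 8192), 2))  # ≈ 5.99
```

At 8K context the fp16 KV cache for this architecture is exactly 1 GiB, so most of the budget is weights.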

Quantization Reference

All GGUF quantization types with effective bits per weight. "K" variants use k-quants (block-wise mixed precision) for better quality at the same file size.

Type Bits/weight Quality Notes

Frequently Asked Questions

What's the best quantization for a given VRAM budget?

Q4_K_M is the standard recommendation — best size-to-quality tradeoff. If you have a bit more headroom, Q5_K_M is noticeably sharper. For near-original quality, Q8_0 is effectively lossless. Avoid Q2 and Q3 variants unless severely constrained — perplexity degrades visibly.

Why does context length affect VRAM so much?

The KV cache grows linearly with context. A 70B model at 4K context uses ~2 GB for KV, but at 128K context that same model needs ~64 GB for the cache alone — before loading weights. This is why long-context workloads require quantized KV cache (--cache-type-k q8_0 in llama.cpp) or CPU offloading.
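The 32× jump from 4K to 128K can be checked directly. The sketch below assumes Llama 3.1 70B's layout (80 layers, 8 KV heads, head_dim 128); with GQA the fp16 cache comes out somewhat below the round figures quoted above, but the linear scaling is the same.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_kv=2):
    """KV cache size in GiB; the factor of 2 covers K and V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_kv / 1024**3

# Llama 3.1 70B: 80 layers, 8 KV heads, head_dim 128, fp16 cache
ctx_4k = kv_cache_gb(80, 8, 128, 4_096)
ctx_128k = kv_cache_gb(80, 8, 128, 131_072)
print(f"{ctx_4k:.2f} GiB at 4K -> {ctx_128k:.1f} GiB at 128K")  # 32x growth
# Quantizing the cache to q8_0 (1 byte per value) halves both numbers.
```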

What if the model doesn't fit on my GPU?

llama.cpp supports partial GPU offload via -ngl (number of GPU layers). Load as many layers as fit in VRAM, run the rest on CPU RAM. The calculator shows a suggested layer count based on a 24 GB GPU when the model exceeds that threshold. Partial offload is slower but fully functional — many users run 70B models with 40 layers on GPU and the rest on CPU.
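One plausible way to derive a suggested -ngl value is to divide the VRAM left after the KV cache and overhead by the per-layer weight size. This is a hypothetical sketch, not the calculator's documented logic, and it assumes all layers are equally sized (in practice the embedding and output layers differ).

```python
def suggest_ngl(model_gb, n_layers, kv_gb, vram_budget_gb=24.0, overhead_gb=0.5):
    """Rough -ngl suggestion: how many layers fit after KV cache and overhead."""
    per_layer_gb = model_gb / n_layers          # assumes uniform layer size
    free_gb = vram_budget_gb - kv_gb - overhead_gb
    return max(0, min(n_layers, int(free_gb / per_layer_gb)))

# 70B at Q4_K_M is roughly 40 GB of weights across 80 layers;
# with ~2 GB of KV cache on a 24 GB GPU:
print(suggest_ngl(40.0, 80, 2.0))  # 43
```

That lands in the same ballpark as the "40 layers on GPU" setup mentioned above.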

How is Apple Silicon different?

Apple Silicon uses unified memory — VRAM and RAM are the same physical pool. A Mac with 96 GB can use all 96 GB for model weights plus KV cache, with no discrete VRAM limit. This makes Macs uniquely capable for large models. The bottleneck is memory bandwidth: roughly 200 GB/s on M2 Pro, 400 GB/s on M2/M3 Max, and 800 GB/s on Ultra chips.

What are KV heads and why do they matter?

Modern models use Grouped Query Attention (GQA), where many query heads share a single KV head. Llama 3.1 8B has 32 query heads but 8 KV heads — the KV cache is 4× smaller than with classic multi-head attention. This calculator uses the KV head count directly, so estimates are accurate for GQA architectures. To check a model's parameters: llama-gguf-split --info model.gguf.
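The 4× saving falls straight out of the per-token cache size, which scales with the number of KV heads. A small sketch using the Llama 3.1 8B numbers from above:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_kv=2):
    """Per-token KV cache bytes: 2 (K and V) x layers x KV heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv

# Llama 3.1 8B: 32 layers, head_dim 128, fp16 cache
mha = kv_bytes_per_token(32, 32, 128)   # classic MHA: one KV head per query head
gqa = kv_bytes_per_token(32, 8, 128)    # GQA: 8 shared KV heads
print(mha // gqa)  # 4
```

Using the query-head count by mistake would overestimate the KV cache by exactly this ratio.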

Are Mixture-of-Experts models accurate here?

Partially. MoE models like DeepSeek V3 load all expert weights into VRAM but only activate a subset per token, so throughput memory patterns differ. This calculator shows the full weight footprint (correct), but actual runtime overhead for MoE routing is not captured. Treat MoE results as a lower bound — add ~10–15% for routing buffers.
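The suggested adjustment is a simple margin on top of the dense estimate. A one-line sketch, with the 10–15% range taken from the paragraph above:

```python
def moe_vram_range(dense_total_gb, routing_margin=(0.10, 0.15)):
    """Treat the dense estimate as a lower bound; add 10-15% for MoE routing."""
    lo, hi = routing_margin
    return dense_total_gb * (1 + lo), dense_total_gb * (1 + hi)

low, high = moe_vram_range(400.0)  # e.g. a large MoE weight + cache footprint
print(f"{low:.0f}-{high:.0f} GB")  # 440-460 GB
```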

This calculator runs entirely in your browser — no data is sent anywhere. Architecture parameters for preset models are sourced from official model cards and cross-checked against GGUF file sizes on Hugging Face. For exact byte-level analysis of a specific file, use llama-gguf-split --info model.gguf from the llama.cpp toolkit.