GGUF VRAM Calculator
Know exactly how much VRAM your model needs before you download. Pick a preset or enter your model's architecture parameters.
Tip: pass --cache-type-k q8_0 to llama.cpp to reduce KV memory.
GPU Compatibility
How the math works
VRAM has three components: quantized model weights, the KV attention cache, and a fixed runtime overhead of ~500 MB for CUDA context, activations, and scratch buffers. The KV cache in bytes is 2 × n_layers × n_kv_heads × head_dim × ctx_len × bytes_per_element.
The 2× factor accounts for both K and V tensors. n_kv_heads uses the actual GQA count, not query heads — Llama 3.1 8B has 32 query heads but only 8 KV heads, making the KV cache 4× smaller. The 0.5 GB overhead is a conservative median; real usage may be 200–800 MB depending on batch size and backend.
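As a sketch, the whole estimate is a few lines of arithmetic. The function below is illustrative, not the calculator's actual code; the example uses Llama 3.1 8B's published shape (8.03B parameters, 32 layers, 8 KV heads, head dimension 128) at Q4_K_M's roughly 4.85 bits per weight:

```python
def estimate_vram_gb(n_params, bits_per_weight, n_layers, n_kv_heads,
                     head_dim, ctx_len, kv_bytes_per_elem=2, overhead_gb=0.5):
    """Rough total: quantized weights + KV cache + fixed runtime overhead."""
    weight_bytes = n_params * bits_per_weight / 8
    # 2x for the K and V tensors; n_kv_heads is the GQA count, not query heads
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes_per_elem
    return (weight_bytes + kv_bytes) / 2**30 + overhead_gb

# Llama 3.1 8B at Q4_K_M with an 8192-token f16 KV cache:
print(round(estimate_vram_gb(8.03e9, 4.85, 32, 8, 128, 8192), 1))  # -> 6.0
```

Doubling the context to 16k tokens adds another 1 GiB here, which is exactly the linear KV growth.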
Quantization Reference
All GGUF quantization types with effective bits per weight. "K" variants use k-quants (block-wise mixed precision) for better quality at the same file size.
| Type | Bits/weight | Quality | Notes |
|---|---|---|---|
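As a worked example of how bits per weight translate into weight size: Q4_0, Q8_0, and F16 follow exactly from their block formats, while the k-quant figures below are approximate effective rates (the true rate varies slightly with the tensor mix). The 70.6B parameter count is roughly Llama 3.1 70B.

```python
# Approximate effective bits per weight for common GGUF types.
BPW = {"Q2_K": 2.63, "Q3_K_M": 3.91, "Q4_0": 4.5, "Q4_K_M": 4.85,
       "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(n_params, quant):
    """Quantized weight size in GiB: parameters x (bits/weight) / 8."""
    return n_params * BPW[quant] / 8 / 2**30

# A 70B model at three quantization levels:
for q in ("Q4_K_M", "Q8_0", "F16"):
    print(f"{q}: {weight_gb(70.6e9, q):.1f} GiB")
```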
Frequently Asked Questions
What's the best quantization for a given VRAM budget?
As a rule of thumb, pick the largest quantization whose total estimate (weights + KV cache + overhead) fits with some headroom. Q4_K_M is the usual sweet spot; step up to Q5_K_M or Q6_K if you have room, and drop to Q3_K_M or below only when nothing else fits.
Why does context length affect VRAM so much?
The KV cache grows linearly with context length: every token stores a key and a value vector in every layer, so doubling the context doubles the cache. To shrink it, use a quantized KV cache (--cache-type-k q8_0 in llama.cpp) or CPU offloading.
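To see the linear growth concretely, here is a sketch using a Llama-3.1-8B-shaped attention configuration (an assumption); treating a q8_0 cache element as 1 byte is itself an approximation of its 8.5 bits per weight:

```python
def kv_cache_gib(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

print(kv_cache_gib(8192))    # f16 cache at 8k context -> 1.0 GiB
print(kv_cache_gib(131072))  # 16x the context, 16x the cache -> 16.0 GiB
print(kv_cache_gib(131072, bytes_per_elem=1))  # ~q8_0 cache -> 8.0 GiB
```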
What if the model doesn't fit on my GPU?
Use llama.cpp's partial offloading via -ngl (number of GPU layers): load as many layers as fit in VRAM and run the rest from CPU RAM. The calculator shows a suggested layer count based on a 24 GB GPU when the model exceeds that threshold. Partial offload is slower but fully functional; many users run 70B models with 40 layers on GPU and the rest on CPU.
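A suggested layer split can be approximated like this. It is a simplification, not the calculator's actual logic: it assumes weight bytes are spread evenly across layers and reserves a fixed KV-plus-overhead budget.

```python
def gpu_layers(weight_gib, n_layers, vram_gib=24.0, kv_gib=1.0, overhead_gib=0.5):
    """Layers that fit on GPU after reserving the KV cache and overhead."""
    per_layer = weight_gib / n_layers          # assume uniform layer size
    budget = vram_gib - kv_gib - overhead_gib  # VRAM left for weights
    return max(0, min(n_layers, int(budget / per_layer)))

# A 70B model at Q4_K_M (~39.9 GiB of weights, 80 layers) on a 24 GB GPU:
print(gpu_layers(39.9, 80))  # -> 45
```

Real splits are not perfectly uniform (the embedding and output tensors differ in size), so treat the result as a starting point and adjust -ngl from there.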
How is Apple Silicon different?
Apple Silicon uses unified memory: the CPU and GPU share a single pool, so the budget is total system RAM rather than a dedicated VRAM pool. macOS reserves part of that memory for the system, so the GPU cannot address all of it.
What are KV heads and why do they matter?
Grouped-query attention (GQA) shares each key/value head across several query heads, so the KV cache scales with the smaller KV head count, not the query head count. To check a model's actual head counts, inspect the file with llama-gguf-split --info model.gguf.
Are Mixture-of-Experts models accurate here?
Only if you use the total (not active) parameter count: all experts' weights must be resident in memory even though only a few are active per token. The KV cache depends only on the shared attention configuration.
This calculator runs entirely in your browser — no data is sent anywhere. Architecture parameters for preset models are sourced from official model cards and cross-checked against GGUF file sizes on Hugging Face. For exact byte-level analysis of a specific file, use llama-gguf-split --info model.gguf from the llama.cpp toolkit.