AI VRAM Calculator
Calculate exact GPU memory requirements for running, fine-tuning, or training large language models. Pick your model and workload — get a VRAM budget and matching hardware from our catalog of 640 systems.
Configure your workload
Estimate
VRAM breakdown
- Model weights
- 16 GB
- KV cache
- 1 GB
- Activations / buffers
- 1.6 GB
Hardware that can run this
8 systems from our catalog with matching GPUs, sorted by fit and price.

PNY NVIDIA Quadro RTX A5000 24GB GDDR6 Graphics Card (One Pack)

PNY NVIDIA RTX A5000 Professional Graphic Card 24GB GDDR6 PCI Express 4.0 x16, Dual Slot, 3X DisplayPort, 8K Support, Ultra-Quiet Active Fan

Tracer VIII Ultra Gaming I16U 300

VSXR5 540CU
Tracer VII Gaming I16G LC 300
Tracer VII Gaming I16G LC 400

VSXR5 340R7
Super Workstation531AD-I
How this is calculated
Bytes per parameter
- FP16 / BF16: 2 bytes — training + high-quality inference
- INT8: 1 byte — ~2× memory savings, minor quality loss
- INT4: 0.5 bytes — ~4× memory savings, used by GGUF / GPTQ / AWQ
Workload overhead
- Inference: weights + KV cache + ~10% runtime buffers
- LoRA fine-tune: frozen weights + adapter states + activation memory (QLoRA if INT4)
- Full training (Adam): weights + gradients + 2× optimizer states + activations, typically ~16–20 GB per 1B params in FP16
Usage pattern overhead
Longer contexts and more concurrent requests grow the KV cache linearly. Serving 16 production requests at 8k tokens can add tens of GB on top of the model weights. Flash-attention-2 and PagedAttention (vLLM) improve efficiency but don't eliminate the cost.
Caveats
Estimates are approximate. Real requirements depend on model architecture (attention heads, hidden dim, MQA/GQA), kernel choice, and framework overhead. Budget 10–20% headroom beyond the minimum.