How to Size Your First AI Server: A Practical VRAM and RAM Calculator
Tutorials

January 3, 2026
9 min read
tutorial · vram · sizing · llm-inference · fine-tuning · hardware-guide · ram · gpu-memory · buying-guide

TL;DR: For LLM inference, VRAM is almost everything. Multiply your model's parameter count (in billions) by 2 to get the FP16 memory requirement in GB (7B model = ~14 GB VRAM minimum). For training and fine-tuning, you need 4-8x more VRAM than inference. System RAM should be at least 2x your VRAM for comfortable operation.

---

The Question Everyone Gets Wrong

"I need an AI server. What specs do I need?"

This question gets asked constantly, and the answers are usually either too vague ("it depends on your workload") or too specific ("get an H100"). Neither helps someone trying to make their first hardware purchase.

The truth is that AI hardware sizing follows predictable rules once you understand what's actually happening in memory. This guide provides the framework to size your own requirements—plus specific product recommendations from 690 servers and workstations in the AI Hardware Index.

Understanding VRAM: The Limiting Factor

For AI workloads, VRAM (GPU memory) is almost always the bottleneck. Here's why:

  • Model weights must fit in VRAM for efficient inference
  • Activations and gradients for training multiply memory requirements
  • Batch size scales directly with available memory
  • Context length for LLMs consumes additional VRAM

The Basic Formula

For inference (running a trained model):

  • FP16 (half precision): Parameters × 2 bytes = minimum VRAM
  • INT8 (8-bit quantization): Parameters × 1 byte = minimum VRAM
  • INT4 (4-bit quantization): Parameters × 0.5 bytes = minimum VRAM

| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
| --- | --- | --- | --- |
| 7B parameters | 14 GB | 7 GB | 3.5 GB |
| 13B parameters | 26 GB | 13 GB | 6.5 GB |
| 30B parameters | 60 GB | 30 GB | 15 GB |
| 70B parameters | 140 GB | 70 GB | 35 GB |

Important caveat: These are minimums for the model weights alone. Actual requirements are 20-40% higher due to:

  • KV cache for context (grows with context length)
  • CUDA overhead and memory fragmentation
  • Framework overhead (PyTorch, vLLM, etc.)
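
If you'd rather compute this than read it off the table, the same rule of thumb fits in a few lines of Python. This is a minimal sketch, not any library's API: the function name, the 30% default buffer, and the decimal-GB convention are illustrative assumptions.

```python
# Weights-only VRAM estimate plus the 20-40% overhead buffer described above.
# Treats 1 GB as 10^9 bytes, so "billions of parameters x bytes per parameter"
# lands directly in GB.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_inference_vram_gb(params_billions: float,
                               precision: str = "fp16",
                               overhead: float = 0.30) -> float:
    """Minimum VRAM for weights, padded for KV cache, CUDA, and framework overhead."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + overhead)

for size in (7, 13, 30, 70):
    print(f"{size}B @ INT4: ~{estimate_inference_vram_gb(size, 'int4'):.1f} GB VRAM")
```

For a 7B model at INT4 this comes out around 4.5 GB, which is why an 8 GB card is a comfortable starting point once you leave room for context.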

Workload Categories and Requirements

Category 1: LLM Inference (Running Models)

Running pre-trained LLMs for chat, completion, or embedding generation.

VRAM Requirements:

| Use Case | Model Size | Minimum VRAM | Recommended VRAM |
| --- | --- | --- | --- |
| Personal assistant | 7B (Q4) | 6 GB | 8-12 GB |
| Code completion | 13B-34B (Q4) | 12 GB | 16-24 GB |
| Production API | 70B (Q4) | 40 GB | 48-80 GB |
| Enterprise RAG | 70B (FP16) | 160 GB | 192+ GB |

System RAM: 2x VRAM minimum (for model loading, preprocessing)

CPU: Less critical—4-8 cores sufficient for most inference

Category 2: Fine-Tuning (LoRA/QLoRA)

Adapting pre-trained models to your specific data using parameter-efficient methods.

VRAM Requirements:

| Model Size | LoRA (INT8 base) | QLoRA (INT4 base) |
| --- | --- | --- |
| 7B | 16-24 GB | 8-16 GB |
| 13B | 32-40 GB | 16-24 GB |
| 30B | 80+ GB | 40-48 GB |
| 70B | 160+ GB | 80+ GB |

Fine-tuning requires storing:

  • Base model weights (frozen)
  • LoRA adapter weights (trainable)
  • Optimizer states
  • Gradients for trainable parameters
  • Batch of training examples

System RAM: 4x VRAM recommended (dataset preprocessing, checkpointing)

CPU: 8-16 cores for data loading and preprocessing
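
To see how those pieces add up, here is a rough sketch under loudly stated assumptions: LoRA adapters at about 1% of base parameters, FP16 adapter weights and gradients, FP32 Adam states (8 bytes per trainable parameter), and a flat placeholder for activations. None of these constants come from a specific framework; real usage depends on LoRA rank, target modules, batch size, and sequence length.

```python
# Rough LoRA fine-tuning memory breakdown. All constants are illustrative
# assumptions, not measured values.

def estimate_lora_vram_gb(params_billions: float,
                          base_bytes_per_param: float = 1.0,   # INT8 base; use 0.5 for QLoRA
                          adapter_fraction: float = 0.01,      # trainable share of base params
                          activation_buffer_gb: float = 4.0) -> float:
    base_gb = params_billions * base_bytes_per_param    # frozen base weights
    trainable_b = params_billions * adapter_fraction    # LoRA adapter params (billions)
    adapter_gb = trainable_b * 2                        # FP16 adapter weights
    grad_gb = trainable_b * 2                           # FP16 gradients (trainable params only)
    optimizer_gb = trainable_b * 8                      # FP32 Adam momentum + variance
    return base_gb + adapter_gb + grad_gb + optimizer_gb + activation_buffer_gb

print(f"7B LoRA on INT8 base: ~{estimate_lora_vram_gb(7):.1f} GB before batch scaling")
```

The table above quotes higher ranges mostly because activations grow with batch size and sequence length, which the flat 4 GB placeholder here does not capture.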

Category 3: Full Training (From Scratch)

Training models from random initialization or continued pre-training.

VRAM Requirements:

| Model Size | Single GPU | Multi-GPU (Data Parallel) |
| --- | --- | --- |
| 1B | 24-48 GB | 16 GB × 4 |
| 7B | 80+ GB | 48 GB × 4-8 |
| 13B+ | 160+ GB | 80 GB × 8+ |

Full training requires 4-8x more memory than inference due to:

  • Optimizer states (Adam: 2x model size)
  • Gradients (same size as model)
  • Activations (scales with batch size and sequence length)

System RAM: 4-8x VRAM (checkpoint saving, data staging)

CPU: 16-32 cores for preprocessing pipelines
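
The gap between inference and training comes almost entirely from optimizer state. A common back-of-the-envelope figure is ~16 bytes per parameter for mixed-precision Adam, before activations; the sketch below just adds up those terms and assumes no sharding (ZeRO/FSDP) or activation checkpointing.

```python
# Mixed-precision Adam memory per parameter, before activations:
#   FP16 weights (2 B) + FP16 gradients (2 B) + FP32 master weights (4 B)
#   + FP32 Adam momentum and variance (8 B) = ~16 bytes per parameter.

def estimate_training_vram_gb(params_billions: float) -> dict:
    weights = params_billions * 2
    grads = params_billions * 2
    master = params_billions * 4
    adam_states = params_billions * 8
    return {
        "weights": weights,
        "gradients": grads,
        "fp32_master": master,
        "adam_states": adam_states,
        "total_before_activations": weights + grads + master + adam_states,
    }

print(estimate_training_vram_gb(7))  # ~112 GB before activations: multi-GPU territory
```

That is why a 7B model that runs inference on a single 24 GB card needs several 80 GB GPUs (or aggressive sharding) to train.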

Category 4: Computer Vision

Image classification, object detection, segmentation, generation.

| Task | Model | Minimum VRAM | Recommended |
| --- | --- | --- | --- |
| Classification (ResNet) | ResNet-50 | 4 GB | 8 GB |
| Detection (YOLO) | YOLOv8x | 8 GB | 12 GB |
| Segmentation | SAM | 8 GB | 16 GB |
| Image Generation | SDXL | 12 GB | 16-24 GB |
| Video Generation | SVD | 24 GB | 48 GB |

Vision models are generally more VRAM-efficient than LLMs at equivalent capability levels.

Hardware Tiers and Recommendations

Based on the requirements above, here's how to match workloads to hardware:

Tier 1: Entry-Level ($3,000-$8,000)

Best for: Personal inference, small model fine-tuning, vision prototyping

| Configuration | VRAM | Capable Of |
| --- | --- | --- |
| 1× RTX 4090 | 24 GB | 13B inference (Q4), 7B fine-tuning |
| 1× RTX 5090 | 32 GB | 30B inference (Q4), 13B fine-tuning |
| 2× RTX 4090 | 48 GB | 30B inference (Q8), 13B fine-tuning |

Tier 2: Professional ($15,000-$40,000)

Best for: Production inference, 70B fine-tuning, multi-model serving

| Configuration | VRAM | Capable Of |
| --- | --- | --- |
| 2× RTX 5090 | 64 GB | 70B inference (Q4), 30B fine-tuning |
| 4× RTX 4090 | 96 GB | 70B inference (Q8), enterprise RAG |
| 1× A100 80GB | 80 GB | 70B inference (Q4), training small models |

Tier 3: Enterprise ($50,000-$200,000)

Best for: 70B training, multi-tenant inference, research

| Configuration | VRAM | Capable Of |
| --- | --- | --- |
| 4× A100 80GB | 320 GB | 70B fine-tuning, 7B training |
| 8× A100 80GB | 640 GB | 70B training, multi-model clusters |
| 4× H100 80GB | 320 GB | Fast 70B fine-tuning, production inference |

Tier 4: Hyperscale ($200,000+)

Best for: Foundation model training, large-scale inference clusters

| Configuration | VRAM | Capable Of |
| --- | --- | --- |
| 8× H100 SXM | 640 GB | 70B+ training, production LLM serving |
| HGX B200 (8 GPU) | 1.5+ TB | 400B+ training, frontier models |
| HGX B300 (8 GPU) | 2+ TB | Cutting-edge research |

The Sizing Calculator

Use this decision tree to find your starting point:

Step 1: Identify Your Primary Workload

  • Inference only → Go to Step 2A
  • Fine-tuning (LoRA) → Go to Step 2B
  • Full training → Go to Step 2C

Step 2A: Inference Sizing

What's your target model size?

| Model Size | With Quantization (Q4) | Without Quantization (FP16) |
| --- | --- | --- |
| 7B | 8 GB VRAM | 16 GB VRAM |
| 13B | 12 GB VRAM | 32 GB VRAM |
| 30B | 24 GB VRAM | 64 GB VRAM |
| 70B | 48 GB VRAM | 160 GB VRAM |

Add 20-30% buffer for KV cache and overhead.

Step 2B: Fine-Tuning Sizing

What's your target model size?

| Model Size | QLoRA (INT4 base) | LoRA (INT8 base) |
| --- | --- | --- |
| 7B | 12 GB VRAM | 24 GB VRAM |
| 13B | 20 GB VRAM | 40 GB VRAM |
| 30B | 48 GB VRAM | 80 GB VRAM |
| 70B | 80+ GB VRAM | 160+ GB VRAM |

Step 2C: Training Sizing

Rule of thumb: 4-8× the FP16 model size for optimizer states and gradients.

| Model Size | Minimum VRAM | Recommended Multi-GPU |
| --- | --- | --- |
| 1B | 32 GB | 24 GB × 2 |
| 7B | 160 GB | 80 GB × 4 |
| 13B+ | 320+ GB | 80 GB × 8 |

Step 3: System RAM

Take your total VRAM and multiply:

  • Inference: VRAM × 2 = minimum RAM
  • Fine-tuning: VRAM × 3-4 = minimum RAM
  • Training: VRAM × 4-8 = minimum RAM

Step 4: CPU Cores

  • Inference: 4-8 cores (not the bottleneck)
  • Fine-tuning: 8-16 cores (data loading)
  • Training: 16-32 cores (preprocessing pipelines)
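
The whole decision tree fits in one small function. The buffer factors and multipliers below are the ones from Steps 2-4; the function name, workload labels, and output layout are illustrative choices rather than any standard tool.

```python
# One-function version of Steps 1-4. bytes_per_param: 0.5 for Q4, 1.0 for
# INT8/Q8, 2.0 for FP16. RAM and CPU figures use the upper end of the ranges
# above to stay conservative.

RAM_MULTIPLIER = {"inference": 2, "fine-tuning": 4, "training": 8}
CPU_CORES = {"inference": "4-8", "fine-tuning": "8-16", "training": "16-32"}

def size_server(params_billions: float, workload: str,
                bytes_per_param: float = 0.5, buffer: float = 0.25) -> dict:
    base_gb = params_billions * bytes_per_param
    if workload == "inference":
        vram = base_gb * (1 + buffer)               # weights + KV cache/overhead buffer
    elif workload == "fine-tuning":
        vram = base_gb * 2 * (1 + buffer)           # rough: adapters, optimizer, activations
    elif workload == "training":
        vram = params_billions * 16 * (1 + buffer)  # ~16 bytes/param mixed-precision Adam
    else:
        raise ValueError(f"unknown workload: {workload}")
    vram = round(vram)
    return {"vram_gb": vram,
            "ram_gb": vram * RAM_MULTIPLIER[workload],
            "cpu_cores": CPU_CORES[workload]}

print(size_server(13, "inference"))     # 13B at Q4
print(size_server(70, "fine-tuning"))   # 70B QLoRA (INT4 base)
print(size_server(7, "training"))       # 7B full training
```

The Step 2A-2C tables above round these raw estimates up to the nearest practical card size, so their figures will sometimes read a little higher than what this sketch prints.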

Common Mistakes to Avoid

Mistake 1: Underestimating VRAM for Context

LLM context windows consume VRAM that scales with sequence length:

| Context Length | Additional VRAM (70B model) |
| --- | --- |
| 4K tokens | ~2 GB |
| 16K tokens | ~8 GB |
| 32K tokens | ~16 GB |
| 128K tokens | ~64 GB |

If you're running long-context applications, add this to your base model requirements.
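
The KV cache follows a simple formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per element, per token. The defaults below are an assumption loosely modeled on a 70B-class model with grouped-query attention, not figures from any specific model card; models without GQA cache several times more.

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens.
# Defaults assume a 70B-class configuration with grouped-query attention.

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 1e9

for ctx in (4_096, 16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

With 8 KV heads this lands somewhat below the table above, which budgets roughly 0.5 MB per token; the point either way is that long contexts can cost as much VRAM as the weights themselves.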

Mistake 2: Ignoring Multi-GPU Interconnect

For multi-GPU setups, interconnect bandwidth matters:

  • PCIe 4.0 x16: 32 GB/s per GPU (fine for inference)
  • PCIe 5.0 x16: 64 GB/s per GPU (better for training)
  • NVLink: 600-900 GB/s (required for efficient multi-GPU training)

Consumer GPUs (RTX 4090/5090) are PCIe-only. For serious multi-GPU training, look for NVLink-capable systems (A100, H100).
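
A quick way to feel why this matters: naive data-parallel training exchanges gradients roughly the size of the model every step. The numbers below ignore all-reduce algorithms, compute overlap, and gradient compression; they are only meant to show the order-of-magnitude gap between the interconnects.

```python
# Time to move one step's worth of FP16 gradients for a 7B model over different
# interconnects. Purely illustrative; real collectives and overlap change this.

grad_gb = 7 * 2  # 7B parameters * 2 bytes (FP16 gradients) ~= 14 GB per step

for name, bandwidth_gbps in [("PCIe 4.0 x16", 32), ("PCIe 5.0 x16", 64), ("NVLink (A100)", 600)]:
    ms = grad_gb / bandwidth_gbps * 1000
    print(f"{name:>13}: ~{ms:.0f} ms per gradient exchange")
```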

Mistake 3: Treating RAM as an Afterthought

Insufficient system RAM causes:

  • Model loading failures
  • Slow checkpoint saving
  • Dataset preprocessing bottlenecks
  • OOM kills during batch operations

Mistake 4: Overspending on CPU

AI workloads are GPU-bound. A 64-core CPU won't help if you're VRAM-limited. Spend the budget on GPU instead.

Putting It Together: Example Builds

Example 1: Startup MVP ($5,000)

Goal: Run Llama 7B/13B for internal tools

Requirements: 13B at Q4 = 12GB VRAM minimum

Recommendation: Single RTX 4090 workstation (24 GB)

  • Handles up to 30B quantized
  • Room for SDXL and vision models
  • Upgrade path to a second GPU

Example 2: ML Team Shared Server ($30,000)

Goal: Fine-tune models up to 30B, serve multiple users

Requirements: 30B fine-tuning ≈ 48 GB+ VRAM (QLoRA; closer to 80 GB for LoRA on an INT8 base); multi-user sharing warrants roughly a 2x buffer

Recommendation: 2× A6000 (96 GB total) or 4× RTX 4090 (96 GB)

  • Sufficient for parallel fine-tuning jobs
  • Can serve 70B for inference
  • Enterprise-grade reliability

Example 3: Production Inference ($80,000)

Goal: Serve 70B model at production scale

Requirements: 70B at Q8 ≈ 70 GB VRAM for the weights, plus roughly 50% headroom for serving overhead and concurrency

Recommendation: 2× H100 80GB or 4× A100 80GB

  • Low-latency inference
  • Room for multiple model versions
  • High throughput capacity
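
For transparency, here is the arithmetic behind the three example requirements, using the guide's own figures and rules of thumb (Q4 ≈ 0.5 bytes per parameter, Q8 ≈ 1 byte, plus the buffers stated above). This is just a sanity check, not output from any sizing tool.

```python
# Rough arithmetic behind the three example builds above.

print(13 * 0.5 * 1.25)  # Example 1: 13B @ Q4 + 25% buffer ~= 8 GB, well inside one 24 GB RTX 4090
print(48 * 2)           # Example 2: ~48 GB for 30B fine-tuning x 2 multi-user buffer = 96 GB total
print(70 * 1.5)         # Example 3: 70 GB for 70B @ Q8 x 1.5 serving headroom ~= 105 GB -> 2x H100 or 4x A100
```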

Next Steps

  1. Identify your workload using the categories above
  2. Calculate VRAM requirements for your target models
  3. Add appropriate buffers (20-50% depending on workload)
  4. Size RAM and CPU based on multiples
  5. Browse products in your budget tier

---

Sizing guidelines based on published benchmarks and community testing. Actual requirements vary based on framework, optimization level, and specific model architecture.
