TL;DR: For LLM inference, VRAM is almost everything. Multiply your model's parameter count (in billions) by 2 to estimate FP16 memory in GB (7B model = ~14 GB VRAM minimum). For training and fine-tuning, you need 4-8x more VRAM than inference. System RAM should be at least 2x your VRAM for comfortable operation.
---
The Question Everyone Gets Wrong
"I need an AI server. What specs do I need?"
This question gets asked constantly, and the answers are usually either too vague ("it depends on your workload") or too specific ("get an H100"). Neither helps someone trying to make their first hardware purchase.
The truth is that AI hardware sizing follows predictable rules once you understand what's actually happening in memory. This guide provides the framework to size your own requirements—plus specific product recommendations from 690 servers and workstations in the AI Hardware Index.
Understanding VRAM: The Limiting Factor
For AI workloads, VRAM (GPU memory) is almost always the bottleneck. Here's why:
- Model weights must fit in VRAM for efficient inference
- Activations and gradients for training multiply memory requirements
- Batch size scales directly with available memory
- Context length for LLMs consumes additional VRAM
The Basic Formula
For inference (running a trained model):
- FP16 (half precision): Parameters × 2 bytes = minimum VRAM
- INT8 (8-bit quantization): Parameters × 1 byte = minimum VRAM
- INT4 (4-bit quantization): Parameters × 0.5 bytes = minimum VRAM
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| 7B parameters | 14 GB | 7 GB | 3.5 GB |
| 13B parameters | 26 GB | 13 GB | 6.5 GB |
| 30B parameters | 60 GB | 30 GB | 15 GB |
| 70B parameters | 140 GB | 70 GB | 35 GB |
Important caveat: These are minimums for the model weights alone. Actual requirements are 20-40% higher due to:
- KV cache for context (grows with context length)
- CUDA overhead and memory fragmentation
- Framework overhead (PyTorch, vLLM, etc.)
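As a quick sanity check, the rule of thumb above fits in a few lines of Python. This is a minimal sketch, not a framework-specific calculation; the 30% overhead factor is an assumption chosen to sit inside the 20-40% range noted in the caveat.

```python
# Minimal sketch: estimate inference VRAM from parameter count and precision.
# Bytes-per-parameter values match the table above; the overhead factor is an
# assumption (pick something in the 20-40% range from the caveat).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_vram_gb(params_billion: float, precision: str = "fp16",
                      overhead: float = 0.30) -> float:
    """Weights plus an allowance for KV cache and framework overhead, in GB."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

for size in (7, 13, 30, 70):
    print(f"{size}B  int4: ~{inference_vram_gb(size, 'int4'):5.1f} GB   "
          f"fp16: ~{inference_vram_gb(size, 'fp16'):6.1f} GB")
```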
Workload Categories and Requirements
Category 1: LLM Inference (Running Models)
Running pre-trained LLMs for chat, completion, or embedding generation.
VRAM Requirements:
| Use Case | Model Size | Minimum VRAM | Recommended VRAM |
|---|---|---|---|
| Personal assistant | 7B (Q4) | 6 GB | 8-12 GB |
| Code completion | 13B-34B (Q4) | 12 GB | 16-24 GB |
| Production API | 70B (Q4) | 40 GB | 48-80 GB |
| Enterprise RAG | 70B (FP16) | 160 GB | 192+ GB |
System RAM: 2x VRAM minimum (for model loading, preprocessing)
CPU: Less critical—4-8 cores sufficient for most inference
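Before buying anything, it is worth checking how much headroom an existing GPU actually has. A minimal sketch, assuming PyTorch is installed and a CUDA device is visible; the 13B/Q4 requirement figure comes from the table above:

```python
# Compare free VRAM on the current GPU against a rough requirement.
# Assumes PyTorch with a CUDA-capable GPU; adjust required_gb for your model.
import torch

required_gb = 13 * 0.5 * 1.3                # 13B at INT4 plus ~30% overhead
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb, total_gb = free_bytes / 1e9, total_bytes / 1e9

print(f"GPU memory: {free_gb:.1f} GB free of {total_gb:.1f} GB")
if free_gb < required_gb:
    print(f"~{required_gb:.0f} GB needed: consider a smaller model or lower precision")
```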
Category 2: Fine-Tuning (LoRA/QLoRA)
Adapting pre-trained models to your specific data using parameter-efficient methods.
VRAM Requirements:
| Model Size | LoRA (INT8 base) | QLoRA (INT4 base) |
|---|---|---|
| 7B | 16-24 GB | 8-16 GB |
| 13B | 32-40 GB | 16-24 GB |
| 30B | 80+ GB | 40-48 GB |
| 70B | 160+ GB | 80+ GB |
Fine-tuning requires storing:
- Base model weights (frozen)
- LoRA adapter weights (trainable)
- Optimizer states
- Gradients for trainable parameters
- Batch of training examples
System RAM: 4x VRAM recommended (dataset preprocessing, checkpointing)
CPU: 8-16 cores for data loading and preprocessing
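Those components can be added up explicitly. The sketch below is a back-of-envelope estimate under assumptions not stated in the tables above: adapter parameters at roughly 1% of the base model, AdamW states kept in FP32, and a flat activation allowance. In practice, activation memory scales with batch size and sequence length and often dominates, which is why the table figures run higher.

```python
# Back-of-envelope LoRA/QLoRA VRAM estimate (all figures in GB).
# Assumptions: adapter ~1% of base parameters, AdamW momentum + variance in
# fp32 (8 bytes per trainable param), flat activation allowance.

def finetune_vram_gb(params_billion: float, base_bytes_per_param: float,
                     adapter_fraction: float = 0.01,
                     activation_gb: float = 8.0) -> float:
    base      = params_billion * base_bytes_per_param        # frozen base weights
    adapter   = params_billion * adapter_fraction * 2        # trainable adapter (fp16)
    grads     = adapter                                       # gradients, same size
    optimizer = params_billion * adapter_fraction * 8         # AdamW m + v in fp32
    return base + adapter + grads + optimizer + activation_gb

for label, bytes_per in (("LoRA  (INT8 base)", 1.0), ("QLoRA (INT4 base)", 0.5)):
    print(f"13B {label}: ~{finetune_vram_gb(13, bytes_per):.0f} GB")
```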
Category 3: Full Training (From Scratch)
Training models from random initialization or continued pre-training.
VRAM Requirements:
| Model Size | Single GPU | Multi-GPU (Data Parallel) |
|---|---|---|
| 1B | 24-48 GB | 16 GB × 4 |
| 7B | 80+ GB | 48 GB × 4-8 |
| 13B+ | 160+ GB | 80 GB × 8+ |
Full training requires 4-8x more memory than inference due to:
- Optimizer states (Adam: 2x model size)
- Gradients (same size as model)
- Activations (scales with batch size and sequence length)
System RAM: 8x VRAM minimum for checkpoint saving
CPU: 16-32 cores for preprocessing pipelines
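A rough way to see where the 4-8x multiplier comes from: with mixed-precision training and Adam, every parameter carries about 16 bytes of persistent state before a single activation is stored. A minimal sketch of that arithmetic (the per-parameter byte counts are standard mixed-precision bookkeeping, not figures from the tables above):

```python
# Persistent training state per parameter with mixed precision + Adam:
#   2 B fp16 weights + 2 B fp16 gradients + 4 B fp32 master weights
#   + 4 B Adam momentum + 4 B Adam variance = 16 B per parameter.
# Activations come on top and scale with batch size and sequence length.

def training_state_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    return params_billion * bytes_per_param

for size in (1, 7, 13):
    print(f"{size:>2}B model: ~{training_state_gb(size):.0f} GB of weight/gradient/"
          f"optimizer state, before activations")
```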
Category 4: Computer Vision
Image classification, object detection, segmentation, generation.
| Task | Model | Minimum VRAM | Recommended |
|---|---|---|---|
| Classification (ResNet) | ResNet-50 | 4 GB | 8 GB |
| Detection (YOLO) | YOLOv8x | 8 GB | 12 GB |
| Segmentation | SAM | 8 GB | 16 GB |
| Image Generation | SDXL | 12 GB | 16-24 GB |
| Video Generation | SVD | 24 GB | 48 GB |
Vision models are generally more VRAM-efficient than LLMs at equivalent capability levels.
Hardware Tiers and Recommendations
Based on the requirements above, here's how to match workloads to hardware:
Tier 1: Entry-Level ($3,000-$8,000)
Best for: Personal inference, small model fine-tuning, vision prototyping
| Configuration | VRAM | Capable Of |
|---|---|---|
| 1× RTX 4090 | 24 GB | 13B inference (Q4), 7B fine-tuning |
| 1× RTX 5090 | 32 GB | 30B inference (Q4), 13B fine-tuning |
| 2× RTX 4090 | 48 GB | 30B inference (Q8), 13B fine-tuning |
Tier 2: Professional ($15,000-$40,000)
Best for: Production inference, 70B fine-tuning, multi-model serving
| Configuration | VRAM | Capable Of |
|---|---|---|
| 2× RTX 5090 | 64 GB | 70B inference (Q4), 30B fine-tuning |
| 4× RTX 4090 | 96 GB | 70B inference (Q8), enterprise RAG |
| 1× A100 80GB | 80 GB | 70B inference (Q4), training small models |
Tier 3: Enterprise ($50,000-$200,000)
Best for: 70B training, multi-tenant inference, research
| Configuration | VRAM | Capable Of |
|---|---|---|
| 4× A100 80GB | 320 GB | 70B fine-tuning, 7B training |
| 8× A100 80GB | 640 GB | 70B training, multi-model clusters |
| 4× H100 80GB | 320 GB | Fast 70B fine-tuning, production inference |
Tier 4: Hyperscale ($200,000+)
Best for: Foundation model training, large-scale inference clusters
| Configuration | VRAM | Capable Of |
|---|---|---|
| 8× H100 SXM | 640 GB | 70B+ training, production LLM serving |
| HGX B200 (8 GPU) | 1.5+ TB | 400B+ training, frontier models |
| HGX B300 (8 GPU) | 2+ TB | Cutting-edge research |
The Sizing Calculator
Use this decision tree to find your starting point:
Step 1: Identify Your Primary Workload
- Inference only → Go to Step 2A
- Fine-tuning (LoRA) → Go to Step 2B
- Full training → Go to Step 2C
Step 2A: Inference Sizing
What's your target model size?
| Model Size | With Quantization (Q4) | Without Quantization (FP16) |
|---|---|---|
| 7B | 8 GB VRAM | 16 GB VRAM |
| 13B | 12 GB VRAM | 32 GB VRAM |
| 30B | 24 GB VRAM | 64 GB VRAM |
| 70B | 48 GB VRAM | 160 GB VRAM |
Add 20-30% buffer for KV cache and overhead.
Step 2B: Fine-Tuning Sizing
What's your target model size?
| Model Size | QLoRA (INT4 base) | LoRA (INT8 base) |
|---|---|---|
| 7B | 12 GB VRAM | 24 GB VRAM |
| 13B | 20 GB VRAM | 40 GB VRAM |
| 30B | 48 GB VRAM | 80 GB VRAM |
| 70B | 80+ GB VRAM | 160+ GB VRAM |
Step 2C: Training Sizing
Rule of thumb: 4-8× the FP16 model size for optimizer states and gradients.
| Model Size | Minimum VRAM | Recommended Multi-GPU |
|---|---|---|
| 1B | 32 GB | 24 GB × 2 |
| 7B | 160 GB | 80 GB × 4 |
| 13B+ | 320+ GB | 80 GB × 8 |
Step 3: System RAM
Take your total VRAM and multiply:
- Inference: VRAM × 2 = minimum RAM
- Fine-tuning: VRAM × 3-4 = minimum RAM
- Training: VRAM × 4-8 = minimum RAM
Step 4: CPU Cores
- Inference: 4-8 cores (not the bottleneck)
- Fine-tuning: 8-16 cores (data loading)
- Training: 16-32 cores (preprocessing pipelines)
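The whole decision tree can be collapsed into one small function. This is a minimal sketch of the rules of thumb in Steps 2-4, not a precise planner; the multipliers are the same rough numbers used above.

```python
# Rough sizing for a target model, following Steps 1-4 above.
# Multipliers mirror the rule-of-thumb tables; treat the output as a
# starting point, not a guarantee.

def size_system(params_billion: float, workload: str, quantized: bool = True) -> dict:
    if workload == "inference":
        vram = params_billion * (0.5 if quantized else 2.0) * 1.3   # Q4/FP16 + buffer
        ram_mult, cores = 2, "4-8"
    elif workload == "finetune":
        vram = params_billion * (1.5 if quantized else 3.0)          # ~QLoRA vs ~LoRA
        ram_mult, cores = 4, "8-16"
    elif workload == "training":
        vram = params_billion * 16 + 40                               # state + activations
        ram_mult, cores = 8, "16-32"
    else:
        raise ValueError(f"unknown workload: {workload!r}")
    return {"min_vram_gb": round(vram), "min_ram_gb": round(vram * ram_mult),
            "cpu_cores": cores}

print(size_system(70, "inference"))              # 70B at Q4, close to the Step 2A figure
print(size_system(13, "finetune", quantized=False))   # 13B LoRA on an INT8 base
```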
Common Mistakes to Avoid
Mistake 1: Underestimating VRAM for Context
LLM context windows consume VRAM that scales with sequence length:
| Context Length | Additional VRAM (70B model) |
|---|---|
| 4K tokens | ~2 GB |
| 16K tokens | ~8 GB |
| 32K tokens | ~16 GB |
| 128K tokens | ~64 GB |
If you're running long-context applications, add this to your base model requirements.
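The KV cache can also be estimated directly from the model architecture. A minimal sketch, assuming a Llama-2-70B-style layout (80 layers, grouped-query attention with 8 KV heads, head dimension 128, FP16 cache); these architectural values are assumptions, and models without grouped-query attention need several times more.

```python
# KV cache per sequence: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes.
# Architecture values below are assumptions for a Llama-2-70B-style model with GQA.

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

for ctx in (4_096, 16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

Multiply by the number of concurrent sequences when batching; the table above quotes rounder, more conservative figures.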
Mistake 2: Ignoring Multi-GPU Interconnect
For multi-GPU setups, interconnect bandwidth matters:
- PCIe 4.0 x16: 32 GB/s per GPU (fine for inference)
- PCIe 5.0 x16: 64 GB/s per GPU (better for training)
- NVLink: 600-900 GB/s (required for efficient multi-GPU training)
Consumer GPUs (RTX 4090/5090) are PCIe-only. For serious multi-GPU training, look for NVLink-capable systems (A100, H100).
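To see why the link matters, consider the time to move a single set of FP16 gradients for a 7B model across each interconnect. A back-of-envelope sketch using the peak bandwidth figures above (sustained all-reduce throughput is lower):

```python
# Time to move one full set of fp16 gradients for a 7B model over each link.
# Bandwidths are the peak figures quoted above; real throughput is lower.

grad_gb = 7 * 2                      # 7B params x 2 bytes = 14 GB of gradients
links_gbps = {"PCIe 4.0 x16": 32, "PCIe 5.0 x16": 64, "NVLink (H100)": 900}

for name, bw in links_gbps.items():
    print(f"{name:>13}: ~{grad_gb / bw * 1000:.0f} ms per gradient exchange")
```

That exchange happens on every training step, which is why PCIe-only multi-GPU rigs are fine for inference but scale poorly for training.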
Mistake 3: Treating RAM as an Afterthought
Insufficient system RAM causes:
- Model loading failures
- Slow checkpoint saving
- Dataset preprocessing bottlenecks
- OOM kills during batch operations
Mistake 4: Overspending on CPU
AI workloads are GPU-bound. A 64-core CPU won't help if you're VRAM-limited. Spend the budget on GPU instead.
Putting It Together: Example Builds
Example 1: Startup MVP ($5,000)
Goal: Run Llama 7B/13B for internal tools
Requirements: 13B at Q4 = 12GB VRAM minimum
Recommendation: Single RTX 4090 workstation (24GB)
- Handles up to 30B quantized
- Room for SDXL and vision models
- Upgrade path to second GPU
Example 2: ML Team Shared Server ($30,000)
Goal: Fine-tune models up to 30B, serve multiple users
Requirements: 30B fine-tuning (QLoRA) = 48GB+ VRAM; multi-user serving = 2x buffer
Recommendation: 2× A6000 (96GB total) or 4× RTX 4090 (96GB)
- Sufficient for parallel fine-tuning jobs
- Can serve 70B for inference
- Enterprise-grade reliability
Example 3: Production Inference ($80,000)
Goal: Serve 70B model at production scale
Requirements: 70B at Q8 = ~70GB VRAM for weights, plus ~50% overhead for KV cache and serving
Recommendation: 2× H100 80GB or 4× A100 80GB
- Low-latency inference
- Room for multiple model versions
- High throughput capacity
Next Steps
- Identify your workload using the categories above
- Calculate VRAM requirements for your target models
- Add appropriate buffers (20-50% depending on workload)
- Size RAM and CPU based on multiples
- Browse products in your budget tier
---
Browse by Budget:
- AI Servers Under $10,000
- AI Servers Under $25,000
- AI Servers Under $50,000
- AI Servers Under $100,000
Sizing guidelines based on published benchmarks and community testing. Actual requirements vary based on framework, optimization level, and specific model architecture.