TL;DR: For LLM inference, VRAM is almost everything. Multiply your model's parameter count (in billions) by 2 to estimate FP16 memory in GB (7B model = ~14 GB VRAM minimum). For training and fine-tuning, you need 4-8x more VRAM than inference. System RAM should be at least 2x your VRAM for comfortable operation.
---
The Question Everyone Gets Wrong
"I need an AI server. What specs do I need?"
This question gets asked constantly, and the answers are usually either too vague ("it depends on your workload") or too specific ("get an H100"). Neither helps someone trying to make their first hardware purchase.
The truth is that AI hardware sizing follows predictable rules once you understand what's actually happening in memory. This guide provides the framework to size your own requirements—plus specific product recommendations from 690 servers and workstations in the AI Hardware Index.
Understanding VRAM: The Limiting Factor
For AI workloads, VRAM (GPU memory) is almost always the bottleneck. Here's why:
- Model weights must fit in VRAM for efficient inference
- Activations and gradients for training multiply memory requirements
- Batch size scales directly with available memory
- Context length for LLMs consumes additional VRAM
The Basic Formula
For inference (running a trained model):
- FP16 (half precision): Parameters × 2 bytes = minimum VRAM
- INT8 (8-bit quantization): Parameters × 1 byte = minimum VRAM
- INT4 (4-bit quantization): Parameters × 0.5 bytes = minimum VRAM
| Model Size | FP16 VRAM | INT8 VRAM | INT4 VRAM |
|---|---|---|---|
| 7B parameters | 14 GB | 7 GB | 3.5 GB |
| 13B parameters | 26 GB | 13 GB | 6.5 GB |
| 30B parameters | 60 GB | 30 GB | 15 GB |
| 70B parameters | 140 GB | 70 GB | 35 GB |
Important caveat: These are minimums for the model weights alone. Actual requirements are 20-40% higher due to:
- KV cache for context (grows with context length)
- CUDA overhead and memory fragmentation
- Framework overhead (PyTorch, vLLM, etc.)
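As a quick sanity check, the rule of thumb above fits in a few lines of Python. This is a minimal sketch, not a framework-specific calculation; the 30% overhead factor is an assumption chosen to sit inside the 20-40% range noted in the caveat.

```python
# Minimal sketch: estimate inference VRAM from parameter count and precision.
# Bytes-per-parameter values match the table above; the overhead factor is an
# assumption (pick something in the 20-40% range from the caveat).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_vram_gb(params_billion: float, precision: str = "fp16",
                      overhead: float = 0.30) -> float:
    """Weights plus an allowance for KV cache and framework overhead, in GB."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead)

for size in (7, 13, 30, 70):
    print(f"{size}B  int4: ~{inference_vram_gb(size, 'int4'):5.1f} GB   "
          f"fp16: ~{inference_vram_gb(size, 'fp16'):6.1f} GB")
```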
Workload Categories and Requirements
Category 1: LLM Inference (Running Models)
Running pre-trained LLMs for chat, completion, or embedding generation.
VRAM Requirements:
| Use Case | Model Size | Minimum VRAM | Recommended VRAM |
|---|---|---|---|
| Personal assistant | 7B (Q4) | 6 GB | 8-12 GB |
| Code completion | 13B-34B (Q4) | 12 GB | 16-24 GB |
| Production API | 70B (Q4) | 40 GB | 48-80 GB |
| Enterprise RAG | 70B (FP16) | 160 GB | 192+ GB |
System RAM: 2x VRAM minimum (for model loading, preprocessing)
CPU: Less critical—4-8 cores sufficient for most inference
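Before buying anything, it is worth checking how much headroom an existing GPU actually has. A minimal sketch, assuming PyTorch is installed and a CUDA device is visible; the 13B/Q4 requirement figure comes from the table above:

```python
# Compare free VRAM on the current GPU against a rough requirement.
# Assumes PyTorch with a CUDA-capable GPU; adjust required_gb for your model.
import torch

required_gb = 13 * 0.5 * 1.3                # 13B at INT4 plus ~30% overhead
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb, total_gb = free_bytes / 1e9, total_bytes / 1e9

print(f"GPU memory: {free_gb:.1f} GB free of {total_gb:.1f} GB")
if free_gb < required_gb:
    print(f"~{required_gb:.0f} GB needed: consider a smaller model or lower precision")
```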
Category 2: Fine-Tuning (LoRA/QLoRA)
Adapting pre-trained models to your specific data using parameter-efficient methods.
VRAM Requirements:
| Model Size | LoRA (INT8 base) | QLoRA (INT4 base) |
|---|---|---|
| 7B | 16-24 GB | 8-16 GB |
| 13B | 32-40 GB | 16-24 GB |
| 30B | 80+ GB | 40-48 GB |
| 70B | 160+ GB | 80+ GB |
Fine-tuning requires storing:
- Base model weights (frozen)
- LoRA adapter weights (trainable)
- Optimizer states
- Gradients for trainable parameters
- Batch of training examples
System RAM: 4x VRAM recommended (dataset preprocessing, checkpointing)
CPU: 8-16 cores for data loading and preprocessing
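Those components can be added up explicitly. The sketch below is a back-of-envelope estimate under assumptions not stated in the tables above: adapter parameters at roughly 1% of the base model, AdamW states kept in FP32, and a flat activation allowance. In practice, activation memory scales with batch size and sequence length and often dominates, which is why the table figures run higher.

```python
# Back-of-envelope LoRA/QLoRA VRAM estimate (all figures in GB).
# Assumptions: adapter ~1% of base parameters, AdamW momentum + variance in
# fp32 (8 bytes per trainable param), flat activation allowance.

def finetune_vram_gb(params_billion: float, base_bytes_per_param: float,
                     adapter_fraction: float = 0.01,
                     activation_gb: float = 8.0) -> float:
    base      = params_billion * base_bytes_per_param        # frozen base weights
    adapter   = params_billion * adapter_fraction * 2        # trainable adapter (fp16)
    grads     = adapter                                       # gradients, same size
    optimizer = params_billion * adapter_fraction * 8         # AdamW m + v in fp32
    return base + adapter + grads + optimizer + activation_gb

for label, bytes_per in (("LoRA  (INT8 base)", 1.0), ("QLoRA (INT4 base)", 0.5)):
    print(f"13B {label}: ~{finetune_vram_gb(13, bytes_per):.0f} GB")
```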
Category 3: Full Training (From Scratch)
Training models from random initialization or continued pre-training.
VRAM Requirements:
| Model Size | Single GPU | Multi-GPU (Data Parallel) |
|---|---|---|
| 1B | 24-48 GB | 16 GB × 4 |
| 7B | 80+ GB | 48 GB × 4-8 |
| 13B+ | 160+ GB | 80 GB × 8+ |
Full training requires 4-8x more memory than inference due to:
- Optimizer states (Adam: 2x model size)
- Gradients (same size as model)
- Activations (scales with batch size and sequence length)
System RAM: 8x VRAM minimum for checkpoint saving
CPU: 16-32 cores for preprocessing pipelines
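A rough way to see where the 4-8x multiplier comes from: with mixed-precision training and Adam, every parameter carries about 16 bytes of persistent state before a single activation is stored. A minimal sketch of that arithmetic (the per-parameter byte counts are standard mixed-precision bookkeeping, not figures from the tables above):

```python
# Persistent training state per parameter with mixed precision + Adam:
#   2 B fp16 weights + 2 B fp16 gradients + 4 B fp32 master weights
#   + 4 B Adam momentum + 4 B Adam variance = 16 B per parameter.
# Activations come on top and scale with batch size and sequence length.

def training_state_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    return params_billion * bytes_per_param

for size in (1, 7, 13):
    print(f"{size:>2}B model: ~{training_state_gb(size):.0f} GB of weight/gradient/"
          f"optimizer state, before activations")
```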
Category 4: Computer Vision
Image classification, object detection, segmentation, generation.
| Task | Model | Minimum VRAM | Recommended |
|---|---|---|---|
| Classification (ResNet) | ResNet-50 | 4 GB | 8 GB |
| Detection (YOLO) | YOLOv8x | 8 GB | 12 GB |
| Segmentation | SAM | 8 GB | 16 GB |
| Image Generation | SDXL | 12 GB | 16-24 GB |
| Video Generation | SVD | 24 GB | 48 GB |
Vision models are generally more VRAM-efficient than LLMs at equivalent capability levels.
Hardware Tiers and Recommendations
Based on the requirements above, here's how to match workloads to hardware:
Tier 1: Entry-Level ($3,000-$8,000)
Best for: Personal inference, small model fine-tuning, vision prototyping
| Configuration | VRAM | Capable Of |
|---|---|---|
| 1× RTX 4090 | 24 GB | 13B inference (Q4), 7B fine-tuning |
| 1× RTX 5090 | 32 GB | 30B inference (Q4), 13B fine-tuning |
| 2× RTX 4090 | 48 GB | 30B inference (Q8), 13B fine-tuning |
Tier 2: Professional ($15,000-$40,000)
Best for: Production inference, 70B fine-tuning, multi-model serving
| Configuration | VRAM | Capable Of |
|---|---|---|
| 2× RTX 5090 | 64 GB | 70B inference (Q4), 30B fine-tuning |
| 4× RTX 4090 | 96 GB | 70B inference (Q8), enterprise RAG |
| 1× A100 80GB | 80 GB | 70B inference (Q4), training small models |
Tier 3: Enterprise ($50,000-$200,000)
Best for: 70B training, multi-tenant inference, research
| Configuration | VRAM | Capable Of |
|---|---|---|
| 4× A100 80GB | 320 GB | 70B fine-tuning, 7B training |
| 8× A100 80GB | 640 GB | 70B training, multi-model clusters |
| 4× H100 80GB | 320 GB | Fast 70B fine-tuning, production inference |
Tier 4: Hyperscale ($200,000+)
Best for: Foundation model training, large-scale inference clusters
| Configuration | VRAM | Capable Of |
|---|---|---|
| 8× H100 SXM | 640 GB | 70B+ training, production LLM serving |
| HGX B200 (8 GPU) | 1.5+ TB | 400B+ training, frontier models |
| HGX B300 (8 GPU) | 2+ TB | Cutting-edge research |
The Sizing Calculator
Use this decision tree to find your starting point:
Step 1: Identify Your Primary Workload
- Inference only → Go to Step 2A
- Fine-tuning (LoRA) → Go to Step 2B
- Full training → Go to Step 2C
Step 2A: Inference Sizing
What's your target model size?
| Model Size | With Quantization (Q4) | Without Quantization (FP16) |
|---|---|---|
| 7B | 8 GB VRAM | 16 GB VRAM |
| 13B | 12 GB VRAM | 32 GB VRAM |
| 30B | 24 GB VRAM | 64 GB VRAM |
| 70B | 48 GB VRAM | 160 GB VRAM |
Add 20-30% buffer for KV cache and overhead.
Step 2B: Fine-Tuning Sizing
What's your target model size?
| Model Size | QLoRA (INT4 base) | LoRA (INT8 base) |
|---|---|---|
| 7B | 12 GB VRAM | 24 GB VRAM |
| 13B | 20 GB VRAM | 40 GB VRAM |
| 30B | 48 GB VRAM | 80 GB VRAM |
| 70B | 80+ GB VRAM | 160+ GB VRAM |
Step 2C: Training Sizing
Rule of thumb: 4-8× the FP16 model size for optimizer states and gradients.
| Model Size | Minimum VRAM | Recommended Multi-GPU |
|---|---|---|
| 1B | 32 GB | 24 GB × 2 |
| 7B | 160 GB | 80 GB × 4 |
| 13B+ | 320+ GB | 80 GB × 8 |
Step 3: System RAM
Take your total VRAM and multiply:
- Inference: VRAM × 2 = minimum RAM
- Fine-tuning: VRAM × 3-4 = minimum RAM
- Training: VRAM × 4-8 = minimum RAM
Step 4: CPU Cores
- Inference: 4-8 cores (not the bottleneck)
- Fine-tuning: 8-16 cores (data loading)
- Training: 16-32 cores (preprocessing pipelines)
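The whole decision tree can be collapsed into one small function. This is a minimal sketch of the rules of thumb in Steps 2-4, not a precise planner; the multipliers are the same rough numbers used above.

```python
# Rough sizing for a target model, following Steps 1-4 above.
# Multipliers mirror the rule-of-thumb tables; treat the output as a
# starting point, not a guarantee.

def size_system(params_billion: float, workload: str, quantized: bool = True) -> dict:
    if workload == "inference":
        vram = params_billion * (0.5 if quantized else 2.0) * 1.3   # Q4/FP16 + buffer
        ram_mult, cores = 2, "4-8"
    elif workload == "finetune":
        vram = params_billion * (1.5 if quantized else 3.0)          # ~QLoRA vs ~LoRA
        ram_mult, cores = 4, "8-16"
    elif workload == "training":
        vram = params_billion * 16 + 40                               # state + activations
        ram_mult, cores = 8, "16-32"
    else:
        raise ValueError(f"unknown workload: {workload!r}")
    return {"min_vram_gb": round(vram), "min_ram_gb": round(vram * ram_mult),
            "cpu_cores": cores}

print(size_system(70, "inference"))              # 70B at Q4, close to the Step 2A figure
print(size_system(13, "finetune", quantized=False))   # 13B LoRA on an INT8 base
```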
Common Mistakes to Avoid
Mistake 1: Underestimating VRAM for Context
LLM context windows consume VRAM that scales with sequence length:
| Context Length | Additional VRAM (70B model) |
|---|---|
| 4K tokens | ~2 GB |
| 16K tokens | ~8 GB |
| 32K tokens | ~16 GB |
| 128K tokens | ~64 GB |
If you're running long-context applications, add this to your base model requirements.
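The KV cache can also be estimated directly from the model architecture. A minimal sketch, assuming a Llama-2-70B-style layout (80 layers, grouped-query attention with 8 KV heads, head dimension 128, FP16 cache); these architectural values are assumptions, and models without grouped-query attention need several times more.

```python
# KV cache per sequence: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes.
# Architecture values below are assumptions for a Llama-2-70B-style model with GQA.

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

for ctx in (4_096, 16_384, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

Multiply by the number of concurrent sequences when batching; the table above quotes rounder, more conservative figures.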
Mistake 2: Ignoring Multi-GPU Interconnect
For multi-GPU setups, interconnect bandwidth matters:
- PCIe 4.0 x16: 32 GB/s per GPU (fine for inference)
- PCIe 5.0 x16: 64 GB/s per GPU (better for training)
- NVLink: 600-900 GB/s (required for efficient multi-GPU training)
Consumer GPUs (RTX 4090/5090) are PCIe-only. For serious multi-GPU training, look for NVLink-capable systems (A100, H100).
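To see why the link matters, consider the time to move a single set of FP16 gradients for a 7B model across each interconnect. A back-of-envelope sketch using the peak bandwidth figures above (sustained all-reduce throughput is lower):

```python
# Time to move one full set of fp16 gradients for a 7B model over each link.
# Bandwidths are the peak figures quoted above; real throughput is lower.

grad_gb = 7 * 2                      # 7B params x 2 bytes = 14 GB of gradients
links_gbps = {"PCIe 4.0 x16": 32, "PCIe 5.0 x16": 64, "NVLink (H100)": 900}

for name, bw in links_gbps.items():
    print(f"{name:>13}: ~{grad_gb / bw * 1000:.0f} ms per gradient exchange")
```

That exchange happens on every training step, which is why PCIe-only multi-GPU rigs are fine for inference but scale poorly for training.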
Mistake 3: Treating RAM as an Afterthought
Insufficient system RAM causes:
- Model loading failures
- Slow checkpoint saving
- Dataset preprocessing bottlenecks
- OOM kills during batch operations
Mistake 4: Overspending on CPU
AI workloads are GPU-bound. A 64-core CPU won't help if you're VRAM-limited. Spend the budget on GPU instead.
Putting It Together: Example Builds
Example 1: Startup MVP ($5,000)
Goal: Run Llama 7B/13B for internal tools
Requirements: 13B at Q4 = 12GB VRAM minimum
Recommendation: Single RTX 4090 workstation (24GB)
- Handles up to 30B quantized
- Room for SDXL and vision models
- Upgrade path to second GPU
Example 2: ML Team Shared Server ($30,000)
Goal: Fine-tune models up to 30B, serve multiple users
Requirements: 30B fine-tuning (QLoRA) = 48GB+ VRAM; multi-user serving = 2x buffer
Recommendation: 2× A6000 (96GB total) or 4× RTX 4090 (96GB)
- Sufficient for parallel fine-tuning jobs
- Can serve 70B for inference
- Enterprise-grade reliability
Example 3: Production Inference ($80,000)
Goal: Serve 70B model at production scale
Requirements: 70B at Q8 = ~70GB VRAM for weights, plus ~50% overhead for KV cache and serving
Recommendation: 2× H100 80GB or 4× A100 80GB
- Low-latency inference
- Room for multiple model versions
- High throughput capacity
Next Steps
- Identify your workload using the categories above
- Calculate VRAM requirements for your target models
- Add appropriate buffers (20-50% depending on workload)
- Size RAM and CPU based on multiples
- Browse products in your budget tier
---
Browse by Budget:
- AI Servers Under $10,000
- AI Servers Under $25,000
- AI Servers Under $50,000
- AI Servers Under $100,000
Sizing guidelines based on published benchmarks and community testing. Actual requirements vary based on framework, optimization level, and specific model architecture.