LLM VRAM Requirements

How It's Calculated

VRAM Estimation Formula

The Formula

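VRAM (GB) = params × bits_per_weight ÷ 8 × 1.18 ÷ 1,073,741,824

Where:
params = total parameter count
bits_per_weight = average bits stored per weight at the chosen quant level (see table below)
÷ 8 = converts bits to bytes
1.18 = +18% overhead (KV cache + activations)
1,073,741,824 = bytes per GB (2³⁰)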

Each weight occupies a fixed average number of bits that depends on the quantisation level. Multiplying by the parameter count gives the total bits, dividing by 8 converts bits to bytes, and dividing by 1,073,741,824 (2³⁰, about 1.074×10⁹) converts bytes to GB. A flat 18% overhead is added to account for the KV cache, activation buffers, and CUDA runtime costs at a typical inference context length.
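The formula can be sketched in Python as follows. This is a minimal illustration, not the page's own calculator code: the function name `estimate_vram_gb`, its parameter names, and the default `overhead=0.18` are assumptions chosen to mirror the terms above.

```python
BYTES_PER_GB = 1_073_741_824  # 2**30, the divisor used in the text


def estimate_vram_gb(params: float, bits_per_weight: float, overhead: float = 0.18) -> float:
    """Estimate VRAM in GB from a parameter count and a bits-per-weight value.

    `overhead` is the flat fraction added for KV cache and activations
    (0.18 mirrors the 18% used in the formula; the name is hypothetical).
    """
    weight_bytes = params * bits_per_weight / 8   # total bits -> bytes
    total_bytes = weight_bytes * (1 + overhead)   # add the flat overhead
    return total_bytes / BYTES_PER_GB             # bytes -> GB


# Example: an 8-billion-parameter model at Q4 (4.5 bits per weight)
print(f"{estimate_vram_gb(8e9, 4.5):.2f} GB")
```

Passing `overhead=0.0` returns the size of the weights alone, which is useful when comparing against figures that exclude runtime overhead.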

Worked Example — Llama-3.1-8B at Q4

params = 8,000,000,000
bits_per_weight = 4.5 (Q4)

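raw bytes = 8.0×10⁹ × 4.5 ÷ 8
= 4,500,000,000 bytes

with 18% overhead = 4,500,000,000 × 1.18
= 5,310,000,000 bytes

in GB = 5,310,000,000 ÷ 1,073,741,824 (bytes per GB)
≈ 4.95 GB

Note: the table shows 4.2 GB for Q4, which appears to correspond to the raw weight size before overhead (4,500,000,000 ÷ 1,073,741,824 ≈ 4.19 GB).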

⚡ MoE models (Mixtral, DeepSeek-V3, Llama 4): all expert weights must be loaded into VRAM even though only a subset activates per token. VRAM is estimated using the total parameter count, not the active count.
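To see why the total count matters, the sketch below compares the estimate from the total parameter count against the (misleading) estimate from the active count. The Mixtral 8x7B figures of roughly 46.7B total and 12.9B active parameters are approximate published numbers brought in as an assumption; they do not come from this page. The helper name `vram_gb` is also hypothetical.

```python
BYTES_PER_GB = 1_073_741_824  # 2**30


def vram_gb(params: float, bits_per_weight: float, overhead: float = 0.18) -> float:
    """Same estimate as the formula: params x bits / 8, plus flat overhead, in GB."""
    return params * bits_per_weight / 8 * (1 + overhead) / BYTES_PER_GB


# Assumed, approximate figures for Mixtral 8x7B (not taken from this page)
total_params = 46.7e9   # every expert must be resident in VRAM
active_params = 12.9e9  # parameters actually used per token

needed = vram_gb(total_params, 4.5)       # what the estimate should use
misleading = vram_gb(active_params, 4.5)  # underestimates the real footprint

print(f"Q4 estimate from total params:  {needed:.1f} GB")
print(f"Q4 estimate from active params: {misleading:.1f} GB (too low)")
```

The active count affects compute per token, not how much memory the weights occupy, so only the first figure is a usable VRAM estimate.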

Bits Per Weight by Quant Level

Level	Bits / Weight	Bytes / Weight	Relative Size
Q1	1.625	0.203	~5% of FP32
Q2	2.5	0.313	~8% of FP32
Q3	3.5	0.438	~11% of FP32
Q4	4.5	0.563	~14% of FP32
Q5	5.5	0.688	~17% of FP32
Q6	6.5	0.813	~20% of FP32
Q7	7.5	0.938	~23% of FP32
Q8	8.5	1.063	~27% of FP32
FP16	16	2.000	50% of FP32
FP32	32	4.000	100% (full)
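The Bytes / Weight and Relative Size columns follow directly from the bits-per-weight values: divide by 8 for bytes, and divide by 32 for the share of FP32. A short sketch that recomputes them, with the dictionary and function names chosen here for illustration only:

```python
# Bits per weight for each level, copied from the table above
BITS_PER_WEIGHT = {
    "Q1": 1.625, "Q2": 2.5, "Q3": 3.5, "Q4": 4.5, "Q5": 5.5,
    "Q6": 6.5, "Q7": 7.5, "Q8": 8.5, "FP16": 16, "FP32": 32,
}


def table_row(level: str) -> tuple[float, float]:
    """Return (bytes per weight, size as a percentage of FP32) for a quant level."""
    bits = BITS_PER_WEIGHT[level]
    bytes_per_weight = bits / 8        # 8 bits per byte
    percent_of_fp32 = bits / 32 * 100  # FP32 stores 32 bits per weight
    return bytes_per_weight, percent_of_fp32


for level, bits in BITS_PER_WEIGHT.items():
    bytes_per_weight, percent = table_row(level)
    print(f"{level}\t{bits}\t{bytes_per_weight:.4f}\t{percent:.1f}% of FP32")
```

Because the relative size depends only on the bit width, the same percentages apply to any model regardless of its parameter count.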

		Quantisation (INT)								Full Precision
Model	Params	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	FP16	FP32
		1.6 bit	2.5 bit	3.5 bit	4.5 bit	5.5 bit	6.5 bit	7.5 bit	8.5 bit	16 bit	32 bit