Memory footprint by quantisation level — Q1 through FP32
| Quantisation (INT) | Full Precision | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | Params | Q1 ↕ | Q2 ↕ | Q3 ↕ | Q4 ↕ | Q5 ↕ | Q6 ↕ | Q7 ↕ | Q8 ↕ | FP16 ↕ | FP32 ↕ |
| 1.6 bit | 2.5 bit | 3.5 bit | 4.5 bit | 5.5 bit | 6.5 bit | 7.5 bit | 8.5 bit | 16 bit | 32 bit | ||
Each weight is stored as a fixed number of bits depending on quantisation level. Dividing by 8 converts bits → bytes, then dividing by 1.073×10⁹ converts to GB. A flat 18% overhead is added to account for the KV cache, activation buffers, and CUDA runtime costs at a typical inference context length.