diff --git a/docs/source/quantization_performance.mdx b/docs/source/quantization_performance.mdx
new file mode 100644
index 000000000..43989bee3
--- /dev/null
+++ b/docs/source/quantization_performance.mdx
@@ -0,0 +1,127 @@
# Quantization and Energy Efficiency

Quantization is often assumed to universally reduce energy consumption by lowering memory bandwidth requirements. However, systematic benchmarking shows that **the relationship between quantization and energy efficiency is more nuanced**. This guide helps you understand when quantization improves energy efficiency — and when it may not.

## INT8 Quantization (LLM.int8())

### How mixed-precision decomposition affects energy

The default `LLM.int8()` implementation uses a mixed-precision decomposition scheme (`llm_int8_threshold=6.0`) that routes outlier features through FP16 while quantizing the remaining features to INT8. This design preserves model accuracy but introduces data-movement overhead from repeated INT8↔FP16 type conversions.

**Measured impact on energy consumption (RTX 4090D, batch size=1):**

| Model | FP16 Energy (J/1k tok) | INT8 Default Energy (J/1k tok) | Energy Change |
|---|---|---|---|
| Yi-1.5-6B | 4,716 | 6,258 | **+32.7%** |
| Mistral-7B | 5,661 | 7,401 | **+30.7%** |
| Phi-3-mini (3.8B) | 3,003 | 3,940 | **+31.2%** |
| Qwen2.5-7B | 5,217 | 6,127 | **+17.4%** |

The energy overhead is the cost of preserving accuracy. Perplexity measurements confirm that the default threshold works as intended:

| Configuration | Perplexity (Yi-1.5-6B) | Δ vs FP16 |
|---|---|---|
| FP16 (baseline) | 11.16 | — |
| INT8 Default (threshold=6.0) | 11.20 | **+0.33%** |
| INT8 Pure (threshold=0.0) | 14.00 | **+25.38%** |

### Why threshold=0.0 is not recommended

Setting `llm_int8_threshold=0.0` disables mixed-precision decomposition entirely, forcing all columns through INT8 quantization — including outlier activation channels that INT8 cannot represent accurately. While this eliminates the type-conversion overhead, it causes **significant accuracy degradation** (a +25% perplexity increase) that outweighs the marginal energy savings (−3%).

```python
# ✅ Recommended: default threshold preserves accuracy
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)
# llm_int8_threshold defaults to 6.0

# ❌ Not recommended for quality-sensitive workloads
config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # Significant accuracy loss
)
```

### When to use INT8 vs FP16

If your primary concern is **accuracy**: use default INT8 (`threshold=6.0`). The +0.33% perplexity increase is negligible for most applications.

If your primary concern is **energy efficiency**: consider using FP16 instead of INT8 when GPU memory allows. FP16 avoids the mixed-precision decomposition overhead while maintaining full model accuracy.

If your primary concern is **memory**: INT8 reduces memory usage by approximately 45% compared to FP16 (e.g., 6.7 GB vs 12.1 GB for Yi-1.5-6B), making it valuable when models need to fit within GPU memory constraints.

## NF4 Quantization

### Small model overhead

For models smaller than approximately 5 billion parameters on fast GPUs, NF4 quantization can **increase** energy consumption despite reducing memory usage. This happens because the dequantization compute cost outweighs the memory-bandwidth savings when the model already fits comfortably in GPU memory.

**Measured impact (RTX 5090, batch size=1):**

| Model | FP16 Energy (J/1k tok) | NF4 Energy (J/1k tok) | Energy Change |
|---|---|---|---|
| TinyLlama-1.1B | 1,659 | 2,098 | **+26.5%** |
| Qwen2-1.5B | 2,411 | 3,120 | **+29.4%** |
| Qwen2.5-3B | 3,383 | 3,780 | **+11.7%** |
| Qwen2-7B | 5,509 | 4,878 | **−11.4%** |

### Crossover point

Energy savings from NF4 quantization begin at approximately **5 billion parameters**, a crossover validated on both RTX 5090 (Blackwell) and RTX 4090D (Ada Lovelace) architectures. For models above this threshold, NF4 consistently reduces energy consumption:

**RTX 4090D results (models ≥6B):**

| Model | NF4 Energy Change vs FP16 |
|---|---|
| Yi-1.5-6B | **−30.2%** |
| Mistral-7B | **−34.5%** |
| Qwen2.5-7B | **−32.7%** |
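
Given this crossover, enabling NF4 is a small configuration change. A minimal sketch (the model id and compute dtype below are illustrative; adjust them to your setup):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # match your hardware's preferred 16-bit dtype
)

# Illustrative model id: any causal LM of roughly >=5B parameters falls in
# the regime where NF4 saved 30-35% energy in the benchmarks above.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=nf4_config,
    device_map="auto",
)
```

For models below the ~5B threshold, loading in FP16 instead (`torch_dtype=torch.float16`, with no `quantization_config`) avoids the dequantization overhead measured above.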

## Batch size impact

Energy efficiency improves dramatically with larger batch sizes. Single-request inference (batch size=1) wastes significant GPU capacity:

**A800 + Mistral-7B + Pure INT8 (threshold=0.0):**

| Batch Size | Energy per Request (J) | Δ vs BS=1 | GPU Utilization |
|---|---|---|---|
| 1 | 1,768 | — | 45% |
| 8 | 284 | −84% | 50% |
| 16 | 205 | −88% | 77% |
| 64 | 76 | −96% | 91% |

For production deployments, using a batch size of 8 or more provides the largest energy reduction regardless of quantization configuration.

## Configuration guidelines

### By priority

**Memory-constrained** (model doesn't fit in FP16):
- Use NF4 for ≥5B parameter models
- Use INT8 when NF4 is not available or when you need higher accuracy than NF4 provides

**Accuracy-first** (most production workloads):
- Use default INT8 (`threshold=6.0`) — only a +0.33% perplexity increase
- Or use FP16 if memory allows

**Energy-first** (cost-sensitive batch processing):
- Use FP16 when memory allows (avoids INT8 mixed-precision overhead)
- Use NF4 for models ≥5B parameters (best energy efficiency)
- Maximize batch size (BS≥8 gives an 84%+ energy reduction vs BS=1)

### By model size

| Model Size | Recommended for Energy Efficiency |
|---|---|
| < 3B parameters | FP16 (quantization adds overhead on fast GPUs) |
| 3B–5B parameters | FP16 or NF4 (test on your hardware) |
| ≥ 5B parameters | NF4 (consistent energy savings of 30–35%) |

## Methodology

All measurements were collected using NVML-based power monitoring at a 10 Hz sampling rate, with n=10 repetitions per configuration and a coefficient of variation below 3%. Hardware platforms: RTX 5090 (Blackwell), RTX 4090D (Ada Lovelace), and A800 (Ampere). Perplexity was measured on WikiText-2 (test split).

Full benchmark data, scripts, and an interactive dashboard are available at:
- [Benchmark repository](https://github.com/hongping-zh/ecocompute-ai)
- [Interactive dashboard](https://hongping-zh.github.io/ecocompute-dynamic-eval/)
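
The NVML polling described in the methodology can be approximated with the `pynvml` bindings. This is a minimal sketch, not the benchmark's actual harness; `measure_energy_joules`, `workload`, and `sample_hz` are illustrative names:

```python
import threading
import time

import pynvml


def measure_energy_joules(workload, device_index=0, sample_hz=10):
    """Sample GPU power while `workload()` runs, then integrate to joules."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []  # (timestamp in seconds, power in watts)
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.monotonic(), watts))
            time.sleep(1.0 / sample_hz)

    thread = threading.Thread(target=sampler)
    thread.start()
    try:
        workload()  # e.g., lambda: model.generate(**inputs, max_new_tokens=256)
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()

    # Trapezoidal integration of the power trace
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
```

Dividing the result by the number of generated tokens (in thousands) yields figures comparable to the J/1k tok columns above; the linked repository contains the harness actually used for these measurements.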