[Performance/Energy] 4-bit NF4 shows significant energy efficiency penalty on Blackwell (RTX 5090) for small models #1851

@hongping-zh

Description

Hi bitsandbytes team,

I'm working on an open-source Green AI framework called EcoCompute AI, focused on auditing the energy and economic impact of LLMs.

During our benchmarking on the new NVIDIA RTX 5090 (Blackwell, sm_120), we observed a significant "Energy Efficiency Paradox" when using 4-bit NF4 quantization on models under 3B parameters.

Environment

  • GPU: NVIDIA GeForce RTX 5090 (32GB GDDR7, Blackwell sm_120)
  • PyTorch: 2.10.0+cu128
  • CUDA: 12.8
  • bitsandbytes: 0.49.1
  • transformers: 5.0.0
  • OS: Ubuntu 22.04 (AutoDL Cloud)

The Data (RTX 5090)

Model            Config      Throughput (tok/s)   Avg Power (W)   Energy (J/1k tokens)   Energy vs. FP16
TinyLlama-1.1B   FP16        94.87                157.45          1659.00                baseline
TinyLlama-1.1B   4-bit NF4   55.79                117.02          2098.44                +26.5% ⚠️
Qwen2-1.5B       FP16        71.45                172.30          2411.09                baseline
Qwen2-1.5B       4-bit NF4   41.57                129.83          3120.49                +29.4% ⚠️
Qwen2-7B         FP16        70.47                388.34          5508.56                baseline
Qwen2-7B         4-bit NF4   41.40                201.88          4877.88                -11.4%

Key Findings

1. The Energy Trap

For models ≤1.5B parameters, although 4-bit reduced average power by ~25%, the ~41% drop in throughput (due to de-quantization overhead) resulted in 26-29% MORE energy consumed per token compared to FP16.
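
For reference, the Energy column can be reproduced from the other two columns. A minimal sketch of that relationship (an assumption on my part: energy per 1k tokens = average power / throughput × 1000; it matches the table to within rounding of the reported values):

# Back-of-envelope check of the Energy column.
# Assumption: J per 1k tokens = average power [W] / throughput [tok/s] * 1000.
rows = {
    "TinyLlama-1.1B FP16":      (94.87, 157.45),
    "TinyLlama-1.1B 4-bit NF4": (55.79, 117.02),
    "Qwen2-1.5B FP16":          (71.45, 172.30),
    "Qwen2-1.5B 4-bit NF4":     (41.57, 129.83),
    "Qwen2-7B FP16":            (70.47, 388.34),
    "Qwen2-7B 4-bit NF4":       (41.40, 201.88),
}
for name, (tok_s, watts) in rows.items():
    # W = J/s, so J/token = W / (tok/s); scale by 1000 tokens
    joules_per_1k = watts / tok_s * 1000.0
    print(f"{name:28s} {joules_per_1k:8.1f} J / 1k tokens")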

2. The Crossover Point

The energy benefit only becomes positive when the model scales to ~7B parameters, where memory bandwidth savings outweigh the de-quantization penalty.

3. Throughput Consistency

Across all model sizes, 4-bit NF4 throughput is consistently ~58-59% of FP16 throughput. This ratio appears to be an inherent characteristic of the current implementation.
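
Because that throughput ratio is essentially fixed at ~0.58-0.59, the crossover in finding 2 reduces to a simple break-even condition: with energy per token = power / throughput, 4-bit only saves energy once its power ratio falls below its throughput ratio. A back-of-envelope sketch using the table numbers above (no new measurements):

# Break-even sketch: E = P / T, so the 4-bit/FP16 energy ratio is
# (P_4bit / P_fp16) / (T_4bit / T_fp16). Quantization saves energy only when
# the power ratio drops below the throughput ratio (~0.58-0.59 here).
data = {
    "TinyLlama-1.1B": ((94.87, 157.45), (55.79, 117.02)),
    "Qwen2-1.5B":     ((71.45, 172.30), (41.57, 129.83)),
    "Qwen2-7B":       ((70.47, 388.34), (41.40, 201.88)),
}
for model, ((t_fp16, p_fp16), (t_4bit, p_4bit)) in data.items():
    power_ratio = p_4bit / p_fp16
    tput_ratio = t_4bit / t_fp16
    energy_ratio = power_ratio / tput_ratio  # >1 means 4-bit costs more energy per token
    print(f"{model:14s} power {power_ratio:.2f} vs throughput {tput_ratio:.2f} "
          f"-> energy x{energy_ratio:.2f}")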

Potential Impact

On ultra-fast hardware like the RTX 5090, blind quantization of small models might inadvertently increase the carbon footprint of AI deployments. This is counterintuitive for users who assume quantization always saves energy.

Questions for the Team

  1. Is this throughput penalty a known bottleneck for the Blackwell architecture's de-quantization kernels?
  2. Are there plans to optimize bitsandbytes for sm_120 to bridge this throughput gap?
  3. Would it be possible to add a warning or recommendation in the documentation about the energy trade-off for small models on high-performance GPUs?

Reproduction Code

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def benchmark(model_id, use_4bit=False):
    kwargs = {"torch_dtype": torch.float16}
    if use_4bit:
        kwargs = {"quantization_config": BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
        )}

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **kwargs)

    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

    # Warmup so kernel initialization does not skew the timing
    model.generate(**inputs, max_new_tokens=20)
    torch.cuda.synchronize()

    # Benchmark: count generated tokens so the script reports tok/s, not just wall time
    new_tokens = 0
    start = time.time()
    for _ in range(10):
        out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
        new_tokens += out.shape[-1] - inputs["input_ids"].shape[-1]
    torch.cuda.synchronize()
    elapsed = time.time() - start

    label = "4-bit" if use_4bit else "FP16"
    print(f"{label}: {elapsed:.2f}s, {new_tokens / elapsed:.1f} tok/s")


# Run on TinyLlama-1.1B
benchmark("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_4bit=False)
benchmark("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_4bit=True)

Additional Context

  • Full technical report available at: [TechRxiv Link - to be added]
  • EcoCompute AI repository: [GitHub Link - to be added]
  • Happy to provide more telemetry data (power traces, memory profiles) if needed.

Thank you for your amazing work on bitsandbytes! 🙏
