gptq

What it does

Enables 4-bit quantization of large language models, reducing memory by 4× and boosting inference speed on consumer GPUs with minimal accuracy loss.

Part of ovachiever/droid-tings (370 items)


Skill Details

SKILL.md

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

Overview

# GPTQ (Generative Pre-trained Transformer Quantization)

Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.

When to use GPTQ

Use GPTQ when:

  • Need to fit large models (70B+) on limited GPU memory
  • Want 4× memory reduction with <2% accuracy loss
  • Deploying on consumer GPUs (RTX 4090, 3090)
  • Need faster inference (3-4× speedup vs FP16)

Use AWQ instead when:

  • Need slightly better accuracy (<1% loss)
  • Have newer GPUs (Ampere, Ada)
  • Want Marlin kernel support (2× faster on some GPUs)

Use bitsandbytes instead when:

  • Need simple integration with transformers
  • Want 8-bit quantization (less compression, better quality)
  • Don't need pre-quantized model files
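
For comparison, a minimal sketch of the bitsandbytes path through transformers (assumes bitsandbytes is installed and a CUDA GPU is available; no pre-quantized files are needed):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight-only loading; swap load_in_8bit for load_in_4bit for more compression
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```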

Quick start

Installation

```bash
# Install AutoGPTQ
pip install auto-gptq

# With Triton (Linux only, faster)
pip install auto-gptq[triton]

# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation

# Full installation
pip install auto-gptq transformers accelerate
```

Load pre-quantized model

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=False  # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```

Quantize your own model

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset

# Load tokenizer
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Group size (recommended: 128)
    desc_act=False,     # Activation order (False for CUDA kernel)
    damp_percent=0.01   # Dampening factor
)

# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config
)

# Prepare calibration data (tokenizer outputs with input_ids and attention_mask)
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]

# Quantize
model.quantize(calibration_data)

# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")

# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")
```

Group-wise quantization

How GPTQ works:

  1. Group weights: Divide each weight matrix into groups (typically 128 elements)
  2. Quantize per-group: Each group has its own scale/zero-point
  3. Minimize error: Uses Hessian information to minimize quantization error
  4. Result: 4-bit weights with near-FP16 accuracy

Group size trade-off:

| Group Size | Model Size | Accuracy | Speed | Recommendation |
|------------|------------|----------|-------|----------------|
| -1 (per-channel, no grouping) | Smallest | Lowest | Fastest | Not recommended |
| 32 | Largest | Best | Slower | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed critical |
| 1024 | Smallest | Lowest | Fastest | Not recommended |

Example:

```
Weight matrix: [1024, 4096] = 4.2M elements

Group size = 128:
  • Groups: 4.2M / 128 = 32,768 groups
  • Each group stores its own scale + zero-point for its 4-bit weights
  • Result: finer granularity → better accuracy
```
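
To make the grouping concrete, here is an illustrative round-to-nearest sketch in PyTorch. It only shows what the `group_size` knob controls; real GPTQ additionally uses Hessian information to choose the rounding, so treat the function name and structure as assumptions for illustration, not the library's implementation.

```python
import torch

def quantize_groupwise(weight: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Round-to-nearest group-wise quantization of a 2D weight matrix (illustrative only)."""
    out_features, in_features = weight.shape
    qmax = 2 ** bits - 1
    grouped = weight.reshape(out_features, in_features // group_size, group_size)

    # One asymmetric scale and zero-point per group
    w_min = grouped.amin(dim=-1, keepdim=True)
    w_max = grouped.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(grouped / scale + zero), 0, qmax)  # integer codes 0..15
    dequant = (q - zero) * scale                                   # what the kernel reconstructs
    return q.reshape_as(weight), scale, zero, dequant.reshape_as(weight)

w = torch.randn(1024, 4096)
q, scale, zero, w_hat = quantize_groupwise(w, group_size=128)
print("groups:", scale.numel(), "| mean abs error:", (w - w_hat).abs().mean().item())
```

With a [1024, 4096] matrix and group_size=128 this produces the 32,768 groups from the example above; smaller groups mean more scale/zero-point metadata (larger model) but a tighter fit per group.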

Quantization configurations

Standard 4-bit (recommended)

```python
from auto_gptq import BaseQuantizeConfig

config = BaseQuantizeConfig(
    bits=4,             # 4-bit quantization
    group_size=128,     # Standard group size
    desc_act=False,     # Faster CUDA kernel
    damp_percent=0.01   # Dampening factor
)
```

Performance:

  • Memory: 4× reduction (70B model: 140 GB → 35 GB)
  • Accuracy: ~1.5% perplexity increase
  • Speed: 3-4× faster than FP16

Maximum compression (3-bit)

```python
config = BaseQuantizeConfig(
    bits=3,             # 3-bit (more compression)
    group_size=128,     # Keep standard group size
    desc_act=True,      # Better accuracy (slower)
    damp_percent=0.01
)
```

Trade-off:

  • Memory: 5× reduction
  • Accuracy: ~3% perplexity increase
  • Speed: 5× faster (but less accurate)

Maximum accuracy (4-bit with small groups)

```python
config = BaseQuantizeConfig(
    bits=4,
    group_size=32,      # Smaller groups (better accuracy)
    desc_act=True,      # Activation reordering
    damp_percent=0.005  # Lower dampening
)
```

Trade-off:

  • Memory: 3.5× reduction (slightly larger)
  • Accuracy: ~0.8% perplexity increase (best)
  • Speed: 2-3× faster (kernel overhead)

Kernel backends

ExLlamaV2 (default, fastest)

```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_exllama=True,   # Use ExLlamaV2
    exllama_config={"version": 2}
)
```

Performance: 1.5-2× faster than Triton

Marlin (Ampere+ GPUs)

```python
# Quantize with Marlin format
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False  # Required for Marlin
)
model.quantize(calibration_data, use_marlin=True)

# Load with Marlin
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True  # 2× faster on A100/H100
)
```

Requirements:

  • NVIDIA Ampere or newer (A100, H100, RTX 40xx)
  • Compute capability ≥ 8.0 (see the check below)
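
A quick check of the requirement above, using plain PyTorch (no AutoGPTQ-specific API):

```python
import torch

# Marlin kernels need compute capability >= 8.0 (Ampere or newer)
major, minor = torch.cuda.get_device_capability(0)
status = "Marlin supported" if major >= 8 else "Marlin not supported"
print(f"Compute capability: {major}.{minor} -> {status}")
```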

Triton (Linux only)

```python
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_triton=True  # Linux only
)
```

Performance: 1.2-1.5× faster than the CUDA backend

Integration with transformers

Direct transformers usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model (transformers auto-detects GPTQ)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-GPTQ",
    device_map="auto",
    trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")

# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
```

QLoRA fine-tuning (GPTQ + LoRA)

```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB
```
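
Continuing from the adapters above, a hedged sketch of the training step itself with transformers.Trainer; `tokenizer` and a pre-tokenized `train_ds` (with input_ids) are assumed to exist and are not defined here:

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,                       # GPTQ base + LoRA adapters from above
    args=TrainingArguments(
        output_dir="llama2-7b-gptq-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        fp16=True,
        logging_steps=10,
    ),
    train_dataset=train_ds,            # assumed: a pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # assumed tokenizer
)
trainer.train()
model.save_pretrained("llama2-7b-gptq-lora")  # saves only the LoRA adapter weights
```

Only the LoRA adapter weights receive gradients; the quantized base stays frozen, which is what keeps memory low.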

Performance benchmarks

Memory reduction

| Model | FP16 | GPTQ 4-bit | Reduction |
|-------|------|------------|-----------|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3-405B | 810 GB | 203 GB | 4× |

Enables:

  • 70B on single A100 80GB (vs 2× A100 needed for FP16)
  • 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
  • 13B on RTX 4090 24GB (vs OOM with FP16)
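
The table's figures follow from simple arithmetic. A back-of-envelope sketch, assuming 4 bits per weight plus roughly 2.5 bytes of scale/zero-point metadata per group of 128 (the exact layout varies between checkpoints, and runtime adds KV cache and activations on top):

```python
def gptq_weight_gb(n_params: float, bits: int = 4, group_size: int = 128) -> float:
    # 4-bit packed weights plus per-group quantization metadata
    weight_bytes = n_params * bits / 8
    group_overhead = (n_params / group_size) * 2.5
    return (weight_bytes + group_overhead) / 1e9

for name, n in [("Llama 2-7B", 7e9), ("Llama 2-70B", 70e9), ("405B", 405e9)]:
    print(f"{name}: ~{gptq_weight_gb(n):.1f} GB of quantized weights")
```

This lands close to the table above (~3.6 GB, ~36 GB, ~210 GB of weights).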

Inference speed (Llama 2-7B, A100)

| Precision | Tokens/sec | vs FP16 |
|-----------|------------|---------|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |

Accuracy (perplexity on WikiText-2)

| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|-------|------|--------------------|-------------|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |

Excellent quality preservation - less than 2% degradation!
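
For reference, perplexity figures like these are usually computed with a loop along these lines (a sketch assuming non-overlapping 2048-token windows on WikiText-2; published numbers may use slightly different protocols):

```python
import torch
from datasets import load_dataset

def wikitext2_perplexity(model, tokenizer, seq_len: int = 2048, device: str = "cuda"):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nlls = []
    for i in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, i:i + seq_len].to(device)
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL over this window
        nlls.append(loss)
    return torch.exp(torch.stack(nlls).mean()).item()
```

Running it once on an FP16 checkpoint and once on the GPTQ checkpoint gives the two columns being compared.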

Common patterns

Multi-GPU deployment

```python
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto",                 # Automatically split across GPUs
    max_memory={0: "40GB", 1: "40GB"}  # Limit per GPU
)

# Manual device mapping (accelerate expects one entry per module,
# so layer ranges are spelled out per layer for an 80-layer model)
device_map = {
    "model.embed_tokens": 0,
    "model.norm": 1,
    "lm_head": 1,
}
device_map.update({f"model.layers.{i}": 0 for i in range(40)})      # first 40 layers on GPU 0
device_map.update({f"model.layers.{i}": 1 for i in range(40, 80)})  # last 40 layers on GPU 1

model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)
```

CPU offloading

```python
# Offload some layers to CPU (for very large models)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-405B-GPTQ",
    device_map="auto",
    max_memory={
        0: "80GB",       # GPU 0
        1: "80GB",       # GPU 1
        2: "80GB",       # GPU 2
        "cpu": "200GB"   # Offload overflow to CPU
    }
)
```

Batch inference

```python
# Process multiple prompts efficiently
prompts = [
    "Explain AI",
    "Explain ML",
    "Explain DL"
]

# Llama tokenizers ship without a pad token; reuse EOS and left-pad for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {tokenizer.decode(output, skip_special_tokens=True)}")
```

Finding pre-quantized models

TheBloke on HuggingFace:

  • https://huggingface.co/TheBloke
  • 1000+ models in GPTQ format
  • Multiple group sizes (32, 128)
  • Both CUDA and Marlin formats

Search: https://huggingface.co/models?library=gptq
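
The same search can be done programmatically with huggingface_hub (a hedged sketch; the exact filter semantics can differ between huggingface_hub versions):

```python
from huggingface_hub import list_models

# List a few GPTQ-tagged checkpoints on the Hub
for m in list_models(filter="gptq", limit=10):
    print(m.id)
```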

Download:

```python
from auto_gptq import AutoGPTQForCausalLM

# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-Chat-GPTQ",
    device="cuda:0"
)
```

Supported models

  • LLaMA family: Llama 2, Llama 3, Code Llama
  • Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: V2, V3
  • Phi: Phi-2, Phi-3
  • Yi, Falcon, BLOOM, OPT
  • 100+ models on HuggingFace

References

  • [Calibration Guide](references/calibration.md) - Dataset selection, quantization process, quality optimization
  • [Integration Guide](references/integration.md) - Transformers, PEFT, vLLM, TensorRT-LLM
  • [Troubleshooting](references/troubleshooting.md) - Common issues, performance optimization

Resources

  • GitHub: https://github.com/AutoGPTQ/AutoGPTQ
  • Paper: GPTQ: Accurate Post-Training Quantization (arXiv:2210.17323)
  • Models: https://huggingface.co/models?library=gptq
  • Discord: https://discord.gg/autogptq