vLLM Optimization — Batching, Quantization, and Throughput Tuning 2026
vLLM is one of the fastest open-source LLM inference engines, but its out-of-the-box settings leave performance on the table. PagedAttention alone can deliver up to roughly 20x the throughput of naive per-request inference, yet the defaults aren't tuned for your hardware. Here's how to tune batching, quantization, and memory to squeeze out maximum throughput.
Baseline: What vLLM Does Differently
Traditional inference: each request reserves its own contiguous KV cache region, sized for the longest sequence it might produce. Most of that reservation sits unused as padding.
vLLM PagedAttention: KV cache is managed like OS virtual memory — allocated in fixed-size pages, shared when possible. Sequences pack efficiently into GPU memory.
Result: higher batch sizes → higher throughput.
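To make the padding argument concrete, here is a minimal sketch that compares how many token slots each scheme reserves for the same set of sequences. The sequence lengths are made up for illustration, and the 16-token block size is an assumption about a typical vLLM configuration rather than something taken from the text above.

```python
import math

def contiguous_tokens_reserved(seq_lens, max_model_len):
    """Naive allocation: every sequence reserves max_model_len token slots."""
    return len(seq_lens) * max_model_len

def paged_tokens_reserved(seq_lens, block_size=16):
    """Paged allocation: each sequence holds ceil(len / block_size) fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [120, 900, 35, 2048]  # hypothetical current sequence lengths, in tokens
max_model_len = 8192

used = sum(seq_lens)
naive = contiguous_tokens_reserved(seq_lens, max_model_len)
paged = paged_tokens_reserved(seq_lens)

print(f"tokens actually in use: {used}")
print(f"contiguous: {naive} slots reserved ({used / naive:.0%} utilized)")
print(f"paged:      {paged} slots reserved ({used / paged:.0%} utilized)")
```

With paging, the only waste is the partially filled last block of each sequence, which is why many more sequences fit in the same VRAM.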
Key Parameters to Tune
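# --gpu-memory-utilization: fraction of GPU VRAM for model weights + KV cache
# --max-model-len: max sequence length (prompt + generated tokens)
# --max-num-batched-tokens: tokens processed per iteration
# --max-num-seqs: max concurrent sequences
# --tensor-parallel-size: number of GPUs for tensor parallelism
# --dtype: precision
# --quantization: quantization method (the checkpoint must already be quantized in that format)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --quantization awq
GPU Memory Utilization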
# Default: 0.90 (90% of VRAM)
--gpu-memory-utilization 0.95 # More KV cache = more concurrent requests
# Lower if you see CUDA OOM:
--gpu-memory-utilization 0.85
On a T4 (16GB), at 0.90 utilization with Mistral 7B AWQ:
- ~10GB for model weights
- ~4.4GB for KV cache → ~40 concurrent sequences
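How many sequences a given KV cache budget holds depends on how long those sequences are. The sketch below works through the arithmetic for the ~4.4GB budget above. It assumes Mistral 7B's published architecture (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache, and it ignores block rounding and other runtime overhead, so treat the results as rough estimates.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: a key tensor and a value tensor are stored per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(kv_cache_bytes, bytes_per_token):
    """Whole tokens that fit in the KV cache budget."""
    return kv_cache_bytes // bytes_per_token

# Assumed Mistral 7B shape: 32 layers, 8 KV heads (grouped-query attention), head dim 128
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = int(4.4e9)  # the ~4.4GB KV cache budget from the T4 example
tokens = max_cached_tokens(budget, per_token)

print(f"{per_token // 1024} KiB per token, room for ~{tokens} cached tokens")
for avg_len in (512, 1024, 8192):
    print(f"avg {avg_len} tokens/seq -> ~{tokens // avg_len} concurrent sequences")
```

Under these assumptions, the ~40 concurrent sequences figure corresponds to sequences averaging roughly 800 tokens each. Requests that run to the full 8192-token limit would fit only a handful at a time.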
Quantization: AWQ vs GPTQ vs FP8
# AWQ (recommended for most cases)
--quantization awq \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
# GPTQ
--quantization gptq \
--model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ
# FP8 (hardware FP8 needs compute capability 8.9+, e.g. L4/L40S or H100; the A100 is 8.0)
--quantization fp8 \
--dtype float16
Benchmark comparison (Mistral 7B, T4 GPU):
| Quantization | VRAM | Tok/s | Quality |
|---|---|---|---|
| float16 (full) | 14GB | 35 | Best |
| AWQ 4-bit | 4.5GB | 42 | -2% vs full |
| GPTQ 4-bit | 4.5GB | 38 | -3% vs full |
| GGUF (llama.cpp) | 4GB | 20 | -5% vs full |
AWQ wins on throughput + quality balance.
Continuous Batching Tuning
vLLM uses continuous batching — requests join and leave mid-batch. Two key params:
# Max tokens processed in one forward pass
--max-num-batched-tokens 16384 # Increase for higher throughput (needs more VRAM)
# Max concurrent sequences
--max-num-seqs 512 # How many requests in flight simultaneously
Finding the sweet spot:
# Load test to find optimal settings
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"

async def send_request(client, semaphore, prompt):
    # The semaphore caps in-flight requests at the requested concurrency
    async with semaphore:
        start = time.perf_counter()
        resp = await client.post(URL, json={
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": prompt,
            "max_tokens": 200,
        })
        resp.raise_for_status()  # Fail loudly instead of timing error responses
        return time.perf_counter() - start

async def load_test(concurrency: int, num_requests: int):
    semaphore = asyncio.Semaphore(concurrency)
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(timeout=60, limits=limits) as client:
        tasks = [send_request(client, semaphore, "Explain Kubernetes in 3 sentences")
                 for _ in range(num_requests)]
        start = time.perf_counter()
        latencies = sorted(await asyncio.gather(*tasks))
        total_time = time.perf_counter() - start
    print(f"Concurrency: {concurrency}")
    print(f"Throughput: {num_requests / total_time:.1f} req/s")
    print(f"P50 latency: {latencies[len(latencies) // 2]:.2f}s")
    print(f"P99 latency: {latencies[int(len(latencies) * 0.99)]:.2f}s")

asyncio.run(load_test(50, 200))
Multi-GPU with Tensor Parallelism
For models larger than single GPU VRAM (e.g., Llama 3 70B):
# Split model across 2 GPUs (--tensor-parallel-size 2 requires 2 GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90
For Llama 3 70B:
- float16: needs ~140GB for weights → 2x A100 80GB
- AWQ 4-bit: needs ~40GB for weights → 1x A100 80GB or 2x A10G (24GB each)
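These sizing figures follow from parameter count times bits per weight. Here is a rough sketch of that arithmetic. The helper names are hypothetical, and `min_gpus` is only a lower bound: it counts weights alone, leaves nothing for KV cache, and ignores the requirement that the tensor-parallel size divide the model's attention heads evenly.

```python
import math

def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def min_gpus(weights_gb, gpu_vram_gb, utilization=0.90):
    """Fewest GPUs whose usable VRAM can hold the weights alone."""
    usable_per_gpu = gpu_vram_gb * utilization
    return math.ceil(weights_gb / usable_per_gpu)

params = 70e9  # Llama 3 70B
fp16_gb = weight_memory_gb(params, bits_per_weight=16)
awq_gb = weight_memory_gb(params, bits_per_weight=4)

print(f"float16: {fp16_gb:.0f} GB -> {min_gpus(fp16_gb, 80)}x A100 80GB")
print(f"4-bit:   {awq_gb:.0f} GB -> {min_gpus(awq_gb, 80)}x A100 80GB "
      f"or {min_gpus(awq_gb, 24)}x A10G")
```

The raw 4-bit figure comes out below the ~40GB quoted above. Quantization scales, zero points, and any layers kept at higher precision plausibly account for the difference.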
Speculative Decoding (2x Speed on Short Outputs)
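# Use a small draft model to predict tokens, large model verifies
# Huge speedup when outputs are short and predictable
# Draft model is Llama 3.2 1B: Llama 3 itself has no 1B checkpoint, and 3.2 shares its tokenizer
# Newer vLLM releases may expect a single --speculative-config JSON argument instead; check --help for your version
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-model "meta-llama/Llama-3.2-1B" \
  --num-speculative-tokens 5
Best for: code completion, structured output, SQL generation, where next tokens are predictable.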
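The draft-then-verify idea can be sketched with toy models. This is an illustration of greedy speculative decoding, not vLLM's implementation: the "models" here are plain functions that map a token list to the next token, and the verifier is called once per position for clarity, whereas a real engine scores all proposed positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens produced by one greedy draft-and-verify step."""
    # 1. The draft model proposes k tokens autoregressively
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model keeps the longest prefix it agrees with
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one extra token
    accepted.append(target(ctx))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs right after token 3
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step(target, draft, [1], k=4))
print(speculative_step(target, target, [1], k=3))
```

The output always matches what the target model would have generated on its own. The speedup comes from accepting several tokens per verification step, which is why predictable outputs benefit most.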
Production Config: g5.2xlarge (A10G, 24GB)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--gpu-memory-utilization 0.92 \
--max-model-len 8192 \
--max-num-batched-tokens 16384 \
--max-num-seqs 256 \
--dtype float16 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name mistral-7b \
--uvicorn-log-level warning
Expected performance: ~80 tokens/second at moderate load, ~60 req/min with 200-token responses.
Monitor Performance
# vLLM exposes Prometheus metrics at /metrics
# Key metrics to watch:
# vllm:num_requests_running — current batch size
# vllm:gpu_cache_usage_perc — KV cache utilization
# vllm:generation_tokens_total — throughput
# vllm:time_to_first_token_seconds — TTFT latency
# If gpu_cache_usage_perc > 95% consistently → reduce max_num_seqs
# If num_requests_running is always low → increase max_num_seqs
Start with AWQ quantization + 90% GPU memory utilization. Run a load test at your expected concurrency. Tune max-num-seqs until P99 latency is acceptable. For most teams, that's the 80% solution, with no need to go deeper.
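The two metric rules of thumb for max-num-seqs can be turned into a small check. Below is a minimal sketch that parses Prometheus text output with the standard library and applies them. The sample payload, the `low_running` threshold, and the function names are illustrative assumptions, and the sketch assumes the cache gauge is reported as a fraction between 0 and 1. A single scrape is also only a snapshot, so in practice you would look at a trend before changing anything.

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text output into {metric_name: value}, ignoring labels."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, raw = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the {label="..."} part
        try:
            values[name] = float(raw)  # later series with the same name overwrite
        except ValueError:
            continue
    return values

def suggest(metrics: dict[str, float], low_running: float = 4) -> str:
    """Apply the two tuning rules to one metrics snapshot."""
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    running = metrics.get("vllm:num_requests_running", 0.0)
    if cache > 0.95:
        return "reduce --max-num-seqs"
    if running < low_running:
        return "increase --max-num-seqs"
    return "ok"

# Illustrative payload, not captured from a real server
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistral-7b"} 212.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistral-7b"} 0.97
"""

snapshot = parse_prometheus(SAMPLE)
print(snapshot)
print(suggest(snapshot))
```

Against a live server, the text would come from something like `urllib.request.urlopen("http://localhost:8000/metrics").read().decode()`, assuming the host and port from the production config.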
Run your vLLM instances on AWS EC2 G5 instances — A10G GPUs with 24GB VRAM, good price/performance for production LLM serving.