
vLLM Optimization — Batching, Quantization, and Throughput Tuning 2026

vLLM is one of the fastest open-source LLM inference engines, but its out-of-the-box settings leave performance on the table. Here's how to tune batching, quantization, and memory to maximize throughput.

DevOpsBoys · Jun 4, 2026 · 4 min read

vLLM's PagedAttention can deliver up to 20x higher throughput than naive inference, depending on model and workload. But the default settings aren't tuned for your hardware. Here's how to squeeze out maximum performance.


Baseline: What vLLM Does Differently

Traditional inference: each request reserves its own contiguous KV cache region, sized for the longest sequence it might produce. Most of that reservation is padding that is never used.

vLLM PagedAttention: KV cache is managed like OS virtual memory — allocated in fixed-size pages, shared when possible. Sequences pack efficiently into GPU memory.

Result: higher batch sizes → higher throughput.
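
To make the difference concrete, here is a toy accounting sketch. It is not vLLM's allocator: it only compares how many token slots get reserved when every request holds a full max-length region versus when KV cache grows in fixed-size blocks. The 16-token block size, the request lengths, and the function names are all illustrative assumptions.

```python
import math

def contiguous_slots(seq_lens, max_model_len):
    """Token slots reserved when each request gets a full max-length region."""
    return len(seq_lens) * max_model_len

def paged_slots(seq_lens, block_size=16):
    """Token slots reserved when KV cache is handed out in fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [120, 900, 45, 2300]      # hypothetical current lengths, in tokens
used = sum(seq_lens)
reserved_contiguous = contiguous_slots(seq_lens, max_model_len=8192)
reserved_paged = paged_slots(seq_lens)

print(f"contiguous: {used / reserved_contiguous:.1%} of reserved slots in use")
print(f"paged:      {used / reserved_paged:.1%} of reserved slots in use")
```

With paging, the only waste is the partially filled last block of each sequence, which is why many more sequences fit in the same VRAM.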


Key Parameters to Tune

bash
# --gpu-memory-utilization: fraction of GPU VRAM for model + KV cache
# --max-model-len:          max sequence length
# --max-num-batched-tokens: tokens processed per iteration
# --max-num-seqs:           max concurrent sequences
# --tensor-parallel-size:   GPUs for tensor parallelism
# --dtype:                  precision
# --quantization:           quantization method (the checkpoint must already be quantized this way)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --quantization awq

GPU Memory Utilization

bash
# Default: 0.90 (90% of VRAM)
--gpu-memory-utilization 0.95   # More KV cache = more concurrent requests
 
# Lower if you see CUDA OOM:
--gpu-memory-utilization 0.85

On a T4 (16GB), 0.90 utilization gives vLLM a budget of about 14.4GB. With Mistral 7B AWQ:

  • ~4.5GB for model weights (4-bit)
  • ~10GB for KV cache → ~40 concurrent sequences at roughly 2K tokens each
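
The split between weights and KV cache can be estimated with a few lines of arithmetic. This is a rough sketch, not how vLLM profiles memory. It assumes the public Mistral 7B shape (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache, and it takes the 4.5GB AWQ weight figure from the benchmark table in this article. It treats GB as GiB and ignores activation workspace and CUDA context. The helper names are hypothetical.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_tokens(vram_gib, utilization, weights_gib, bytes_per_token):
    """Tokens that fit in the VRAM budget left after loading the weights."""
    budget_gib = vram_gib * utilization - weights_gib
    return int(budget_gib * GIB // bytes_per_token)

# Assumed Mistral 7B shape; adjust for your model's config.json.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
tokens = kv_cache_tokens(vram_gib=16, utilization=0.90,
                         weights_gib=4.5, bytes_per_token=per_token)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Tokens that fit:    {tokens}")
print(f"Sequences at 2048 tokens each: {tokens // 2048}")
```

The number of concurrent sequences is not fixed: it scales inversely with how long each sequence is, so shorter prompts and outputs mean more requests in flight.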

Quantization: AWQ vs GPTQ vs FP8

bash
# AWQ (recommended for most cases)
--quantization awq \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
 
# GPTQ
--quantization gptq \
--model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ
 
# FP8 (native FP8 compute needs compute capability 8.9+: Ada Lovelace or Hopper/H100;
# A100 is 8.0 and can only run FP8 checkpoints as weight-only)
--quantization fp8 \
--dtype float16

Benchmark comparison (Mistral 7B, T4 GPU):

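| Quantization     | VRAM  | Tok/s | Quality     |
|------------------|-------|-------|-------------|
| float16 (full)   | 14GB  | 35    | Best        |
| AWQ 4-bit        | 4.5GB | 42    | -2% vs full |
| GPTQ 4-bit       | 4.5GB | 38    | -3% vs full |
| GGUF (llama.cpp) | 4GB   | 20    | -5% vs full |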

AWQ wins on throughput + quality balance.


Continuous Batching Tuning

vLLM uses continuous batching — requests join and leave mid-batch. Two key params:

bash
# Max tokens processed in one forward pass
--max-num-batched-tokens 16384  # Increase for higher throughput (needs more VRAM)
 
# Max concurrent sequences
--max-num-seqs 512  # How many requests in flight simultaneously

Finding the sweet spot:

python
# Load test to find optimal settings
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"  # must match the name the server is serving


async def send_request(client, semaphore, prompt):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        start = time.perf_counter()
        resp = await client.post(URL, json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": 200,
        })
        resp.raise_for_status()  # do not count error responses as successful timings
        return time.perf_counter() - start


async def load_test(concurrency: int, num_requests: int):
    semaphore = asyncio.Semaphore(concurrency)
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(timeout=60, limits=limits) as client:
        tasks = [send_request(client, semaphore, "Explain Kubernetes in 3 sentences")
                 for _ in range(num_requests)]

        start = time.perf_counter()
        latencies = sorted(await asyncio.gather(*tasks))
        total_time = time.perf_counter() - start

        print(f"Concurrency: {concurrency}")
        print(f"Throughput: {num_requests / total_time:.1f} req/s")
        print(f"P50 latency: {latencies[len(latencies) // 2]:.2f}s")
        print(f"P99 latency: {latencies[int(len(latencies) * 0.99)]:.2f}s")


asyncio.run(load_test(50, 200))
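
Once the load test has been run at several concurrency levels, picking a setting can be as simple as taking the highest-throughput run that still meets a P99 target. The sketch below shows one way to do that. `RunResult` and `pick_sweet_spot` are hypothetical names, and the numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    concurrency: int
    throughput_rps: float   # requests per second
    p99_s: float            # P99 latency in seconds

def pick_sweet_spot(results, p99_slo_s):
    """Return the highest-throughput run whose P99 latency meets the target."""
    within_slo = [r for r in results if r.p99_s <= p99_slo_s]
    if not within_slo:
        return None
    return max(within_slo, key=lambda r: r.throughput_rps)

# Placeholder values standing in for your own load test output.
results = [
    RunResult(concurrency=10, throughput_rps=4.0, p99_s=2.8),
    RunResult(concurrency=25, throughput_rps=8.5, p99_s=3.4),
    RunResult(concurrency=50, throughput_rps=12.0, p99_s=4.9),
    RunResult(concurrency=100, throughput_rps=13.0, p99_s=9.7),
]

best = pick_sweet_spot(results, p99_slo_s=5.0)
print(best)
```

In this placeholder data the last step doubles concurrency for a small throughput gain and a large latency penalty, which is the usual sign that the server is already saturated.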

Multi-GPU with Tensor Parallelism

For models larger than single GPU VRAM (e.g., Llama 3 70B):

bash
# Split model across 2 GPUs (--tensor-parallel-size 2 requires 2 GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90

For Llama 3 70B:

  • float16: needs 140GB for weights alone → 2x A100 80GB at minimum, with little room left for KV cache
  • AWQ 4-bit: needs ~40GB → 1x A100 80GB or 2x A10G
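
These sizing figures follow from simple arithmetic on parameter count and bits per weight. The sketch below is a back-of-the-envelope check, not a vLLM API: the helper names are hypothetical, it counts weights only, and real deployments also need headroom for KV cache and quantization metadata (which is why 4-bit lands nearer 40GB than 35GB).

```python
import math

def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters times bytes per parameter."""
    return params_billion * bits_per_weight / 8

def min_gpus(required_gb, gpu_vram_gb, utilization=0.90):
    """Smallest GPU count whose usable VRAM covers the requirement."""
    return math.ceil(required_gb / (gpu_vram_gb * utilization))

fp16_gb = weight_memory_gb(70, 16)   # Llama 3 70B in float16
int4_gb = weight_memory_gb(70, 4)    # same model at 4 bits per weight

print(f"float16 weights: {fp16_gb:.0f}GB -> {min_gpus(fp16_gb, 80)}x A100 80GB")
print(f"4-bit weights:   {int4_gb:.0f}GB before overhead")
print(f"~40GB with overhead -> {min_gpus(40, 24)}x A10G or {min_gpus(40, 80)}x A100 80GB")
```

Treat the GPU count as a lower bound, and round it up to a tensor-parallel size the model supports, which in practice usually means a power of two.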

Speculative Decoding (2x Speed on Short Outputs)

bash
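# Use a small draft model to predict tokens, large model verifies
# Huge speedup when outputs are short and predictable
# The draft model must share the target's tokenizer (Llama 3.2 1B uses the Llama 3 vocabulary)
# Some newer vLLM releases take these options as JSON via --speculative-config; check --help for your version
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-model "meta-llama/Llama-3.2-1B-Instruct" \
  --num-speculative-tokens 5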

Best for: code completion, structured output, SQL generation — where next tokens are predictable.
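
The draft-then-verify loop is easier to see in a toy version. The sketch below is a simplified greedy variant, assuming deterministic "models" that are just lookup tables. vLLM's real implementation verifies all draft tokens in one batched forward pass and uses probabilistic acceptance, so treat every name here as hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One step: draft proposes k tokens, target keeps the matching prefix."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # all drafts accepted: one bonus token
    return accepted

def generate(draft_next, target_next, length, k):
    """Run steps until `length` tokens exist; count target verification passes."""
    out, target_passes = [], 0
    while len(out) < length:
        out.extend(speculative_step(out, draft_next, target_next, k))
        target_passes += 1
    return out[:length], target_passes

def lookup_model(seq):
    """Toy model: always emits the token at the current position of `seq`."""
    return lambda ctx: seq[len(ctx)] if len(ctx) < len(seq) else "<eos>"

TARGET = "SELECT name FROM users WHERE id = 1 ;".split()
DRAFT = "SELECT name FROM users ORDER id = 1 ;".split()   # one wrong guess

out, passes = generate(lookup_model(DRAFT), lookup_model(TARGET), len(TARGET), k=3)
print(out == TARGET, f"{passes} target passes for {len(TARGET)} tokens")
```

The output always matches what the target model alone would have produced. The saving comes from needing fewer target passes, and it shrinks as the draft model guesses wrong more often.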


Production Config: g5.2xlarge (A10G, 24GB)

bash
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name mistral-7b \
  --uvicorn-log-level warning

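Expected performance: ~80 tokens/second per request at moderate load, and ~60 req/min with 200-token responses (about 200 tokens/second in aggregate). Treat these as rough figures and confirm them with a load test on your own traffic.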


Monitor Performance

python
# vLLM exposes Prometheus metrics at /metrics
# Key metrics to watch:
# vllm:num_requests_running — current batch size
# vllm:gpu_cache_usage_perc — KV cache utilization
# vllm:generation_tokens_total — throughput
# vllm:time_to_first_token_seconds — TTFT latency
 
# If gpu_cache_usage_perc > 95% consistently → reduce max_num_seqs
# If num_requests_running sits at max_num_seqs while vllm:num_requests_waiting
# keeps growing and the cache has headroom → increase max_num_seqs
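
These checks can be scripted against the /metrics endpoint. The sketch below parses Prometheus text output and applies one reading of the rules above. It assumes that gpu_cache_usage_perc is reported as a 0 to 1 fraction and that lines carry no trailing timestamps. `parse_metrics`, `advise`, and the sample payload are hypothetical, and a real check should look at a time window rather than a single scrape.

```python
def parse_metrics(text):
    """Parse Prometheus text format into {metric_name: value}, summing label sets."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = values.get(name, 0.0) + float(raw_value)
        except ValueError:
            continue   # skip lines that do not end in a number
    return values

def advise(metrics, max_num_seqs):
    """Suggest a max_num_seqs change from a single metrics snapshot."""
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    running = metrics.get("vllm:num_requests_running", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    if cache > 0.95:
        return "reduce max_num_seqs"
    if running >= max_num_seqs and waiting > 0:
        return "increase max_num_seqs"
    return "ok"

# Illustrative payload; in practice use httpx.get("http://localhost:8000/metrics").text
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistral-7b"} 256.0
vllm:num_requests_waiting{model_name="mistral-7b"} 12.0
vllm:gpu_cache_usage_perc{model_name="mistral-7b"} 0.61
"""

metrics = parse_metrics(SAMPLE)
print(advise(metrics, max_num_seqs=256))
```

Metric names have changed between vLLM releases, so check the names your server actually exposes before wiring this into alerting.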

Start with AWQ quantization + 90% GPU memory utilization. Run a load test at your expected concurrency. Tune max-num-seqs until P99 latency is acceptable. For most teams, that's the 80% solution — no need to go deeper.

Run your vLLM instances on AWS EC2 G5 instances — A10G GPUs with 24GB VRAM, good price/performance for production LLM serving.
