vLLM Optimization — Batching, Quantization, and Throughput Tuning 2026

vLLM is one of the fastest open-source LLM inference engines: its PagedAttention memory manager can deliver up to about 20x higher throughput than naive inference. But the out-of-the-box settings aren't tuned for your hardware and leave performance on the table. Here's how to tune batching, quantization, and memory to maximize throughput.

Baseline: What vLLM Does Differently

Traditional inference: each request reserves one contiguous KV cache region sized for its maximum possible length. Memory is wasted on reserved-but-unused slots, padding, and fragmentation.

vLLM PagedAttention: KV cache is managed like OS virtual memory — allocated in fixed-size pages, shared when possible. Sequences pack efficiently into GPU memory.

Result: higher batch sizes → higher throughput.
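
The paging idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's actual implementation: the class name `ToyPagedCache`, the 16-token block size, and the method names are all assumptions made for illustration. It shows only the bookkeeping: a sequence gets a new fixed-size block when its current one fills up, and blocks return to a shared pool when the sequence finishes.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM constant)


@dataclass
class ToyPagedCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""
    num_blocks: int
    free: list[int] = field(default_factory=list)
    tables: dict[str, list[int]] = field(default_factory=dict)
    lengths: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every block starts out in the shared free pool.
        self.free = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full, or the sequence is new
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = ToyPagedCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block
print(len(cache.tables["a"]), len(cache.tables["b"]), len(cache.free))  # 2 1 5
```

A 5-token sequence holds one 16-token block rather than a full max-length reservation, which is why more sequences fit in the same memory.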

Key Parameters to Tune

bash

# --gpu-memory-utilization: fraction of GPU VRAM for model weights + KV cache
# --max-model-len: max sequence length (prompt + output)
# --max-num-batched-tokens: tokens processed per iteration
# --max-num-seqs: max concurrent sequences
# --tensor-parallel-size: GPUs for tensor parallelism
# --dtype: precision
# --quantization: quantization method (the checkpoint must be quantized to match)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --quantization awq

GPU Memory Utilization

bash

# Default: 0.90 (90% of VRAM)
--gpu-memory-utilization 0.95   # More KV cache = more concurrent requests
 
# Lower if you see CUDA OOM:
--gpu-memory-utilization 0.85

On a T4 (16GB), 0.90 utilization gives vLLM about 14.4GB. With Mistral 7B AWQ:

~4.4GB for model weights
~10GB for KV cache → ~40 concurrent sequences at roughly 2K tokens each
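
The KV cache budget can be turned into a token count with simple arithmetic. The sketch below assumes Mistral 7B's published architecture (32 layers, 8 key/value heads, head dimension 128) and an fp16 KV cache. The function names and the example budgets are illustrative, and real capacity will be somewhat lower because activations and CUDA graphs also take memory.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_cache_gib: float, bytes_per_token: int) -> int:
    """How many tokens fit in a KV cache budget given in GiB."""
    return int(kv_cache_gib * 1024**3) // bytes_per_token


# Mistral 7B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{per_token // 1024} KiB per token")

for budget_gib in (4.0, 10.0):  # example budgets, not measurements
    tokens = max_cached_tokens(budget_gib, per_token)
    print(f"{budget_gib} GiB -> {tokens} tokens, "
          f"{tokens // 2048} sequences of 2048 tokens")
```

At 128 KiB per token, a 10 GiB budget holds 81,920 tokens, or 40 sequences of 2,048 tokens each. Shorter requests raise the achievable concurrency and longer ones lower it.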

Quantization: AWQ vs GPTQ vs FP8

bash

# AWQ (recommended for most cases)
--quantization awq \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
 
# GPTQ
--quantization gptq \
--model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ
 
# FP8 (native support needs compute capability 8.9+: Ada Lovelace such as L4/L40S, or Hopper such as H100; A100 is 8.0 and has no FP8 tensor cores)
--quantization fp8 \
--dtype float16

Benchmark comparison (Mistral 7B, T4 GPU):

Quantization	VRAM	Tok/s	Quality
float16 (full)	14GB	35	Best
AWQ 4-bit	4.5GB	42	-2% vs full
GPTQ 4-bit	4.5GB	38	-3% vs full
GGUF (llama.cpp)	4GB	20	-5% vs full

AWQ wins on throughput + quality balance.

Continuous Batching Tuning

vLLM uses continuous batching — requests join and leave mid-batch. Two key params:

bash

# Max tokens processed in one forward pass
--max-num-batched-tokens 16384  # Increase for higher throughput (needs more VRAM)
 
# Max concurrent sequences
--max-num-seqs 512  # How many requests in flight simultaneously

Finding the sweet spot:

python

# Load test to find optimal settings
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"
# Must match the server's --model (or --served-model-name, if set)
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


async def send_request(client, sem, prompt):
    async with sem:  # cap the number of in-flight requests at `concurrency`
        start = time.perf_counter()
        resp = await client.post(URL, json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": 200,
        })
        resp.raise_for_status()  # fail loudly instead of timing error responses
        return time.perf_counter() - start


async def load_test(concurrency: int, num_requests: int):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(timeout=60) as client:
        tasks = [send_request(client, sem, "Explain Kubernetes in 3 sentences")
                 for _ in range(num_requests)]

        start = time.perf_counter()
        latencies = sorted(await asyncio.gather(*tasks))
        total_time = time.perf_counter() - start

    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    print(f"Concurrency: {concurrency}")
    print(f"Throughput: {num_requests / total_time:.1f} req/s")
    print(f"P50 latency: {latencies[len(latencies) // 2]:.2f}s")
    print(f"P99 latency: {latencies[p99_index]:.2f}s")


asyncio.run(load_test(50, 200))
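
Once the load test has been run at several concurrency levels, picking the sweet spot is a matter of taking the highest throughput that still meets a latency budget. The helper below is one way to sketch that selection. `RunResult`, `pick_sweet_spot`, and the 10-second budget are hypothetical names and values, and the numbers in `runs` are placeholders for illustration, not measurements.

```python
from typing import NamedTuple, Optional


class RunResult(NamedTuple):
    concurrency: int
    throughput_rps: float
    p99_latency_s: float


def pick_sweet_spot(runs: list[RunResult],
                    p99_budget_s: float) -> Optional[RunResult]:
    """Return the highest-throughput run whose P99 latency fits the budget."""
    within_budget = [r for r in runs if r.p99_latency_s <= p99_budget_s]
    return max(within_budget, key=lambda r: r.throughput_rps, default=None)


# Placeholder numbers for illustration only; replace with your own results.
runs = [
    RunResult(concurrency=10, throughput_rps=2.0, p99_latency_s=3.0),
    RunResult(concurrency=50, throughput_rps=6.0, p99_latency_s=7.5),
    RunResult(concurrency=100, throughput_rps=7.0, p99_latency_s=15.0),
]
best = pick_sweet_spot(runs, p99_budget_s=10.0)
print(best)
```

With these placeholder values the run at concurrency 50 is selected: the run at 100 has higher throughput but exceeds the P99 budget.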

Multi-GPU with Tensor Parallelism

For models larger than single GPU VRAM (e.g., Llama 3 70B):

bash

# Split model across 2 GPUs (--tensor-parallel-size 2 requires 2 GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90

For Llama 3 70B:

float16: ~140GB of weights → 2x A100 80GB (leaves little room for KV cache)
AWQ 4-bit: ~40GB → 1x A100 80GB, or 2x A10G with --tensor-parallel-size 2
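
These sizes follow from parameter count times bytes per parameter. Here is a rough sketch of that arithmetic, assuming 2 bytes per parameter for float16 and 0.5 bytes for 4-bit weights. The helper names are illustrative, and the estimate ignores quantization scales, activations, and the KV cache, so real requirements are higher (which is why 4-bit 70B lands nearer 40GB than 35GB).

```python
import math

# Approximate bytes per parameter for each weight format
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "awq-4bit": 0.5}


def weight_gb(num_params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return num_params_billion * BYTES_PER_PARAM[precision]


def min_gpus(weights_gb: float, gpu_gb: float, utilization: float = 0.90) -> int:
    """Smallest GPU count whose usable memory covers the weights alone."""
    return math.ceil(weights_gb / (gpu_gb * utilization))


fp16 = weight_gb(70, "float16")
awq = weight_gb(70, "awq-4bit")
print(f"float16: {fp16:.0f} GB -> {min_gpus(fp16, gpu_gb=80)}x 80GB GPU")
print(f"AWQ 4-bit: {awq:.0f} GB -> {min_gpus(awq, gpu_gb=24)}x 24GB GPU")
```

Treat the result as a lower bound on `--tensor-parallel-size`: the value also has to divide the model's attention head count evenly, so it is normally a power of two.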

Speculative Decoding (2x Speed on Short Outputs)

bash

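# Use a small draft model to predict tokens; the large model verifies them
# Biggest speedup when outputs are short and predictable
# The draft model must share the target model's tokenizer and vocabulary
# Flag names vary by vLLM version: newer releases take a JSON --speculative-config instead
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-model "meta-llama/Llama-3.2-1B-Instruct" \
  --num-speculative-tokens 5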

Best for: code completion, structured output, SQL generation — where next tokens are predictable.
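
Why predictability matters can be made concrete with the standard expected-value formula for speculative decoding. The sketch below assumes every draft token is accepted independently with the same probability, which is a simplification, and the function name is illustrative. It counts tokens emitted per forward pass of the large model, which is an upper bound on speedup because the draft model's own cost is ignored.

```python
def expected_tokens_per_step(acceptance_rate: float,
                             num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; the target model always contributes one token itself.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if a == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1 - a ** (k + 1)) / (1 - a)


for rate in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_step(rate, num_speculative_tokens=5)
    print(f"acceptance={rate:.1f} -> {tokens:.2f} tokens per step")
```

With 5 speculative tokens, a 50% acceptance rate yields just under 2 tokens per step, while 90% yields more than 4. That gap is why predictable outputs benefit most.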

Production Config: g5.2xlarge (A10G, 24GB)

bash

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256 \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name mistral-7b \
  --uvicorn-log-level warning

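Expected performance: roughly 80 tokens/second per request at moderate load and ~60 req/min with 200-token responses (about 200 tokens/second in aggregate). Treat these as ballpark figures and confirm them with a load test on your own workload.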

Monitor Performance

python

# vLLM exposes Prometheus metrics at /metrics
# Key metrics to watch (exact names can differ between vLLM versions):
# vllm:num_requests_running: current batch size
# vllm:gpu_cache_usage_perc: KV cache utilization, reported as a 0-1 fraction
# vllm:generation_tokens_total: generated-token counter (take its rate for throughput)
# vllm:time_to_first_token_seconds: TTFT latency histogram
 
# If gpu_cache_usage_perc > 0.95 consistently → reduce max_num_seqs
# If num_requests_running is always low → increase max_num_seqs
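
The rules of thumb above can be checked automatically by scraping the endpoint. What follows is a minimal sketch of a parser for the Prometheus text format that uses only the standard library. It assumes samples carry no trailing timestamp and that a single model is served, since values are summed across label sets. The `SAMPLE` text is illustrative, not real server output, and `fetch_metrics` and its default URL are hypothetical.

```python
import urllib.request


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Values for the same metric name are summed across label sets.
    """
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the {label="..."} part
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return totals


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Scrape a running server (hypothetical default URL) and parse the result."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Illustrative sample, not real server output.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistral-7b"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="mistral-7b"} 0.97
"""

metrics = parse_prometheus_text(SAMPLE)
if metrics.get("vllm:gpu_cache_usage_perc", 0.0) > 0.95:
    print("KV cache nearly full: consider lowering --max-num-seqs")
```

In production the same thresholds usually belong in Prometheus alert rules; a script like this is mainly useful for quick checks during tuning.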

Start with AWQ quantization + 90% GPU memory utilization. Run a load test at your expected concurrency. Tune max-num-seqs until P99 latency is acceptable. For most teams, that's the 80% solution — no need to go deeper.

Run your vLLM instances on AWS EC2 G5 instances — A10G GPUs with 24GB VRAM, good price/performance for production LLM serving.
