How to Deploy Mistral 7B on AWS EC2 — Production Guide 2026
Step-by-step guide to deploying Mistral 7B on AWS EC2 for production use. Covers instance selection, quantization, serving with vLLM, and cost optimization.
Mistral 7B is one of the best open-source LLMs you can self-host. It outperforms Llama 2 13B on most benchmarks while being half the size. Deploying it on AWS EC2 gives you full control, no per-token costs, and data privacy.
Here's how to do it properly.
Instance Selection
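Mistral 7B in unquantized float16 needs ~14GB of VRAM for the weights alone (about 7.2B parameters at 2 bytes each), and the KV cache needs additional memory on top of that. With 4-bit quantization (GPTQ/AWQ), the weights fit in ~4–5GB of VRAM.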
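The VRAM figures follow from simple arithmetic on the parameter count. The sketch below reproduces them, assuming roughly 7.24 billion parameters and counting weights only; KV cache, activations, and quantization metadata are not included, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)

fp16_gb = weight_memory_gb(MISTRAL_7B_PARAMS, 16)
int4_gb = weight_memory_gb(MISTRAL_7B_PARAMS, 4)

print(f"float16 weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:   {int4_gb:.1f} GB (before scales, zero points, and runtime overhead)")
```

The gap between the raw 4-bit number and the ~4–5GB quoted for GPTQ/AWQ comes from the per-group quantization metadata and layers that stay in 16-bit.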
Recommended instances:
| Instance | GPU | VRAM | Cost/hr | Use Case |
|---|---|---|---|---|
| g4dn.xlarge | T4 | 16GB | ~$0.53 | Dev/testing, 4-bit quant |
| g4dn.2xlarge | T4 | 16GB | ~$0.75 | Low-traffic production |
| g5.xlarge | A10G | 24GB | ~$1.01 | Full precision, better throughput |
| g5.2xlarge | A10G | 24GB | ~$1.21 | Medium traffic production |
For cost-sensitive workloads, use g4dn.xlarge with 4-bit quantization — runs Mistral 7B comfortably at ~$380/month.
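To compare the instances on a monthly basis, multiply the hourly price by the hours in a month. This is a rough sketch using the approximate on-demand prices from the table above and an assumed 730 hours per month; actual prices vary by region and change over time.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

# Approximate on-demand hourly prices copied from the table above
HOURLY_USD = {
    "g4dn.xlarge": 0.53,
    "g4dn.2xlarge": 0.75,
    "g5.xlarge": 1.01,
    "g5.2xlarge": 1.21,
}


def monthly_cost(hourly_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running one instance continuously for a month."""
    return round(hourly_usd * hours, 2)


for name, price in HOURLY_USD.items():
    print(f"{name:14s} ${monthly_cost(price):,.2f}/month")
```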
Setup: Launch EC2 Instance
# Use Deep Learning AMI (comes with CUDA, PyTorch pre-installed)
# AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch (Ubuntu 22.04)
# Look for it in Community AMIs
# After launch, SSH in
ssh -i your-key.pem ubuntu@<ec2-ip>
# Verify GPU
nvidia-smi
# Should show T4 or A10G with ~16GB or 24GB VRAM
Install vLLM (Best Serving Framework)
vLLM is one of the fastest open-source LLM inference engines. Its PagedAttention memory management gives up to 24x higher throughput than HuggingFace Transformers in the vLLM team's own benchmarks.
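The intuition behind PagedAttention is memory accounting: instead of reserving a full-length KV cache per request, the cache is split into small fixed-size blocks that are handed out as a sequence grows. The toy model below only counts reserved token slots under the two schemes; it is not vLLM's implementation, and the block size of 16 and the request lengths are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (assumed for illustration)


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Token slots reserved when each request holds only the whole blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [37, 512, 90, 1500]  # hypothetical in-flight request lengths
max_len = 4096

print("tokens actually stored:", sum(seq_lens))
print("contiguous reservation:", contiguous_slots(seq_lens, max_len))
print("paged reservation:     ", paged_slots(seq_lens))
```

With less memory wasted per request, more requests fit on the GPU at once, which is where the batching throughput gain comes from.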
# Update system
sudo apt update && sudo apt upgrade -y
# Install vLLM
pip install vllm
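# Optional: only needed if you quantize a model yourself.
# vLLM can serve pre-quantized AWQ/GPTQ checkpoints without these packages.
pip install autoawq # AWQ quantization
# or
pip install auto-gptq # GPTQ quantization
Deploy Mistral 7B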
Option 1: Full Precision (g5.xlarge or larger)
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
Option 2: 4-bit Quantization (g4dn.xlarge)
# Use AWQ quantized model (weights take ~4–5GB VRAM; the rest is left for the KV cache)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
First run downloads the model from HuggingFace (~4GB for AWQ). Set HF_HOME=/mnt/data to store on a larger EBS volume.
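Before the first run, it can help to confirm that the cache location has room for the download. The sketch below uses only the standard library; the 1.5x headroom factor is an arbitrary assumption, and the fallback path is the default Hugging Face cache location used when HF_HOME is unset.

```python
import os
import shutil


def free_gb(path: str) -> float:
    """Free space on the filesystem containing path, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, needed_gb: float, headroom: float = 1.5) -> bool:
    """True if path has at least needed_gb * headroom free."""
    return free_gb(path) >= needed_gb * headroom


cache_dir = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))

# Walk up to the nearest existing directory so the check works before the first download
probe = cache_dir
while not os.path.exists(probe):
    probe = os.path.dirname(probe) or "/"

print(f"{probe}: {free_gb(probe):.1f} GB free, room for a ~4 GB download: {has_room(probe, 4)}")
```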
Test the API
vLLM serves an OpenAI-compatible API:
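# Basic test. The "model" value must match the --model the server was started with
# (use TheBloke/Mistral-7B-Instruct-v0.2-AWQ if you launched Option 2).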
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [
{"role": "user", "content": "Explain Kubernetes in 3 sentences"}
],
"max_tokens": 200,
"temperature": 0.7
}'
Python client:
from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="http://your-ec2-ip:8000/v1"
)
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[
{"role": "user", "content": "Write a Dockerfile for a Node.js app"}
]
)
print(response.choices[0].message.content)
Production Setup
1. Systemd Service (auto-restart on crash)
sudo nano /etc/systemd/system/mistral.service
[Unit]
Description=Mistral 7B vLLM Server
After=network.target
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
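# The interpreter path is an assumption: replace it with the output of `which python3`
# from the environment where you ran `pip install vllm`.
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \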
--model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.85
Restart=always
RestartSec=10
Environment=HF_HOME=/mnt/data/hf_cache
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable mistral
sudo systemctl start mistral
sudo systemctl status mistral
2. Nginx Reverse Proxy + API Key Auth
# /etc/nginx/sites-available/mistral
server {
listen 443 ssl;
server_name llm.yourcompany.com;
ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;
location / {
# Simple API key check
if ($http_authorization != "Bearer your-secret-key") {
return 401;
}
proxy_pass http://127.0.0.1:8000;
proxy_read_timeout 120s;
proxy_buffering off;
}
}
3. EC2 Security Group
Inbound:
- Port 22 (SSH) — your IP only
- Port 443 (HTTPS) — 0.0.0.0/0
Outbound:
- All traffic (for HuggingFace downloads)
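The inbound rules above can be expressed as the `IpPermissions` structure that boto3's `authorize_security_group_ingress` accepts. This is a sketch under assumptions: the admin IP comes from a documentation address range, the group ID in the commented call is a placeholder, and the actual API call is left commented out because it needs boto3 and AWS credentials.

```python
def ingress_rules(admin_cidr: str) -> list:
    """Build IpPermissions for: SSH from admin_cidr only, HTTPS from anywhere."""

    def tcp(port: int, cidr: str, desc: str) -> dict:
        return {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": desc}],
        }

    return [
        tcp(22, admin_cidr, "SSH from admin IP"),
        tcp(443, "0.0.0.0/0", "HTTPS to Nginx"),
    ]


rules = ingress_rules("203.0.113.10/32")  # placeholder IP, replace with your own

# To apply (requires boto3 and AWS credentials; the group ID is a placeholder):
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(
#     GroupId="sg-xxxxxxxx", IpPermissions=rules
# )

for rule in rules:
    print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

Port 8000 is deliberately absent, so vLLM is reachable only through Nginx. New security groups allow all outbound traffic by default, so the outbound rule usually needs no extra call.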
Cost Optimization
Use Spot Instances — Up to 70% cheaper than on-demand:
# Request spot instance via AWS CLI
aws ec2 request-spot-instances \
--instance-count 1 \
--type "one-time" \
--launch-specification '{
"ImageId": "ami-xxxxx",
"InstanceType": "g4dn.xlarge",
"KeyName": "your-key"
}' \
--spot-price "0.20"
But Spot Instances can be interrupted with only a two-minute notice. For production, run g4dn.xlarge on-demand and cover it with a 1-year Reserved Instance (~40% discount).
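If you do run on Spot, the instance can watch for the interruption notice through the EC2 instance metadata service and drain requests before shutdown. The sketch below polls the `spot/instance-action` metadata path using an IMDSv2 token; what you do on interruption (stopping the service, deregistering from a load balancer) is left out, and the sample document is illustrative only.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str):
    """Return (action, time) from the spot/instance-action document."""
    doc = json.loads(body)
    return doc["action"], doc["time"]


def check_interruption(timeout: float = 2.0):
    """Return (action, time) if an interruption is scheduled, else None. Only works on EC2."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        body = urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued
            return None
        raise
    return parse_instance_action(body)


# Illustrative notice body; on a real instance call check_interruption() in a loop
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
print(parse_instance_action(sample))
```

Polling every few seconds from a small sidecar script is usually enough, since the notice arrives about two minutes before the instance is reclaimed.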
Model caching — Store the model on EBS, not EFS. EFS is too slow for model loading.
Performance Numbers (g4dn.xlarge, AWQ)
| Metric | Value |
|---|---|
| Tokens/second | ~40–60 tok/s |
| Time to first token | ~500ms |
| Concurrent requests | 4–8 |
| Monthly cost | ~$380 (on-demand) |
For higher throughput, scale horizontally behind a load balancer or use g5.2xlarge.
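Self-hosting Mistral 7B on EC2 gives you a flat monthly cost regardless of request volume, up to the instance's throughput limit. At $0.53/hour, a g4dn.xlarge costs about $12.70/day, so the break-even against a hosted API such as OpenAI GPT-3.5 is that daily cost divided by the API's per-token price. The 500K tokens/day figure holds only at roughly $25 per million tokens; at lower API prices the break-even volume is proportionally higher, so recompute it with current pricing.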
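The break-even point is the daily instance cost divided by the API's per-token price. The sketch below works this out for a g4dn.xlarge at $0.53/hour; the API prices in the loop are hypothetical placeholders, not quotes for any provider, so substitute the current list price you are comparing against.

```python
def breakeven_tokens_per_day(hourly_usd: float, api_usd_per_million: float) -> float:
    """Daily token volume at which a 24/7 instance costs the same as a per-token API."""
    daily_cost = hourly_usd * 24
    return daily_cost / api_usd_per_million * 1_000_000


# Hypothetical API prices in USD per million tokens
for price in (0.50, 2.00, 25.00):
    tokens = breakeven_tokens_per_day(0.53, price)
    print(f"${price:>5.2f}/M tokens -> break-even at {tokens / 1e6:.1f}M tokens/day")
```

Compare the result against what one instance can actually generate per day; if the break-even volume exceeds that capacity, the hosted API is cheaper at any load a single instance can serve.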
Store your model on Amazon EBS gp3 volumes — 3000 IOPS baseline, faster model loading than gp2 at the same price.