
How to Deploy Mistral 7B on AWS EC2 — Production Guide 2026

Step-by-step guide to deploying Mistral 7B on AWS EC2 for production use. Covers instance selection, quantization, serving with vLLM, and cost optimization.

DevOpsBoys · May 28, 2026 · 3 min read

Mistral 7B is one of the strongest open-weight LLMs you can self-host at its size. In Mistral's published evaluations it outperforms Llama 2 13B with roughly half the parameters. Deploying it on AWS EC2 gives you full control, no per-token costs, and data privacy.

Here's how to do it properly.


Instance Selection

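Mistral 7B in float16 needs ~14.5GB of VRAM for the weights alone. With 4-bit quantization (GPTQ/AWQ), the weights fit in ~4–5GB. In both cases the KV cache needs additional memory on top, and that amount grows with context length and the number of concurrent requests.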
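
The figures above can be reproduced with a little arithmetic. The sketch below is a rough estimate, not a measurement: the parameter count (~7.24 billion) and the attention layout (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Mistral 7B configuration, and the function names are illustrative.

```python
# Rough VRAM estimate for Mistral 7B. All constants are assumptions taken from
# the published model configuration, not measurements on a running server.
N_PARAMS = 7.24e9   # approximate parameter count
N_LAYERS = 32       # transformer layers
N_KV_HEADS = 8      # grouped-query attention: 8 key/value heads
HEAD_DIM = 128      # dimension per attention head


def weight_gb(bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return N_PARAMS * bits_per_param / 8 / 1e9


def kv_cache_gb(tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for a KV cache holding `tokens` tokens (fp16 values by default)."""
    # Factor 2 accounts for storing both keys and values in every layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token / 1e9


if __name__ == "__main__":
    print(f"fp16 weights:  {weight_gb(16):.1f} GB")
    print(f"4-bit weights: {weight_gb(4):.1f} GB (before quantization scales)")
    print(f"KV cache for 8192 tokens: {kv_cache_gb(8192):.2f} GB")
```

Real usage is higher than the sum of these numbers because of activation buffers, CUDA context, and quantization metadata, so treat the result as a lower bound when picking an instance.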

Recommended instances:

| Instance | GPU | VRAM | Cost/hr | Use Case |
|---|---|---|---|---|
| g4dn.xlarge | T4 | 16GB | ~$0.53 | Dev/testing, 4-bit quant |
| g4dn.2xlarge | T4 | 16GB | ~$0.75 | Low-traffic production |
| g5.xlarge | A10G | 24GB | ~$1.01 | Full precision, better throughput |
| g5.2xlarge | A10G | 24GB | ~$1.21 | Medium traffic production |

For cost-sensitive workloads, use g4dn.xlarge with 4-bit quantization. It runs Mistral 7B comfortably at roughly $380/month on-demand.


Setup: Launch EC2 Instance

bash
# Use Deep Learning AMI (comes with CUDA, PyTorch pre-installed)
# AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch (Ubuntu 22.04)
# Look for it in Community AMIs
 
# After launch, SSH in
ssh -i your-key.pem ubuntu@<ec2-ip>
 
# Verify GPU
nvidia-smi
# Should show T4 or A10G with ~16GB or 24GB VRAM

Install vLLM (Best Serving Framework)

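vLLM is one of the fastest open-source LLM inference engines. It uses PagedAttention to manage the KV cache in fixed-size blocks, which the vLLM project reports gives up to 24x higher throughput than serving with plain HuggingFace Transformers.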

bash
# Update system
sudo apt update && sudo apt upgrade -y
 
# Install vLLM
pip install vllm
 
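# Optional: only needed if you want to quantize a model yourself.
# vLLM loads pre-quantized AWQ/GPTQ checkpoints without these packages.
pip install autoawq  # AWQ quantization
# or
pip install auto-gptq  # GPTQ quantization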

Deploy Mistral 7B

Option 1: Full Precision (g5.xlarge or larger)

bash
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Option 2: 4-bit Quantization (g4dn.xlarge)

bash
# Use the AWQ-quantized model (weights take ~4GB of VRAM).
# --dtype half: the T4 does not support bfloat16.
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --dtype half \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

The first run downloads the model from HuggingFace (~4GB for AWQ). Set HF_HOME to a directory on a larger EBS volume, for example HF_HOME=/mnt/data/hf_cache, so the download does not fill the root disk.
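
Before the first launch it can be worth checking that the cache location actually has room for the download. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the 10 GB margin is an assumed safety buffer rather than a measured requirement.

```python
import os
import shutil
from pathlib import Path


def cache_dir() -> Path:
    """Return the Hugging Face cache root: HF_HOME if set, else the default."""
    return Path(os.environ.get("HF_HOME", "~/.cache/huggingface")).expanduser()


def has_room(path: Path, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gb` free."""
    probe = path
    # Walk up to the nearest existing directory; the cache may not exist yet.
    while not probe.exists():
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    target = cache_dir()
    # 10 GB is an assumed margin over the ~4 GB AWQ download.
    print(target, "OK" if has_room(target, needed_gb=10) else "low on space")
```

Run it in the same shell (or with the same environment) as the vLLM server so that both see the same HF_HOME.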


Test the API

vLLM serves an OpenAI-compatible API:

bash
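# Basic test. The "model" field must match the --model value the server was
# started with (use TheBloke/Mistral-7B-Instruct-v0.2-AWQ for Option 2).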
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

Python client:

python
from openai import OpenAI
 
client = OpenAI(
    api_key="not-needed",
    base_url="http://your-ec2-ip:8000/v1"
)
 
response = client.chat.completions.create(
    # Must match the --model value passed to vLLM
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "user", "content": "Write a Dockerfile for a Node.js app"}
    ]
)
print(response.choices[0].message.content)
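
Model loading can take a while after the process starts, so requests sent too early fail with connection errors. One way to handle this is to poll the OpenAI-compatible `/v1/models` endpoint until it answers, which also shows the exact model ID to put in requests. This is a sketch under assumptions: the function names, the 600-second timeout, and the 5-second interval are illustrative choices, not vLLM defaults.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_json(url: str) -> dict:
    """GET `url` and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_model(base_url, timeout_s=600, interval_s=5,
                   fetch=fetch_json, sleep=time.sleep):
    """Poll /v1/models until the server answers; return the served model IDs."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            body = fetch(f"{base_url}/v1/models")
            return [m["id"] for m in body["data"]]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server not accepting connections yet; give up once past the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{base_url} not ready after {timeout_s}s")
            sleep(interval_s)


if __name__ == "__main__":
    print(wait_for_model("http://localhost:8000"))
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.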

Production Setup

1. Systemd Service (auto-restart on crash)

bash
sudo nano /etc/systemd/system/mistral.service

ini
[Unit]
Description=Mistral 7B vLLM Server
After=network.target
 
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
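# Point ExecStart at the interpreter that has vLLM installed (check with `which python3`).
# Binding to 127.0.0.1 keeps port 8000 private; Nginx (step 2) is the public entry point.
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --quantization awq \
    --dtype half \
    --host 127.0.0.1 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85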
Restart=always
RestartSec=10
Environment=HF_HOME=/mnt/data/hf_cache
 
[Install]
WantedBy=multi-user.target

bash
sudo systemctl daemon-reload
sudo systemctl enable mistral
sudo systemctl start mistral
sudo systemctl status mistral

2. Nginx Reverse Proxy + API Key Auth

nginx
# /etc/nginx/sites-available/mistral
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;
 
    ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;
 
    location / {
        # Simple API key check
        if ($http_authorization != "Bearer your-secret-key") {
            return 401;
        }
        
        proxy_pass http://127.0.0.1:8000;
        proxy_read_timeout 120s;
        proxy_buffering off;
    }
}
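
Once the proxy is in place, clients have to call the HTTPS endpoint and send the key as a Bearer token, since that is the exact string the `if` check compares against. The sketch below builds such a request with the standard library only; `BASE_URL` and `API_KEY` mirror the placeholder values in the config above, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Placeholder values mirroring the Nginx config above; replace with your own.
BASE_URL = "https://llm.yourcompany.com/v1"
API_KEY = "your-secret-key"


def build_chat_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a POST to /chat/completions carrying the Bearer key Nginx checks."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # must match the Nginx check exactly
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("Explain Kubernetes in 3 sentences",
                             model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

With the `openai` Python client the equivalent is to set `base_url` to the HTTPS address and `api_key` to the secret, because that client sends the key in the same `Authorization: Bearer` header.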

3. EC2 Security Group

Inbound:
- Port 22 (SSH) — your IP only
- Port 443 (HTTPS) — 0.0.0.0/0

Outbound:
- All traffic (for HuggingFace downloads)

Cost Optimization

Use Spot Instances — Up to 70% cheaper than on-demand:

bash
# Request spot instance via AWS CLI
aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-xxxxx",
    "InstanceType": "g4dn.xlarge",
    "KeyName": "your-key"
  }' \
  --spot-price "0.20"

But spot instances can be interrupted, with only a two-minute warning. For production, run g4dn.xlarge on-demand and cover it with a 1-year Reserved Instance (~40% discount).
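
If you do run on spot, the instance can watch for the interruption notice itself and drain traffic before shutdown. The sketch below polls the EC2 instance metadata service (IMDSv2) for `spot/instance-action`, which returns 404 until an interruption is scheduled. The function names and the 5-second polling interval are assumptions, and what to do on a notice is left as a comment because it depends on your setup.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def imds_get(path: str) -> Optional[str]:
    """Fetch an instance-metadata path via IMDSv2; None if it returns 404."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def interruption_notice(get=imds_get) -> Optional[dict]:
    """Return the parsed notice once AWS schedules an interruption, else None."""
    body = get("spot/instance-action")
    return json.loads(body) if body else None


if __name__ == "__main__":
    import time
    while (notice := interruption_notice()) is None:
        time.sleep(5)
    # Deployment-specific: e.g. deregister from the load balancer, stop the service.
    print("Spot interruption scheduled:", notice)
```

This only works from inside an EC2 instance, since the metadata address is link-local.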

Model caching — Store the model on EBS, not EFS. EFS is too slow for model loading.


Performance Numbers (g4dn.xlarge, AWQ)

| Metric | Value |
|---|---|
| Tokens/second | ~40–60 tok/s |
| Time to first token | ~500ms |
| Concurrent requests | 4–8 |
| Monthly cost | ~$380 (on-demand) |

For higher throughput, scale horizontally behind a load balancer or use g5.2xlarge.


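Self-hosting Mistral 7B on EC2 gives you a flat monthly cost regardless of request volume, up to the throughput the instance can sustain. At $0.53/hour on g4dn.xlarge (about $12.70/day), the break-even point against a hosted API such as OpenAI GPT-3.5 is the daily instance cost divided by the API's per-token price. At $1 per million tokens, for example, that is roughly 12.7 million tokens/day, so check current API pricing against your real volume before committing.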
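
The break-even comparison is simple enough to script. In the sketch below the hourly rate comes from the instance table, while the API prices are placeholders chosen for illustration, not quotes from any provider; substitute the current rate you would actually pay.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(hourly_usd: float) -> float:
    """On-demand cost of running one instance for a full month."""
    return hourly_usd * HOURS_PER_MONTH


def break_even_tokens_per_day(hourly_usd: float, api_usd_per_million: float) -> float:
    """Daily token volume at which the instance costs the same as a per-token API."""
    daily_cost = hourly_usd * 24
    return daily_cost / api_usd_per_million * 1e6


if __name__ == "__main__":
    print(f"g4dn.xlarge: ${monthly_cost(0.53):.0f}/month")
    # API prices below are placeholders, not quotes; substitute the current rate.
    for price in (0.5, 1.0, 2.0):
        tokens = break_even_tokens_per_day(0.53, price)
        print(f"${price:.2f}/M tokens: break-even at {tokens / 1e6:.1f}M tokens/day")
```

Below the break-even volume the hosted API is cheaper; above it, the flat instance cost wins, provided a single instance can actually serve that many tokens per day.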

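Store your model on Amazon EBS gp3 volumes: they provide a 3,000 IOPS and 125 MB/s baseline regardless of volume size and cost less per GB than gp2. If model loading is still slow, provision extra gp3 throughput rather than a larger volume.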
