How to Deploy Mistral 7B on AWS EC2 — Production Guide 2026

Step-by-step guide to deploying Mistral 7B on AWS EC2 for production use. Covers instance selection, quantization, serving with vLLM, and cost optimization.

Mistral 7B is a strong open-weight LLM for self-hosting. Mistral AI reports that it outperforms Llama 2 13B on the benchmarks it evaluated, at roughly half the parameter count. Deploying it on AWS EC2 gives you full control over the stack, no per-token fees, and data that stays inside your own infrastructure.

Here's how to do it properly.

Instance Selection

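Mistral 7B's weights take ~14GB of VRAM in float16. With 4-bit quantization (GPTQ/AWQ), they fit in ~4–5GB. Budget extra VRAM on top of the weights for the KV cache, which grows with context length and the number of concurrent requests.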
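
These figures follow from simple arithmetic on the parameter count. The sketch below assumes roughly 7.24 billion parameters and the published Mistral 7B architecture (32 layers, 8 key/value heads, head dimension 128); none of these numbers come from this guide, the function names are illustrative, and real usage adds activation and runtime overhead on top.

```python
# Back-of-the-envelope VRAM estimate for Mistral 7B.
# Assumptions (not from this guide): ~7.24e9 parameters, 32 layers,
# 8 KV heads (grouped-query attention), head dimension 128.

PARAMS = 7.24e9
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
GB = 1e9


def weight_gb(bits_per_param: float) -> float:
    """Memory for the weights alone, in GB."""
    return PARAMS * bits_per_param / 8 / GB


def kv_cache_gb(tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache memory for `tokens` cached tokens (keys + values, all layers)."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token / GB


print(f"float16 weights: {weight_gb(16):.1f} GB")
print(f"4-bit weights:   {weight_gb(4):.1f} GB (before quantization metadata)")
print(f"KV cache for 8192 tokens in float16: {kv_cache_gb(8192):.2f} GB")
```

The KV-cache term is why a 16GB T4 is tight for float16 even though the weights nominally fit: whatever is left after the weights is all the room the server has for in-flight requests.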

Recommended instances:

Instance	GPU	VRAM	Cost/hr (on-demand, us-east-1)	Use Case
g4dn.xlarge	T4	16GB	~$0.53	Dev/testing, 4-bit quant
g4dn.2xlarge	T4	16GB	~$0.75	Low-traffic production
g5.xlarge	A10G	24GB	~$1.01	Full float16, better throughput
g5.2xlarge	A10G	24GB	~$1.21	Medium traffic production

For cost-sensitive workloads, use g4dn.xlarge with 4-bit quantization. It runs Mistral 7B comfortably for ~$380/month on-demand (~$0.53/hr over a ~730-hour month).

Setup: Launch EC2 Instance

bash

# Use Deep Learning AMI (comes with CUDA, PyTorch pre-installed)
# AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch (Ubuntu 22.04)
# Look for it in Community AMIs
 
# After launch, SSH in
ssh -i your-key.pem ubuntu@<ec2-ip>
 
# Verify GPU
nvidia-smi
# Should show T4 or A10G with ~16GB or 24GB VRAM
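
If you want to script this check (for example in a provisioning step), a minimal sketch could wrap `nvidia-smi`'s CSV query mode. The helper names below are hypothetical, and the 15000 MiB threshold in the usage note is only an illustrative value for a T4.

```python
import subprocess


def parse_gpu_query(output: str) -> list[tuple[str, int]]:
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        name, mem = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_query(result.stdout)


def has_enough_vram(gpus: list[tuple[str, int]], min_mib: int) -> bool:
    """True if at least one GPU reports `min_mib` MiB or more."""
    return any(mem >= min_mib for _, mem in gpus)
```

On the instance, `has_enough_vram(query_gpus(), 15000)` should be true for a T4 and `has_enough_vram(query_gpus(), 22000)` for an A10G.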

Install vLLM (Best Serving Framework)

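vLLM is one of the fastest open-source LLM inference engines. Its PagedAttention algorithm manages the KV cache in fixed-size blocks, which cuts memory waste and allows larger batches. The vLLM team reports up to 24x higher throughput than Hugging Face Transformers.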
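
PagedAttention allocates KV-cache memory in small fixed-size blocks on demand instead of reserving one contiguous, maximum-length region per request. The toy model below only counts reserved token slots under each scheme to show where the saving comes from; it is not vLLM's allocator, and the block size of 16 and the example sequence lengths are assumptions for illustration.

```python
import math


def reserved_slots_contiguous(seq_lens: list[int], max_len: int) -> int:
    """Naive scheme: every request reserves room for max_len tokens up front."""
    return len(seq_lens) * max_len


def reserved_slots_paged(seq_lens: list[int], block_size: int = 16) -> int:
    """Paged scheme: each request holds only the blocks it has actually filled."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 900, 35]  # tokens currently held by three in-flight requests
naive = reserved_slots_contiguous(seq_lens, max_len=4096)
paged = reserved_slots_paged(seq_lens)
print(f"contiguous: {naive} slots, paged: {paged} slots")
```

The memory freed this way is what lets the server batch more requests on the same GPU.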

bash

# Update system
sudo apt update && sudo apt upgrade -y
 
# Install vLLM
pip install vllm
 
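# Optional: only needed if you want to quantize a model yourself.
# vLLM loads pre-quantized AWQ/GPTQ checkpoints without these packages.
pip install autoawq  # AWQ quantization
# or
pip install auto-gptq  # GPTQ quantization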

Deploy Mistral 7B

Option 1: Full float16 (g5.xlarge or larger)

bash

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Option 2: 4-bit Quantization (g4dn.xlarge)

bash

# Use an AWQ-quantized model (weights take ~4GB of VRAM)
# --dtype half: the T4 has no bfloat16 support
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --dtype half \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

The first run downloads the model from Hugging Face (~4GB for AWQ). Set HF_HOME=/mnt/data/hf_cache to store it on a larger EBS volume.
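
Because of that download and the model load, the server can take a few minutes before it accepts requests. A minimal readiness check, assuming the OpenAI-compatible server's `/v1/models` route and using only the standard library, might look like this (the function name and default timings are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/v1/models"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet, or still loading the model
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready("http://localhost:8000")` from a deploy script lets later steps, such as a smoke test, run only once the model is loaded.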

Test the API

vLLM serves an OpenAI-compatible API:

bash

# Basic test ("model" must match the --model value the server was started with)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in 3 sentences"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

Python client:

python

from openai import OpenAI
 
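client = OpenAI(
    api_key="not-needed",  # vLLM ignores the key unless it was started with --api-key
    base_url="http://your-ec2-ip:8000/v1"  # direct access; port 8000 must be reachable from the client
)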
 
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "user", "content": "Write a Dockerfile for a Node.js app"}
    ]
)
print(response.choices[0].message.content)
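
For interactive use you will usually want tokens as they are generated rather than one final response. The sketch below uses the OpenAI client's `stream=True` option; `iter_text` and `stream_chat` are hypothetical helper names, and the `openai` import is kept inside the function so the pure helper can be used without the package installed.

```python
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragments from a stream of chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


def stream_chat(base_url: str, model: str, prompt: str) -> str:
    """Print a reply as it is generated and return the full text."""
    from openai import OpenAI  # lazy import; requires `pip install openai`

    client = OpenAI(api_key="not-needed", base_url=base_url)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for text in iter_text(stream):
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts)
```

For example: `stream_chat("http://your-ec2-ip:8000/v1", "mistralai/Mistral-7B-Instruct-v0.3", "Write a Dockerfile for a Node.js app")`.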

Production Setup

1. Systemd Service (auto-restart on crash)

bash

sudo nano /etc/systemd/system/mistral.service

ini

[Unit]
Description=Mistral 7B vLLM Server
After=network.target
 
[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
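# Point ExecStart at the interpreter that has vLLM installed (check with `which python3`)
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
    --quantization awq \
    --dtype half \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85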
Restart=always
RestartSec=10
Environment=HF_HOME=/mnt/data/hf_cache
 
[Install]
WantedBy=multi-user.target

bash

sudo systemctl daemon-reload
sudo systemctl enable mistral
sudo systemctl start mistral
sudo systemctl status mistral

2. Nginx Reverse Proxy + API Key Auth

nginx

# /etc/nginx/sites-available/mistral
server {
    listen 443 ssl;
    server_name llm.yourcompany.com;
 
    ssl_certificate /etc/letsencrypt/live/llm.yourcompany.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.yourcompany.com/privkey.pem;
 
    location / {
        # Simple API key check
        if ($http_authorization != "Bearer your-secret-key") {
            return 401;
        }
        
        proxy_pass http://127.0.0.1:8000;
        proxy_read_timeout 120s;
        proxy_buffering off;
    }
}
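
With this proxy in place, clients have to send the key as a bearer token over HTTPS. A standard-library sketch of such a request follows; the hostname and key are the placeholders from the config above, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions carrying the bearer token."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Must match the string nginx compares against, byte for byte.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

The OpenAI Python client sends its `api_key` in the same `Authorization: Bearer ...` header, so `OpenAI(api_key="your-secret-key", base_url="https://llm.yourcompany.com/v1")` should pass the same check.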

3. EC2 Security Group

Inbound:
- Port 22 (SSH) — your IP only
- Port 443 (HTTPS) — 0.0.0.0/0

Outbound:
- All traffic (for Hugging Face downloads)
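
The inbound rules can also be applied from code. The sketch below builds the rule set as plain data and passes it to boto3's `authorize_security_group_ingress`; the helper names, the group ID, and the example CIDR are placeholders, and it assumes boto3 is installed with AWS credentials configured. Newly created security groups already allow all outbound traffic, so only inbound rules are set here.

```python
def build_ingress_rules(admin_cidr: str) -> list[dict]:
    """Inbound rules: SSH from one address range, HTTPS from anywhere."""
    return [
        {
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": admin_cidr,
                          "Description": "SSH from admin IP"}],
        },
        {
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0",
                          "Description": "HTTPS"}],
        },
    ]


def apply_ingress_rules(group_id: str, admin_cidr: str) -> None:
    """Attach the rules to an existing security group."""
    import boto3  # lazy import; requires `pip install boto3` and AWS credentials

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=build_ingress_rules(admin_cidr),
    )
```

Note that port 8000 is deliberately absent: with the Nginx proxy in front, vLLM should only be reachable through port 443.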

Cost Optimization

Use Spot Instances — Up to 70% cheaper than on-demand:

bash

# Request spot instance via AWS CLI
aws ec2 request-spot-instances \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-xxxxx",
    "InstanceType": "g4dn.xlarge",
    "KeyName": "your-key"
  }' \
  --spot-price "0.20"

But Spot Instances can be interrupted with only two minutes' notice. For production, run g4dn.xlarge on-demand and cover it with a 1-year Reserved Instance (~40% discount).
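
To compare the options, the monthly figures can be worked out from the hourly rate. This sketch uses the guide's ~$0.53/hr rate and its approximate 40% and 70% discounts; the 730-hour month is an averaging assumption, and actual Spot and Reserved prices vary by region and over time.

```python
HOURS_PER_MONTH = 730  # average hours in a month (assumption)


def monthly_cost(hourly_rate: float, discount: float = 0.0) -> float:
    """Monthly cost in dollars for an always-on instance."""
    return hourly_rate * (1.0 - discount) * HOURS_PER_MONTH


on_demand = monthly_cost(0.53)
reserved_1yr = monthly_cost(0.53, discount=0.40)    # ~40% Reserved Instance discount
spot_best_case = monthly_cost(0.53, discount=0.70)  # "up to 70%" Spot saving

print(f"on-demand:      ${on_demand:,.0f}/month")
print(f"1-yr reserved:  ${reserved_1yr:,.0f}/month")
print(f"spot (best):    ${spot_best_case:,.0f}/month")
```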

Model caching — Store the model on EBS, not EFS. EFS is too slow for model loading.

Performance Numbers (g4dn.xlarge, AWQ)

Metric	Value
Tokens/second	~40–60 tok/s
Time to first token	~500ms
Concurrent requests	4–8
Monthly cost	~$380 (on-demand)

For higher throughput, scale horizontally behind a load balancer or use g5.2xlarge.
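
Throughput depends heavily on prompt length, output length, and concurrency, so it is worth measuring on your own workload. A rough single-request measurement, assuming the server's response includes the usual `usage.completion_tokens` field, could look like the following (helper names are hypothetical, and the timing includes prompt processing, so it understates pure decode speed):

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url: str, model: str, prompt: str,
            max_tokens: int = 200) -> float:
    """Time one chat completion and return its end-to-end tokens/second."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=120) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"],
                             time.perf_counter() - start)
```

Run it several times, and from several clients in parallel, before comparing against the table above.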

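Self-hosting Mistral 7B on EC2 gives you as many requests as the instance can serve for a flat monthly cost. At $0.53/hour, g4dn.xlarge costs about $12.72/day, so the break-even point against a pay-per-token API such as OpenAI GPT-3.5 is $12.72 divided by that API's price per token. At an illustrative $1 per million tokens, that is about 12.7 million tokens/day; check current API pricing before deciding.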
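
A small sketch for recomputing the break-even point with your own numbers follows. The per-million-token price is an input you supply from the API provider's current price list; nothing here assumes a specific vendor price.

```python
def break_even_tokens_per_day(hourly_rate: float,
                              api_price_per_million: float) -> float:
    """Daily token volume at which a flat-rate instance matches per-token billing."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    daily_cost = hourly_rate * 24
    return daily_cost / api_price_per_million * 1_000_000
```

Calling `break_even_tokens_per_day(0.53, price)` with the per-million-token price of the API you are comparing against gives the daily volume above which the instance is cheaper, provided the instance can actually sustain that volume.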

Store your model on Amazon EBS gp3 volumes: they provide a baseline of 3,000 IOPS and 125 MB/s regardless of volume size, at a lower per-GB price than gp2.
