Deploying Multi-Modal LLMs with Vision in Production: A Practical Guide
Serving vision-enabled LLMs (Claude Vision, GPT-4V) in production requires different patterns from text-only models. Here's how to handle images, latency, and cost at scale.
Vision-enabled LLMs unlock a new class of applications — infrastructure diagram analysis, screenshot-based debugging, document parsing — but they introduce complexities that text-only deployments don't have.
What's Different About Vision Workloads
Image preprocessing: You need to resize, compress, and encode images before sending them to the API. An unresized 4K screenshot carries several times more pixels than the model can use, which inflates payload size, latency, and (depending on the provider) token cost.
Latency: Vision requests are typically 2-5x slower than text-only requests because the provider has to tokenize the image.
Cost: Images are converted to tokens and billed as input, and they can dominate cost. A single high-resolution image can cost as many tokens as a prompt of a thousand words or more.
Async patterns: Users shouldn't wait 10+ seconds for a screenshot to be analyzed. Background processing with polling or webhooks is often better UX.
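To make the preprocessing and cost points concrete, the sketch below estimates image tokens before and after resizing. It assumes the approximation Anthropic documents for Claude (tokens ≈ width × height / 750) and a 1568 px long-edge target; other providers count image tokens differently, and the function names are illustrative rather than part of any SDK.

```python
def fit_within(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # never upscale
    # Integer arithmetic preserves the aspect ratio without float rounding surprises
    return (
        max(1, width * max_edge // longest),
        max(1, height * max_edge // longest),
    )


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate using the assumed width * height / 750 rule."""
    return round(width * height / 750)


raw = estimate_image_tokens(3840, 2160)                   # 4K screenshot as captured
resized = estimate_image_tokens(*fit_within(3840, 2160))  # after fitting into 1568 px
print(raw, resized)
```

Some providers downscale oversized images on their side, in which case the raw figure overstates what is billed, but the larger upload and the extra latency still apply.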
Optimizing Images Before Sending
import base64
import json
from io import BytesIO

from anthropic import Anthropic
from PIL import Image

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def optimize_image_for_llm(
    image_path: str,
    max_width: int = 1568,
    max_height: int = 1568,
    quality: int = 85,
    format: str = "JPEG",
) -> tuple[str, str]:
    """
    Resize and compress an image for efficient LLM processing.
    Returns (base64_data, media_type).
    """
    with Image.open(image_path) as img:
        # JPEG has no alpha channel or palette, so convert modes such as RGBA, LA, and P
        if format.upper() == "JPEG" and img.mode not in ("RGB", "L"):
            img = img.convert("RGB")
        # Shrink if too large (thumbnail preserves aspect ratio and never upscales)
        img.thumbnail((max_width, max_height), Image.LANCZOS)
        # Compress into an in-memory buffer
        buffer = BytesIO()
        img.save(buffer, format=format, quality=quality, optimize=True)
    image_data = base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
    media_type = f"image/{format.lower()}"
    return image_data, media_type


def analyze_infrastructure_diagram(image_path: str) -> dict:
    """Analyze an infrastructure diagram or architecture screenshot."""
    image_data, media_type = optimize_image_for_llm(image_path)
    message = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": """Analyze this infrastructure diagram and provide:
1. Components identified (services, databases, queues, etc.)
2. Data flow description
3. Potential single points of failure
4. Security concerns
5. Scalability observations
Format as JSON with these exact keys: components, data_flow, spof, security_concerns, scalability_notes""",
                    },
                ],
            }
        ],
    )
    # The model may wrap the JSON in prose, so extract the outermost {...} span
    response_text = message.content[0].text
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object found in model response")
    return json.loads(response_text[start:end])

URL-Based Images (Cheaper for Public Images)
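Instead of uploading base64, use URLs for public images. This avoids the roughly 33% payload inflation that base64 encoding adds. The image is still tokenized the same way, so the savings are in request size and upload time rather than in token cost: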
def analyze_public_screenshot(image_url: str, question: str) -> str:
    """Analyze a publicly accessible image via URL."""
    message = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "url",
                            "url": image_url,
                        },
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    return message.content[0].text

Note: URL-based images must be publicly accessible. Use base64 for private/internal images.
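One way to keep the URL-versus-base64 choice in a single place is a small helper that builds the `source` dict for either case. This is a sketch: `build_image_source` and `base64_overhead` are hypothetical names, and the helper only checks that a string looks like an http(s) URL. It cannot verify that the URL is actually reachable from the provider's side.

```python
import base64
from typing import Union
from urllib.parse import urlparse


def build_image_source(ref: Union[str, bytes], media_type: str = "image/jpeg") -> dict:
    """Return a `source` dict: URL form for http(s) strings, base64 form for raw bytes."""
    if isinstance(ref, str):
        parsed = urlparse(ref)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return {"type": "url", "url": ref}
        raise ValueError("string references must be http(s) URLs")
    # Raw bytes (for example a private screenshot) are sent inline as base64
    data = base64.standard_b64encode(ref).decode("utf-8")
    return {"type": "base64", "media_type": media_type, "data": data}


def base64_overhead(num_bytes: int) -> float:
    """Ratio of base64-encoded length to raw length (padding included)."""
    encoded_len = 4 * ((num_bytes + 2) // 3)
    return encoded_len / num_bytes


print(build_image_source("https://example.com/diagram.png"))
print(round(base64_overhead(3_000_000), 3))
```

The overhead ratio is where the "about a third larger" figure for base64 payloads comes from: every 3 raw bytes become 4 encoded characters.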
Production Pattern: Async Vision Processing
Don't make users wait several seconds on a synchronous vision API call. Use a queue:
# tasks.py (using Celery + Redis)
import json

import redis
from anthropic import Anthropic
from celery import Celery

celery = Celery("vision_tasks", broker="redis://redis:6379/0")
client = Anthropic()
redis_client = redis.Redis(host="redis", port=6379)


@celery.task(bind=True, max_retries=3)
def analyze_screenshot_async(self, task_id: str, image_b64: str, prompt: str):
    """Process screenshot analysis in background."""
    try:
        message = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=1500,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }],
        )
        result = {
            "task_id": task_id,
            "status": "complete",
            "result": message.content[0].text,
            "input_tokens": message.usage.input_tokens,
            "output_tokens": message.usage.output_tokens,
        }
        # Store result for 1 hour
        redis_client.setex(f"vision:{task_id}", 3600, json.dumps(result))
    except Exception as exc:
        # Report "retrying" until the last attempt fails, so pollers don't see a premature failure
        retries_left = self.request.retries < self.max_retries
        redis_client.setex(f"vision:{task_id}", 3600, json.dumps({
            "task_id": task_id,
            "status": "retrying" if retries_left else "failed",
            "error": str(exc),
        }))
        # Exponential backoff: 1s, 2s, 4s
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)


# API endpoints
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

api = FastAPI()


class VisionRequest(BaseModel):
    image_b64: str
    prompt: str


# Plain `def` endpoints: the Redis and Celery calls are blocking, so FastAPI runs them in its threadpool
@api.post("/vision/analyze")
def submit_analysis(request: VisionRequest):
    task_id = str(uuid.uuid4())
    # Record the task up front so unknown IDs can be told apart from pending ones
    redis_client.setex(f"vision:{task_id}", 3600, json.dumps({"task_id": task_id, "status": "queued"}))
    analyze_screenshot_async.delay(task_id, request.image_b64, request.prompt)
    return {"task_id": task_id, "status": "queued"}


@api.get("/vision/result/{task_id}")
def get_result(task_id: str):
    data = redis_client.get(f"vision:{task_id}")
    if not data:
        raise HTTPException(status_code=404, detail="Unknown or expired task_id")
    return json.loads(data)

Cost Control: Smart Model Routing
Use a cheaper model for simple image questions and a more expensive one for complex analysis:
def smart_vision_route(image_b64: str, prompt: str, complexity: str = "auto") -> str:
    """Route to appropriate model based on task complexity."""
    if complexity == "auto":
        # Simple keyword heuristic; substring matching is crude, so tune the list for your prompts
        complex_keywords = ["analyze", "identify all", "detailed", "architecture", "security"]
        complexity = "high" if any(k in prompt.lower() for k in complex_keywords) else "low"
    model = "claude-opus-4-8" if complexity == "high" else "claude-haiku-4-5-20251001"
    message = client.messages.create(
        model=model,
        max_tokens=1000 if complexity == "low" else 2000,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64},
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return message.content[0].text

Observability: Track Vision Request Metrics
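The cost tracking in this section reduces to per-million-token arithmetic, which can be checked on its own without calling the API. The sketch below uses placeholder prices and generic model labels (both are assumptions, not current list prices), and `estimate_cost_usd` is an illustrative name.

```python
# Placeholder prices in USD per 1M tokens; assumptions for illustration only
PRICES = {
    "large-model": {"input": 15.0, "output": 75.0},
    "small-model": {"input": 0.25, "output": 1.25},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, given the token counts reported in the API's usage data."""
    price = PRICES[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


# One image-heavy request: about 1,600 image tokens plus 100 prompt tokens in, 500 tokens out
large = estimate_cost_usd("large-model", 1_700, 500)
small = estimate_cost_usd("small-model", 1_700, 500)
print(f"large: ${large:.4f}  small: ${small:.6f}")
```

With these placeholder numbers the same request differs in cost by a large factor between the two tiers, which is the motivation for tracking cost per model label.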
import time

from prometheus_client import Counter, Histogram

vision_requests_total = Counter("llm_vision_requests_total", "Total vision requests", ["model", "status"])
vision_latency = Histogram(
    "llm_vision_latency_seconds", "Vision request latency", ["model"],
    buckets=(0.5, 1, 2, 5, 10, 20, 30, 60),  # the default buckets stop at 10s
)
vision_image_tokens = Histogram(
    "llm_vision_image_tokens", "Input tokens per vision request (image + text)", ["model"],
    buckets=(250, 500, 1000, 2000, 4000, 8000, 16000),  # the default buckets are sized for seconds, not tokens
)
vision_total_cost = Counter("llm_vision_total_cost_usd", "Total cost USD", ["model"])

# Placeholder prices in USD per 1M tokens; replace with the current rates for each model
TOKEN_COSTS = {
    "claude-opus-4-8": {"input": 15.0, "output": 75.0},
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
}


def tracked_vision_request(image_b64: str, prompt: str, model: str) -> str:
    start = time.perf_counter()
    status = "success"
    try:
        message = client.messages.create(
            model=model,
            max_tokens=1500,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                    {"type": "text", "text": prompt},
                ],
            }],
        )
        # Track tokens (usage.input_tokens covers the image and the text prompt together)
        input_tokens = message.usage.input_tokens
        output_tokens = message.usage.output_tokens
        vision_image_tokens.labels(model=model).observe(input_tokens)
        # Track cost
        cost = (input_tokens / 1_000_000 * TOKEN_COSTS[model]["input"] +
                output_tokens / 1_000_000 * TOKEN_COSTS[model]["output"])
        vision_total_cost.labels(model=model).inc(cost)
        return message.content[0].text
    except Exception:
        status = "error"
        raise
    finally:
        # Runs on both success and failure, so every request is counted and timed
        vision_requests_total.labels(model=model, status=status).inc()
        vision_latency.labels(model=model).observe(time.perf_counter() - start)

Kubernetes Deployment for Vision Service
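The manifest in this section passes MAX_IMAGE_SIZE_MB to the container. How the service enforces that limit is not shown in this guide; one possible approach, sketched here as an assumption, is to compute the decoded size from the base64 length and reject oversized payloads before they reach the queue. The function names are hypothetical, and the size calculation assumes standard padded base64 with no embedded whitespace.

```python
import os

# Same variable name as in the manifest; the default of 10 mirrors its value
MAX_IMAGE_SIZE_MB = int(os.environ.get("MAX_IMAGE_SIZE_MB", "10"))


def decoded_size_bytes(image_b64: str) -> int:
    """Size of the decoded payload, computed without actually decoding it."""
    padding = len(image_b64) - len(image_b64.rstrip("="))
    # Every 4 base64 characters carry 3 bytes; each '=' means one byte less
    return len(image_b64) * 3 // 4 - padding


def is_within_limit(image_b64: str, max_mb: int = MAX_IMAGE_SIZE_MB) -> bool:
    """True if the decoded image fits within max_mb mebibytes."""
    return decoded_size_bytes(image_b64) <= max_mb * 1024 * 1024


print(decoded_size_bytes("YWJj"), is_within_limit("YWJj"))
```

Checking the length first avoids allocating a second copy of a large image just to measure it, which matters given the memory limits set on the pod.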
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vision-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vision-service  # apps/v1 requires a selector; this label name is an assumption
  template:
    metadata:
      labels:
        app: vision-service  # must match the selector above
    spec:
      containers:
        - name: api
          image: vision-service:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"  # images in memory can be large
              cpu: "2000m"
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic_api_key
            - name: MAX_IMAGE_SIZE_MB
              value: "10"
            - name: CELERY_BROKER_URL
              value: "redis://redis-service:6379/0"

Vision is one of the fastest-growing LLM use cases in DevOps: diagram analysis, runbook screenshot parsing, and monitoring dashboard interpretation. Getting the async patterns right from the start pays off quickly.
Resources: Anthropic Vision guide | Celery docs