Deploying Multi-Modal LLMs with Vision in Production: A Practical Guide

Serving vision-enabled LLMs (Claude Vision, GPT-4V) in production requires different patterns from text-only models. Here's how to handle images, latency, and cost at scale.

Vision-enabled LLMs unlock a new class of applications — infrastructure diagram analysis, screenshot-based debugging, document parsing — but they introduce complexities that text-only deployments don't have.

What's Different About Vision Workloads

Image preprocessing: Resize, compress, and encode images before sending them to the API. A full 4K screenshot can cost roughly 10x more tokens than a downscaled or cropped version that answers the same question.

Latency: Vision requests are typically 2-5x slower than text-only requests, because the image has to be uploaded and tokenized on the provider side before generation starts.

Cost: Images are converted into tokens (image "patches") and billed as input tokens, and they can dominate cost. A single high-res image can cost more than a 2000-word prompt.

Async patterns: Users shouldn't wait 10+ seconds for a screenshot to be analyzed. Background processing with polling or webhooks is often better UX.
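
To make the cost point concrete, here is a rough sketch of an image-token estimator. It assumes the rule of thumb from Anthropic's vision documentation (tokens ≈ width × height / 750, with images whose long edge exceeds 1568 px downscaled first). The function name and the `max_edge` default are illustrative, and other providers count image tokens differently, so treat the numbers as estimates rather than a bill.

```python
def estimate_image_tokens(width: int, height: int, max_edge: int = 1568) -> int:
    """Rough token estimate for one image (assumed rule of thumb, not an exact bill)."""
    long_edge = max(width, height)
    if long_edge > max_edge:
        # Assumption: the provider downscales to max_edge, preserving aspect ratio
        scale = max_edge / long_edge
        width, height = round(width * scale), round(height * scale)
    return round(width * height / 750)


# Compare a full screenshot with a crop of just the relevant region
for name, (w, h) in {"4K screenshot": (3840, 2160), "400x300 crop": (400, 300)}.items():
    print(f"{name}: ~{estimate_image_tokens(w, h)} tokens")
```

Under these assumptions a full-resolution screenshot lands near the ceiling, while a crop of only the relevant region costs a small fraction of that.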

Optimizing Images Before Sending

python

import base64
from io import BytesIO
from PIL import Image
from anthropic import Anthropic
 
client = Anthropic()
 
def optimize_image_for_llm(
    image_path: str,
    max_width: int = 1568,
    max_height: int = 1568,
    quality: int = 85,
    format: str = "JPEG"
) -> tuple[str, str]:
    """
    Resize and compress image for efficient LLM processing.
    Returns (base64_data, media_type)
    """
    with Image.open(image_path) as img:
        # JPEG has no alpha channel or palette mode; convert anything else to RGB
        if format.upper() == "JPEG" and img.mode not in ("RGB", "L"):
            img = img.convert("RGB")
        
        # Resize if too large (preserving aspect ratio)
        img.thumbnail((max_width, max_height), Image.LANCZOS)
        
        # Compress
        buffer = BytesIO()
        img.save(buffer, format=format, quality=quality, optimize=True)
        buffer.seek(0)
        
        image_data = base64.standard_b64encode(buffer.read()).decode("utf-8")
        media_type = f"image/{format.lower()}"
        
        return image_data, media_type
 
 
def analyze_infrastructure_diagram(image_path: str) -> dict:
    """Analyze an infrastructure diagram or architecture screenshot."""
    image_data, media_type = optimize_image_for_llm(image_path)
    
    message = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=2000,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": """Analyze this infrastructure diagram and provide:
1. Components identified (services, databases, queues, etc.)
2. Data flow description
3. Potential single points of failure
4. Security concerns
5. Scalability observations
 
Format as JSON with these exact keys: components, data_flow, spof, security_concerns, scalability_notes"""
                    }
                ],
            }
        ],
    )
    
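    import json

    # Extract the outermost JSON object; the model may wrap it in prose or code fences
    response_text = message.content[0].text
    start = response_text.find("{")
    end = response_text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object found in model response")
    return json.loads(response_text[start:end])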
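
The prompt asks for "these exact keys", but nothing guarantees the model complies. A minimal sketch of a post-parse check follows; the helper names are hypothetical, and the key set is copied from the prompt above.

```python
EXPECTED_KEYS = {"components", "data_flow", "spof", "security_concerns", "scalability_notes"}


def missing_analysis_keys(result: dict) -> set[str]:
    """Return the expected keys that are absent from a parsed model response."""
    return EXPECTED_KEYS - set(result)


def require_complete_analysis(result: dict) -> dict:
    """Raise ValueError if the response lacks any key the prompt asked for."""
    missing = missing_analysis_keys(result)
    if missing:
        raise ValueError(f"Model response is missing keys: {sorted(missing)}")
    return result
```

Wrapping the return value of `analyze_infrastructure_diagram` in `require_complete_analysis` would turn a silently incomplete response into an error the caller can retry on.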

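URL-Based Images (Smaller Requests for Public Images)

Instead of uploading base64, use URLs for public images. This avoids the roughly 33% payload inflation of base64 encoding and keeps request bodies small; the image's token cost should be the same either way: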

python

def analyze_public_screenshot(image_url: str, question: str) -> str:
    """Analyze a publicly accessible image via URL."""
    message = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "url",
                            "url": image_url,
                        },
                    },
                    {"type": "text", "text": question}
                ],
            }
        ],
    )
    return message.content[0].text

Note: URL-based images must be publicly accessible. Use base64 for private/internal images.
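
To put a number on the base64 overhead, the sketch below computes the encoded size of a payload and checks it against the standard library. The 300 KB buffer is a stand-in for a compressed screenshot, not a measurement.

```python
import base64
import math


def base64_encoded_size(raw_bytes: int) -> int:
    """Size in bytes of standard padded base64 output for raw_bytes of input."""
    return 4 * math.ceil(raw_bytes / 3)


raw = bytes(300_000)  # stand-in for a ~300 KB JPEG
encoded = base64.standard_b64encode(raw)
print(f"{len(raw)} raw bytes -> {len(encoded)} base64 bytes "
      f"({len(encoded) / len(raw):.2f}x)")
```

Every three raw bytes become four encoded characters, so the request body grows by about a third before any JSON framing is added.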

Production Pattern: Async Vision Processing

Don't make users wait on a synchronous vision API call that can take ten seconds or more. Submit the work to a queue and let the client poll for the result:

python

# tasks.py (using Celery + Redis)
from celery import Celery
from anthropic import Anthropic
import redis
import json
 
celery = Celery("vision_tasks", broker="redis://redis:6379/0")
client = Anthropic()
redis_client = redis.Redis(host="redis", port=6379)
 
@celery.task(bind=True, max_retries=3)
def analyze_screenshot_async(self, task_id: str, image_b64: str, prompt: str):
    """Process screenshot analysis in background."""
    try:
        message = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=1500,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64,
                        },
                    },
                    {"type": "text", "text": prompt}
                ],
            }]
        )
        
        result = {
            "task_id": task_id,
            "status": "complete",
            "result": message.content[0].text,
            "input_tokens": message.usage.input_tokens,
            "output_tokens": message.usage.output_tokens,
        }
        
        # Store result for 1 hour
        redis_client.setex(f"vision:{task_id}", 3600, json.dumps(result))
        
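    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Retries exhausted: record the failure so pollers stop waiting
            redis_client.setex(f"vision:{task_id}", 3600, json.dumps({
                "task_id": task_id,
                "status": "failed",
                "error": str(exc),
            }))
            raise
        # Exponential backoff between attempts: 1s, 2s, 4s
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)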
 
 
# API endpoints
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uuid
 
api = FastAPI()
 
class VisionRequest(BaseModel):
    image_b64: str
    prompt: str
 
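# Plain `def` endpoints: the Redis client is blocking, so let FastAPI run them in its threadpool
@api.post("/vision/analyze")
def submit_analysis(request: VisionRequest):
    task_id = str(uuid.uuid4())
    # Record the queued state first so polling can tell "pending" from "unknown ID"
    redis_client.setex(f"vision:{task_id}", 3600, json.dumps({"task_id": task_id, "status": "queued"}))
    analyze_screenshot_async.delay(task_id, request.image_b64, request.prompt)
    return {"task_id": task_id, "status": "queued"}
 
@api.get("/vision/result/{task_id}")
def get_result(task_id: str):
    data = redis_client.get(f"vision:{task_id}")
    if data is None:
        # Never submitted, or the 1-hour TTL has expired
        raise HTTPException(status_code=404, detail="Unknown or expired task_id")
    return json.loads(data)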
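
On the client side, the result endpoint has to be polled. The following is a minimal sketch of a polling loop with capped exponential backoff. `poll_for_result` and its `fetch` parameter are hypothetical: `fetch` stands in for whatever issues the GET to `/vision/result/{task_id}` and returns the decoded JSON.

```python
import time
from typing import Callable


def poll_for_result(
    fetch: Callable[[str], dict],
    task_id: str,
    timeout: float = 60.0,
    initial_delay: float = 0.5,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task leaves the queued/processing states or the timeout expires."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        result = fetch(task_id)
        if result.get("status") not in ("queued", "processing"):
            return result  # "complete" or "failed"
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off to avoid hammering the API
```

The `sleep` parameter exists only so the loop can be exercised in tests without real waiting; in production the default `time.sleep` is fine.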

Cost Control: Smart Model Routing

Use a cheaper model for simple image questions and reserve the more expensive model for complex analysis:

python

def smart_vision_route(image_b64: str, prompt: str, complexity: str = "auto") -> str:
    """Route to appropriate model based on task complexity."""
    
    if complexity == "auto":
        # Simple heuristics for routing
        complex_keywords = ["analyze", "identify all", "detailed", "architecture", "security"]
        complexity = "high" if any(k in prompt.lower() for k in complex_keywords) else "low"
    
    model = "claude-opus-4-8" if complexity == "high" else "claude-haiku-4-5-20251001"
    
    message = client.messages.create(
        model=model,
        max_tokens=1000 if complexity == "low" else 2000,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64},
                },
                {"type": "text", "text": prompt}
            ],
        }]
    )
    
    return message.content[0].text
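
Before relying on the keyword heuristic, it helps to see how it classifies real prompts. The sketch below pulls the same rule into a standalone function (the name `classify_complexity` is an assumption) so it can be tried without calling the API.

```python
COMPLEX_KEYWORDS = ("analyze", "identify all", "detailed", "architecture", "security")


def classify_complexity(prompt: str) -> str:
    """Return "high" if the prompt contains any complexity keyword, else "low"."""
    lowered = prompt.lower()
    return "high" if any(k in lowered for k in COMPLEX_KEYWORDS) else "low"


for prompt in (
    "What color is the status badge?",
    "Analyze this architecture for security gaps",
    "Can you analyze what this button says?",
):
    print(f"{classify_complexity(prompt):>4}  {prompt}")
```

The third prompt is trivial but still routes to the expensive model because it contains "analyze", which is the kind of false positive worth watching for in routing metrics.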

Observability: Track Vision Request Metrics

python

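from prometheus_client import Counter, Histogram
import time
 
vision_requests_total = Counter("llm_vision_requests_total", "Total vision requests", ["model", "status"])
# Default histogram buckets top out at 10 s, so set explicit ones for slow vision calls
vision_latency = Histogram("llm_vision_latency_seconds", "Vision request latency", ["model"],
                           buckets=(0.5, 1, 2, 5, 10, 20, 30, 60))
# Observed value is total input tokens (image plus prompt text); default buckets are sized for seconds
vision_image_tokens = Histogram("llm_vision_image_tokens", "Input tokens per request (image + text)", ["model"],
                                buckets=(100, 250, 500, 1000, 2000, 4000, 8000))
vision_total_cost = Counter("llm_vision_total_cost_usd", "Total cost USD", ["model"])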
 
# Pricing (example - check current rates)
TOKEN_COSTS = {
    "claude-opus-4-8": {"input": 15.0, "output": 75.0},       # USD per 1M tokens
    "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
}
 
def tracked_vision_request(image_b64: str, prompt: str, model: str) -> str:
    start = time.time()
    status = "success"
    
    try:
        message = client.messages.create(
            model=model,
            max_tokens=1500,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                    {"type": "text", "text": prompt}
                ],
            }]
        )
        
        # Track tokens
        input_tokens = message.usage.input_tokens
        output_tokens = message.usage.output_tokens
        vision_image_tokens.labels(model=model).observe(input_tokens)
        
        # Track cost
        cost = (input_tokens / 1_000_000 * TOKEN_COSTS[model]["input"] + 
                output_tokens / 1_000_000 * TOKEN_COSTS[model]["output"])
        vision_total_cost.labels(model=model).inc(cost)
        
        return message.content[0].text
        
    except Exception:
        # Record the failure in the metrics, then let the caller handle the error
        status = "error"
        raise
    finally:
        vision_requests_total.labels(model=model, status=status).inc()
        vision_latency.labels(model=model).observe(time.time() - start)
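
The cost formula in `tracked_vision_request` is easy to sanity-check by hand. Here is a small worked example; the token counts and rates are illustrative placeholders, not current pricing.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """Cost of one request given per-million-token rates in USD."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Example rates only (USD per 1M tokens); substitute your provider's current pricing
cost = request_cost_usd(input_tokens=1_600, output_tokens=500,
                        input_rate=15.0, output_rate=75.0)
print(f"${cost:.4f} per request, ${cost * 10_000:.2f} per 10,000 requests")
```

At these example rates the 500 output tokens cost more than the 1,600 input tokens, so capping `max_tokens` can matter as much as shrinking the image.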

Kubernetes Deployment for Vision Service

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vision-service
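spec:
  replicas: 3
  selector:            # required for apps/v1 Deployments
    matchLabels:
      app: vision-service
  template:
    metadata:
      labels:
        app: vision-service
    spec: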
      containers:
        - name: api
          image: vision-service:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"  # images in memory can be large
              cpu: "2000m"
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic_api_key
            - name: MAX_IMAGE_SIZE_MB
              value: "10"
            - name: CELERY_BROKER_URL
              value: "redis://redis-service:6379/0"
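
The manifest sets `MAX_IMAGE_SIZE_MB`, but the service code above never reads it. One way the API might enforce the limit before queuing work is sketched below; the helper names are hypothetical, and the size is computed from the base64 length so an oversized payload is rejected without being decoded.

```python
import os


def max_image_bytes(default_mb: int = 10) -> int:
    """Read the MAX_IMAGE_SIZE_MB limit from the environment, in bytes."""
    return int(os.environ.get("MAX_IMAGE_SIZE_MB", default_mb)) * 1024 * 1024


def decoded_size(image_b64: str) -> int:
    """Decoded byte length of a padded base64 string, without decoding it."""
    padding = image_b64[-2:].count("=")
    return len(image_b64) * 3 // 4 - padding


def check_image_size(image_b64: str) -> None:
    """Raise ValueError if the decoded image would exceed the configured limit."""
    size = decoded_size(image_b64)
    limit = max_image_bytes()
    if size > limit:
        raise ValueError(f"Image is {size} bytes; limit is {limit} bytes")
```

Calling `check_image_size(request.image_b64)` at the top of the submit endpoint would keep oversized images out of the broker and within the pod's memory limit.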

Vision is one of the fastest-growing LLM use cases in DevOps — diagram analysis, runbook screenshot parsing, monitoring dashboard interpretation. Getting the async patterns right from the start pays off quickly.

Resources: Anthropic Vision guide | Celery docs
