
Structured Outputs and JSON Mode for LLMs in Production

How to enforce structured JSON output from LLMs in production — Claude tool use, OpenAI JSON mode, Pydantic + Instructor validation, retry logic, schema versioning, and testing pipelines with the Anthropic SDK.

DevOpsBoys · 6 min read

An LLM that returns free text is fine for a chatbot. An LLM powering a production pipeline that downstream services depend on needs to return valid, typed, schema-validated JSON every single time. Here is how to build that reliably.

Why Unstructured Output Breaks Production

When you ship a feature that calls an LLM and parses the response, you are trusting that the model will forever return the same structure. It will not. Common failure modes:

  • Model returns valid JSON but with extra commentary before or after it
  • Field names change subtly ("user_id" becomes "userId" or "id")
  • A field that was always a string is now a number in some responses
  • Model wraps the JSON in a markdown code fence
  • Model decides to explain its reasoning before the JSON
  • On ambiguous prompts, model returns an apology instead of JSON

Each of these breaks your parsing code. Some fail loudly as exceptions and 500s; others parse cleanly and silently pass wrong data downstream.
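
To make these failure modes concrete, here is a minimal sketch of how a naive json.loads call reacts to a few of them. The sample strings are invented for illustration, not captured model output, and naive_parse is a hypothetical helper name.

```python
import json

FENCE = "`" * 3  # markdown code fence, built this way to keep the sample readable

# Hypothetical model responses illustrating the failure modes listed above.
SAMPLES = {
    "clean": '{"user_id": "42"}',
    "preamble": 'Here is the result:\n{"user_id": "42"}',
    "fenced": FENCE + 'json\n{"user_id": "42"}\n' + FENCE,
    "apology": "I'm sorry, I can't determine that from the prompt.",
    "renamed_field": '{"userId": "42"}',
    "type_drift": '{"user_id": 42}',
}


def naive_parse(text: str) -> str:
    """Classify what happens when a response goes straight into json.loads."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "parse_error"  # loud failure: surfaces as an exception
    if not isinstance(data.get("user_id"), str):
        return "wrong_data"  # quiet failure: parses fine, shape is wrong
    return "ok"


results = {name: naive_parse(text) for name, text in SAMPLES.items()}
for name, outcome in results.items():
    print(f"{name:14} -> {outcome}")
```

Only the first sample comes back as "ok". The last two are the dangerous ones, because nothing raises unless you check the shape explicitly.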

Approach 1: Claude with Tool Use (Most Reliable)

Claude's tool use (function calling) is the most dependable way to get structured output. Instead of asking Claude to "return JSON", you define a tool schema and force Claude to call it, so the answer arrives as structured tool input rather than free text.

python
import anthropic
import json
 
client = anthropic.Anthropic()
 
# Define the output schema as a tool
tools = [
    {
        "name": "create_deployment_summary",
        "description": "Create a structured summary of a deployment event",
        "input_schema": {
            "type": "object",
            "properties": {
                "service_name": {
                    "type": "string",
                    "description": "Name of the service being deployed"
                },
                "risk_level": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                    "description": "Assessed deployment risk level"
                },
                "affected_components": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of components affected by this deployment"
                },
                "recommended_action": {
                    "type": "string",
                    "description": "What the on-call engineer should do next"
                },
                "confidence_score": {
                    "type": "number",
                    "minimum": 0,
                    "maximum": 1,
                    "description": "Confidence in this assessment (0-1)"
                }
            },
            "required": ["service_name", "risk_level", "affected_components", "recommended_action", "confidence_score"]
        }
    }
]
 
def analyze_deployment(deployment_log: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=tools,
        tool_choice={"type": "tool", "name": "create_deployment_summary"},
        messages=[
            {
                "role": "user",
                "content": f"Analyze this deployment log and create a summary:\n\n{deployment_log}"
            }
        ]
    )
    
    # With tool_choice forced, the response contains a tool_use block; find it by type
    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    return tool_use_block.input  # Already a dict; validate it before trusting field values
 
result = analyze_deployment("DEPLOY api-gateway v2.4.1 -> prod. Changed: auth middleware, rate limiter. 3 pods rolling update.")
print(json.dumps(result, indent=2))
# Output follows the tool schema above; still validate it before passing it downstream

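Using tool_choice={"type": "tool", "name": "create_deployment_summary"} forces Claude to always call that specific tool, so the response always contains a structured tool_use block. No text parsing is needed: tool_use_block.input is already a Python dict. The schema strongly steers the shape, but treat the dict as untrusted input and validate it (for example with Pydantic) before passing it downstream, because a response truncated by max_tokens or an occasional off-schema value can still occur.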
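
Even with a forced tool call, a few cheap checks are worth running before the dict leaves your function. The sketch below is one way to do that, under assumptions: extract_tool_input is a hypothetical helper, and the SimpleNamespace objects are stand-ins that mimic the response attributes used here (stop_reason, content, and each block's type, name and input), not real API output.

```python
from types import SimpleNamespace


def extract_tool_input(response, tool_name: str, required: list[str]) -> dict:
    """Return the forced tool call's input after basic sanity checks."""
    # A response cut off by max_tokens may carry incomplete tool input.
    if response.stop_reason == "max_tokens":
        raise ValueError("response truncated by max_tokens")

    block = next(
        (b for b in response.content
         if b.type == "tool_use" and b.name == tool_name),
        None,
    )
    if block is None:
        raise ValueError(f"no tool_use block for {tool_name!r}")

    missing = [key for key in required if key not in block.input]
    if missing:
        raise ValueError(f"tool input missing fields: {missing}")
    return block.input


# Stand-in object mimicking the response shape (assumption, not real output).
fake_ok = SimpleNamespace(
    stop_reason="tool_use",
    content=[SimpleNamespace(
        type="tool_use",
        name="create_deployment_summary",
        input={"service_name": "api-gateway", "risk_level": "high"},
    )],
)
checked = extract_tool_input(
    fake_ok, "create_deployment_summary", ["service_name", "risk_level"]
)
print(checked["risk_level"])
```

This only checks presence of keys. Type, enum and range checks are better left to a Pydantic model like the one in the next section.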

Approach 2: Pydantic + Instructor

The Instructor library wraps an LLM client (it supports OpenAI, Anthropic and other providers; the example below uses the Anthropic client) so that responses are parsed into Pydantic models automatically, with retry logic on validation failures.

python
import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field, field_validator
from typing import Literal
import re
 
# Patch the Anthropic client with Instructor
client = instructor.from_anthropic(Anthropic())
 
class DeploymentSummary(BaseModel):
    service_name: str
    risk_level: Literal["low", "medium", "high", "critical"]
    affected_components: list[str] = Field(min_length=1)
    recommended_action: str = Field(min_length=10)
    confidence_score: float = Field(ge=0.0, le=1.0)
    
    @field_validator("service_name")
    @classmethod
    def service_name_must_be_valid(cls, v):
        if not re.match(r"^[a-z0-9-]+$", v):
            raise ValueError("service_name must be lowercase alphanumeric with hyphens")
        return v
 
def analyze_deployment(log: str) -> DeploymentSummary:
    return client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        max_retries=3,  # Instructor retries on validation failure
        messages=[{"role": "user", "content": f"Analyze this deployment:\n\n{log}"}],
        response_model=DeploymentSummary,
    )
 
summary = analyze_deployment("DEPLOY payment-service v1.9.2 -> prod ...")
print(summary.risk_level)        # e.g. "high" (the actual value depends on the model's assessment)
print(summary.confidence_score)  # e.g. 0.85
print(type(summary))             # <class '__main__.DeploymentSummary'> when run as a script

Instructor automatically retries up to max_retries times if the model returns something that fails Pydantic validation. Each retry includes the validation error in the next prompt so the model can correct itself.

Handling Partial JSON and Malformed Responses

When not using tool use or Instructor, you need a robust parser. LLMs frequently return JSON wrapped in markdown:

python
import json
import re
 
def extract_json(text: str) -> dict:
    """Extract JSON from LLM response that may contain extra text or markdown."""
    # Try direct parse first
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass
    
    # Strip markdown code fences
    # Match ```json ... ``` or ``` ... ```
    fence_pattern = r"```(?:json)?\s*\n?([\s\S]*?)\n?```"
    match = re.search(fence_pattern, text)
    if match:
        try:
            return json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            pass
    
    # Fall back to the span from the first { to the last }
    brace_start = text.find("{")
    brace_end = text.rfind("}") + 1
    if brace_start >= 0 and brace_end > brace_start:
        try:
            return json.loads(text[brace_start:brace_end])
        except json.JSONDecodeError:
            pass
    
    raise ValueError(f"Could not extract valid JSON from response: {text[:200]}...")
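
The last fallback in extract_json takes everything from the first { to the last }, so it fails when the surrounding commentary itself contains a brace. One possible alternative, sketched here with the standard library's json.JSONDecoder.raw_decode, is to try decoding at each { and return the first object that parses. The function name first_json_object is hypothetical.

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first decodable JSON object embedded in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found")


reply = 'Summary: {"risk_level": "low"}. Use {placeholders} as needed.'
print(first_json_object(reply))  # {'risk_level': 'low'}
```

On this reply, the first-to-last-brace slice would include the trailing "{placeholders}" text and fail to parse, while the scan stops at the end of the first valid object.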

Retry Logic with Schema Validation

When extraction succeeds but the content fails schema validation, retry with the error:

python
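import time

import anthropic
from pydantic import ValidationError

# Plain Anthropic client (not the Instructor-wrapped one); extract_json is the helper defined above
client = anthropic.Anthropic()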
 
def call_with_retry(prompt: str, schema_class, max_retries: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=messages,
        )
        
        raw_text = response.content[0].text
        
        try:
            data = extract_json(raw_text)
            validated = schema_class(**data)
            return validated.model_dump()
        except (ValueError, ValidationError) as e:
            last_error = str(e)
            # Add the failed attempt + error to the conversation
            messages.append({"role": "assistant", "content": raw_text})
            messages.append({
                "role": "user",
                "content": f"Your response failed validation with this error: {last_error}\n\nPlease return valid JSON matching the required schema."
            })
            
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s with the default 3 attempts
    
    raise RuntimeError(f"Failed after {max_retries} attempts. Last error: {last_error}")

Versioning Your Output Schemas

Schema definitions are contracts. When you change them, existing callers may break. Version your schemas explicitly:

python
from typing import Literal

from pydantic import BaseModel
 
class DeploymentSummaryV1(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    service_name: str
    risk_level: str
    recommended_action: str
 
class DeploymentSummaryV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    service_name: str
    risk_level: Literal["low", "medium", "high", "critical"]  # now an enum
    recommended_action: str
    affected_components: list[str]  # new field in v2
    confidence_score: float         # new field in v2
 
# Store schema version alongside response in your database
# Never parse V2 responses with V1 model — check schema_version first
def parse_response(data: dict):
    version = data.get("schema_version", "1.0")
    if version == "1.0":
        return DeploymentSummaryV1(**data)
    elif version == "2.0":
        return DeploymentSummaryV2(**data)
    else:
        raise ValueError(f"Unknown schema version: {version}")
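
Stored 1.0 records eventually need to be read by code that expects the 2.0 shape. A minimal sketch of an explicit upgrade step follows, working on plain dicts so it runs without dependencies. The function name upgrade_to_v2 and the fallback values for the two new fields are assumptions; pick whatever your downstream consumers can tolerate.

```python
ALLOWED_RISK_LEVELS = {"low", "medium", "high", "critical"}


def upgrade_to_v2(record: dict) -> dict:
    """Lift a stored 1.0 record into the 2.0 shape; pass 2.0 records through."""
    version = record.get("schema_version", "1.0")
    if version == "2.0":
        return dict(record)
    if version != "1.0":
        raise ValueError(f"Unknown schema version: {version}")

    risk = record["risk_level"].strip().lower()
    if risk not in ALLOWED_RISK_LEVELS:
        # v1 accepted any string; refuse rather than guess a mapping.
        raise ValueError(f"cannot map risk_level {record['risk_level']!r} to v2")

    return {
        "schema_version": "2.0",
        "service_name": record["service_name"],
        "risk_level": risk,
        "recommended_action": record["recommended_action"],
        # Hypothetical fallbacks: v1 never captured these fields.
        "affected_components": [],
        "confidence_score": 0.0,
    }


old = {"service_name": "api-gateway", "risk_level": " High ",
       "recommended_action": "Monitor error rates"}
print(upgrade_to_v2(old)["risk_level"])  # high
```

Keeping the upgrade as a separate function means the parsing code for each version stays strict, and the lossy parts of the migration are visible in one place.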

Testing Structured Output Pipelines

Your structured output code needs tests that cover both the happy path and the edge cases:

python
import pytest
from unittest.mock import MagicMock

# extract_json is the helper defined earlier; import it from wherever it lives in your codebase
 
def make_mock_response(text: str):
    mock = MagicMock()
    mock.content = [MagicMock(text=text)]
    return mock
 
class TestStructuredOutput:
    def test_valid_json_parsed_correctly(self):
        valid_json = '{"service_name": "api-gateway", "risk_level": "high", "affected_components": ["auth"], "recommended_action": "Roll back immediately", "confidence_score": 0.9}'
        result = extract_json(valid_json)
        assert result["risk_level"] == "high"
    
    def test_json_inside_markdown_fence(self):
        wrapped = '```json\n{"service_name": "svc", "risk_level": "low", "affected_components": ["db"], "recommended_action": "Monitor for 30 minutes", "confidence_score": 0.6}\n```'
        result = extract_json(wrapped)
        assert result["service_name"] == "svc"
    
    def test_json_with_preamble(self):
        with_text = 'Here is the analysis:\n\n{"service_name": "cache", "risk_level": "medium", "affected_components": ["redis"], "recommended_action": "Check memory usage", "confidence_score": 0.75}'
        result = extract_json(with_text)
        assert result["confidence_score"] == 0.75
    
    def test_invalid_json_raises(self):
        with pytest.raises(ValueError):
            extract_json("I cannot provide that analysis.")
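
The retry path deserves a test too, and a mocked client makes that possible without network calls. The sketch below is self-contained under assumptions: call_until_valid is a simplified, hypothetical stand-in for a retry loop that takes the client as a parameter and only checks for required keys, and the scripted replies are invented.

```python
import json
from unittest.mock import MagicMock


def make_mock_response(text: str):
    mock = MagicMock()
    mock.content = [MagicMock(text=text)]
    return mock


def call_until_valid(client, prompt: str, required: set, max_retries: int = 3) -> dict:
    """Simplified retry loop: re-ask until the JSON contains all required keys."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        # Pass a copy so the mock records the messages as they were at call time.
        raw = client.messages.create(messages=list(messages)).content[0].text
        try:
            data = json.loads(raw)
            missing = required - set(data)
            if not missing:
                return data
            error = f"missing fields: {sorted(missing)}"
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        # Feed the failed reply and the error back to the model.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": f"Validation failed: {error}"})
    raise RuntimeError(f"Failed after {max_retries} attempts")


def test_retry_recovers_after_bad_first_reply():
    client = MagicMock()
    # side_effect makes each create() call return the next scripted reply.
    client.messages.create.side_effect = [
        make_mock_response("Sorry, here you go"),
        make_mock_response('{"service_name": "svc", "risk_level": "low"}'),
    ]
    result = call_until_valid(client, "Analyze", {"service_name", "risk_level"})
    assert result["risk_level"] == "low"
    assert client.messages.create.call_count == 2
    # Second call carries the prompt, the bad reply, and the error feedback.
    second_call_messages = client.messages.create.call_args.kwargs["messages"]
    assert len(second_call_messages) == 3
```

Injecting the client as an argument, rather than using a module-level one, is what keeps this test free of patching.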

Key Production Checklist

  • Use Claude tool use with tool_choice forced to a specific tool — most reliable method
  • Always validate output against a Pydantic schema before passing downstream
  • Retry on validation failure, passing the error back to the model; 3 attempts is a reasonable starting budget, but measure your own failure rate
  • Version your schemas and store schema_version alongside every response in your database
  • Never trust response.content[0].text directly in production — always parse through schema
  • Log all raw responses before parsing for debugging
  • Set max_tokens appropriately — truncated JSON is the hardest failure mode to debug
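
As a rough illustration of the logging item, the sketch below records the raw text before any parsing happens, so the evidence survives even when json.loads raises on truncated output. The logger name, the request_id parameter and the parse_logged helper are all hypothetical.

```python
import json
import logging

logger = logging.getLogger("llm.structured_output")


def parse_logged(raw_text: str, request_id: str) -> dict:
    """Log the raw model text first, then parse it."""
    # Log before parsing so a failed parse still leaves the raw body behind.
    logger.info(
        "raw_llm_response request_id=%s chars=%d body=%r",
        request_id, len(raw_text), raw_text,
    )
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where decoding stopped.
        logger.error(
            "llm_parse_failed request_id=%s pos=%d msg=%s",
            request_id, exc.pos, exc.msg,
        )
        raise


print(parse_logged('{"service_name": "api-gateway"}', "req-1"))
```

A parse failure reported near the very end of a long body is a common sign that max_tokens cut the response off.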

Resources: Instructor library, Anthropic tool use docs, Pydantic v2 docs.
