Structured Outputs and JSON Mode for LLMs in Production
How to enforce structured JSON output from LLMs in production — Claude tool use, OpenAI JSON mode, Pydantic + Instructor validation, retry logic, schema versioning, and testing pipelines with the Anthropic SDK.
An LLM that returns free text is fine for a chatbot. An LLM powering a production pipeline that downstream services depend on needs to return valid, typed, schema-validated JSON every single time. Here is how to build that reliably.
Why Unstructured Output Breaks Production
When you ship a feature that calls an LLM and parses the response, you are trusting that the model will forever return the same structure. It will not. Common failure modes:
- Model returns valid JSON but with extra commentary before or after it
- Field names change subtly ("user_id" becomes "userId" or "id")
- A field that was always a string is now a number in some responses
- Model wraps the JSON in a markdown code fence
- Model decides to explain its reasoning before the JSON
- On ambiguous prompts, model returns an apology instead of JSON
Each of these silently breaks your parsing code and causes 500s or wrong data downstream.
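To make these failure modes concrete, here is a small sketch of how a first-draft parser reacts to each one. The sample strings and the naive_parse and classify helpers are invented for illustration; they are not taken from any real model output or library.

```python
import json

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this listing readable

# Hypothetical raw model outputs, one per failure mode listed above.
SAMPLES = {
    "clean": '{"user_id": "u-1"}',
    "commentary": 'Sure! Here is the result: {"user_id": "u-1"}',
    "renamed_field": '{"userId": "u-1"}',
    "type_drift": '{"user_id": 1}',
    "markdown_fence": FENCE + 'json\n{"user_id": "u-1"}\n' + FENCE,
    "apology": "I'm sorry, I can't determine the user from this input.",
}


def naive_parse(text: str):
    """Parse the way a first-draft integration often does: no checks at all."""
    return json.loads(text)["user_id"]


def classify(text: str) -> str:
    """Return 'ok', or the kind of failure the naive parser runs into."""
    try:
        value = naive_parse(text)
    except json.JSONDecodeError:
        return "not_json"
    except KeyError:
        return "missing_field"
    if not isinstance(value, str):
        # Parsing succeeded, so this one only surfaces later as wrong data.
        return "wrong_type"
    return "ok"


results = {name: classify(text) for name, text in SAMPLES.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Only the clean sample parses as intended. The type drift case is the dangerous one because nothing raises: the bad value simply flows downstream.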
Approach 1: Claude with Tool Use (Most Reliable)
Claude's tool use (function calling) forces structured output. Instead of asking Claude to "return JSON", you define a tool schema and Claude is forced to call it with the right shape.
import anthropic
import json

client = anthropic.Anthropic()

# Define the output schema as a tool
tools = [
    {
        "name": "create_deployment_summary",
        "description": "Create a structured summary of a deployment event",
        "input_schema": {
            "type": "object",
            "properties": {
                "service_name": {
                    "type": "string",
                    "description": "Name of the service being deployed"
                },
                "risk_level": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                    "description": "Assessed deployment risk level"
                },
                "affected_components": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of components affected by this deployment"
                },
                "recommended_action": {
                    "type": "string",
                    "description": "What the on-call engineer should do next"
                },
                "confidence_score": {
                    "type": "number",
                    "minimum": 0,
                    "maximum": 1,
                    "description": "Confidence in this assessment (0-1)"
                }
            },
            "required": ["service_name", "risk_level", "affected_components", "recommended_action", "confidence_score"]
        }
    }
]


def analyze_deployment(deployment_log: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=tools,
        tool_choice={"type": "tool", "name": "create_deployment_summary"},
        messages=[
            {
                "role": "user",
                "content": f"Analyze this deployment log and create a summary:\n\n{deployment_log}"
            }
        ]
    )
    # With tool_choice forced, the response contains a tool_use block for this tool
    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    return tool_use_block.input  # Already a dict; no JSON parsing needed


result = analyze_deployment("DEPLOY api-gateway v2.4.1 -> prod. Changed: auth middleware, rate limiter. 3 pods rolling update.")
print(json.dumps(result, indent=2))
# Output follows the shape of the schema above

Using tool_choice={"type": "tool", "name": "create_deployment_summary"} forces Claude to always call that specific tool, so the response always contains a tool call rather than free text. No string parsing is needed: tool_use_block.input is already a Python dict. Still validate it before passing it downstream, because a response cut off by max_tokens can leave the tool input incomplete.
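A forced tool call still benefits from a defensive check before the input is used. The sketch below assumes only that a response exposes stop_reason and a list of content blocks with type, name, and input attributes. ToolOutputError and extract_tool_input are hypothetical names introduced here, and the SimpleNamespace objects are stand-ins for real SDK responses so the example runs without a network call.

```python
from types import SimpleNamespace


class ToolOutputError(RuntimeError):
    """Raised when a response does not contain a usable tool call."""


def extract_tool_input(response, tool_name: str) -> dict:
    """Return the input of the named tool_use block, or raise ToolOutputError."""
    # A response cut off by max_tokens can carry an incomplete tool input.
    if response.stop_reason == "max_tokens":
        raise ToolOutputError("response truncated by max_tokens")
    for block in response.content:
        if block.type == "tool_use" and block.name == tool_name:
            if not isinstance(block.input, dict):
                raise ToolOutputError("tool input is not an object")
            return block.input
    raise ToolOutputError(f"no tool_use block named {tool_name!r}")


# Stand-in objects shaped like an SDK response (assumption: only these
# attributes are read by extract_tool_input).
ok_response = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(
            type="tool_use",
            name="create_deployment_summary",
            input={"service_name": "api-gateway", "risk_level": "high"},
        )
    ],
)
truncated_response = SimpleNamespace(stop_reason="max_tokens", content=[])
text_only_response = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="I cannot analyze this log.")],
)

print(extract_tool_input(ok_response, "create_deployment_summary"))
```

Raising a dedicated exception keeps truncation and missing tool calls distinct from ordinary validation failures, which makes them easier to alert on separately.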
Approach 2: Pydantic + Instructor for OpenAI-Style APIs
The Instructor library patches an LLM client (the Anthropic client in this example; it also supports the OpenAI client) so that responses are parsed into Pydantic models automatically, with retry logic on validation failures.
import re
from typing import Literal

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field, field_validator

# Patch the Anthropic client with Instructor
client = instructor.from_anthropic(Anthropic())


class DeploymentSummary(BaseModel):
    service_name: str
    risk_level: Literal["low", "medium", "high", "critical"]
    affected_components: list[str] = Field(min_length=1)
    recommended_action: str = Field(min_length=10)
    confidence_score: float = Field(ge=0.0, le=1.0)

    @field_validator("service_name")
    @classmethod
    def service_name_must_be_valid(cls, v):
        # fullmatch requires the whole string to match, including any trailing newline
        if not re.fullmatch(r"[a-z0-9-]+", v):
            raise ValueError("service_name must be lowercase alphanumeric with hyphens")
        return v


def analyze_deployment(log: str) -> DeploymentSummary:
    return client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        max_retries=3,  # Instructor retries on validation failure
        messages=[{"role": "user", "content": f"Analyze this deployment:\n\n{log}"}],
        response_model=DeploymentSummary,
    )


summary = analyze_deployment("DEPLOY payment-service v1.9.2 -> prod ...")
print(summary.risk_level)        # e.g. "high"
print(summary.confidence_score)  # e.g. 0.85
print(type(summary))             # <class '__main__.DeploymentSummary'> when run as a script

Instructor automatically retries up to max_retries times if the model returns something that fails Pydantic validation. Each retry includes the validation error in the next prompt so the model can correct itself.
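Validators that check a string against a regular expression depend on how the pattern is anchored. As a small illustration using only the standard library (the helper names are invented here and are not part of Instructor or Pydantic), the sketch below compares re.match with a trailing $ against re.fullmatch for a service-name rule of lowercase letters, digits, and hyphens.

```python
import re


def valid_with_match(name: str) -> bool:
    """Anchor with ^ and $; note that $ also matches just before a trailing newline."""
    return re.match(r"^[a-z0-9-]+$", name) is not None


def valid_with_fullmatch(name: str) -> bool:
    """Require the entire string to match the character class."""
    return re.fullmatch(r"[a-z0-9-]+", name) is not None


candidates = ["payment-service", "Payment-Service", "payment-service\n"]
for name in candidates:
    print(repr(name), valid_with_match(name), valid_with_fullmatch(name))
```

The two functions agree on the first two candidates but differ on the third: the match-based check accepts a name with a trailing newline, while fullmatch rejects it. Model output often carries stray whitespace, so the stricter form is usually the safer default.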
Handling Partial JSON and Malformed Responses
When not using tool use or Instructor, you need a robust parser. LLMs frequently return JSON wrapped in markdown:
import json
import re


def extract_json(text: str) -> dict:
    """Extract JSON from LLM response that may contain extra text or markdown."""
    # Try direct parse first
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass

    # Strip markdown code fences
    # Match ```json ... ``` or ``` ... ```
    fence_pattern = r"```(?:json)?\s*\n?([\s\S]*?)\n?```"
    match = re.search(fence_pattern, text)
    if match:
        try:
            return json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            pass

    # Fall back to the span from the first "{" to the last "}"
    brace_start = text.find("{")
    brace_end = text.rfind("}") + 1
    if brace_start >= 0 and brace_end > brace_start:
        try:
            return json.loads(text[brace_start:brace_end])
        except json.JSONDecodeError:
            pass

    raise ValueError(f"Could not extract valid JSON from response: {text[:200]}...")

Retry Logic with Schema Validation
When extraction succeeds but the content fails schema validation, retry with the error:
import time

import anthropic
from pydantic import ValidationError

# Plain (unpatched) client; extract_json is the helper defined above
client = anthropic.Anthropic()


def call_with_retry(prompt: str, schema_class, max_retries: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=messages,
        )
        raw_text = response.content[0].text
        try:
            data = extract_json(raw_text)
            validated = schema_class(**data)
            return validated.model_dump()
        except (ValueError, ValidationError) as e:
            last_error = str(e)
            # Add the failed attempt + error to the conversation
            messages.append({"role": "assistant", "content": raw_text})
            messages.append({
                "role": "user",
                "content": f"Your response failed validation with this error: {last_error}\n\nPlease return valid JSON matching the required schema."
            })
            if attempt < max_retries - 1:
                # Exponential backoff: 1s, then 2s with the default max_retries=3
                time.sleep(2 ** attempt)
    raise RuntimeError(f"Failed after {max_retries} attempts. Last error: {last_error}")

Versioning Your Output Schemas
Schema definitions are contracts. When you change them, existing callers may break. Version your schemas explicitly:
from typing import Literal

from pydantic import BaseModel


class DeploymentSummaryV1(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    service_name: str
    risk_level: str
    recommended_action: str


class DeploymentSummaryV2(BaseModel):
    schema_version: Literal["2.0"] = "2.0"
    service_name: str
    risk_level: Literal["low", "medium", "high", "critical"]  # now an enum
    recommended_action: str
    affected_components: list[str]  # new field in v2
    confidence_score: float  # new field in v2


# Store schema version alongside response in your database
# Never parse V2 responses with the V1 model: check schema_version first
def parse_response(data: dict):
    # Records written before versioning was introduced are treated as 1.0
    version = data.get("schema_version", "1.0")
    if version == "1.0":
        return DeploymentSummaryV1(**data)
    elif version == "2.0":
        return DeploymentSummaryV2(**data)
    else:
        raise ValueError(f"Unknown schema version: {version}")

Testing Structured Output Pipelines
Your structured output code needs tests that verify both happy path and edge cases:
import pytest
from unittest.mock import patch, MagicMock


def make_mock_response(text: str):
    mock = MagicMock()
    mock.content = [MagicMock(text=text)]
    return mock


class TestStructuredOutput:
    def test_valid_json_parsed_correctly(self):
        valid_json = '{"service_name": "api-gateway", "risk_level": "high", "affected_components": ["auth"], "recommended_action": "Roll back immediately", "confidence_score": 0.9}'
        result = extract_json(valid_json)
        assert result["risk_level"] == "high"

    def test_json_inside_markdown_fence(self):
        wrapped = '```json\n{"service_name": "svc", "risk_level": "low", "affected_components": ["db"], "recommended_action": "Monitor for 30 minutes", "confidence_score": 0.6}\n```'
        result = extract_json(wrapped)
        assert result["service_name"] == "svc"

    def test_json_with_preamble(self):
        with_text = 'Here is the analysis:\n\n{"service_name": "cache", "risk_level": "medium", "affected_components": ["redis"], "recommended_action": "Check memory usage", "confidence_score": 0.75}'
        result = extract_json(with_text)
        assert result["confidence_score"] == 0.75

    def test_invalid_json_raises(self):
        with pytest.raises(ValueError):
            extract_json("I cannot provide that analysis.")

Key Production Checklist
- Use Claude tool use with tool_choice forced to a specific tool (most reliable method)
- Always validate output against a Pydantic schema before passing downstream
- Retry on validation failure (pass error back to model); 3 retries covers 99%+ of cases
- Version your schemas and store schema_version alongside every response in your database
- Never trust response.content[0].text directly in production; always parse through schema
- Log all raw responses before parsing for debugging
- Set max_tokens appropriately: truncated JSON is the hardest failure mode to debug
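As one way to act on the logging item in this checklist, the sketch below records the raw model text before any parsing happens, so the evidence survives a parse failure. The logger name, the parse_with_raw_log helper, and the in-memory ListHandler are assumptions made for illustration; a real service would route these records to its own log sink and may need to redact sensitive content first.

```python
import json
import logging

logger = logging.getLogger("llm.raw_responses")  # hypothetical logger name


def parse_with_raw_log(raw_text: str, request_id: str) -> dict:
    """Record the raw model output first, then parse it."""
    logger.info("request_id=%s raw=%r", request_id, raw_text)
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        logger.error("request_id=%s parse failed", request_id)
        raise


class ListHandler(logging.Handler):
    """Collects log messages in memory; stands in for a real log sink."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A response cut off mid-field, as happens when max_tokens is too low.
try:
    parse_with_raw_log('{"service_name": "cache", "risk_le', request_id="req-1")
except json.JSONDecodeError:
    pass

print(handler.lines)
```

Because the raw text is logged before json.loads runs, the truncated payload is still available for debugging even though parsing raised.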
Resources: Instructor library, Anthropic tool use docs, Pydantic v2 docs.