LLM Evaluation in Production — Using Claude API as Judge to Detect Quality Regressions
LLMs degrade silently. Here's how to build a weekly eval pipeline using Claude Haiku as an LLM judge, SQLite for score tracking, and GitHub Actions to alert on quality drops.
Your LLM application has no error rate spike, no latency increase, no alerts firing. But the answers it gives have quietly gotten worse. Users are leaving. You have no idea why.
This is the LLM quality problem. Here's how to catch it.
Why LLMs Degrade Silently
Unlike traditional software, LLMs don't throw exceptions when their output quality drops. A hallucinated answer, a missed nuance, a wrong code snippet — these look like successful responses from a monitoring perspective.
Common degradation triggers:
- Model provider updates the underlying model weights
- You changed your system prompt (even a small wording change)
- Your RAG retrieval is returning worse chunks
- The user query distribution shifted (users ask different things now)
- Token limits or the context window changed, so inputs or outputs get truncated
The Evaluation Stack
Production Logs → Sample Queries → Eval Pipeline → Quality Score → Alert if drops
You need three things:
- A test set — representative queries with expected outputs
- An evaluator — something that judges quality (LLM-as-judge or deterministic)
- A baseline — the score from when things were working well
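The first two stages of that pipeline (production logs to sampled queries) might look something like this. It is a minimal sketch, assuming logs are stored as JSON Lines with a `query` field on each record; the log layout, the field name, and the helpers `sample_queries` and `to_draft_cases` are hypothetical. Sampled queries are only drafts: someone still has to fill in `expected_contains` and a category before they belong in the golden set.

```python
import json
import random


def sample_queries(log_lines, k=50, seed=0):
    """Return up to k distinct queries sampled from JSON Lines log records.

    Assumption: each line is a JSON object with a "query" string field.
    """
    seen = set()
    queries = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed log lines
        if not isinstance(record, dict):
            continue
        query = record.get("query", "")
        if not isinstance(query, str):
            continue
        query = query.strip()
        key = query.lower()  # case-insensitive de-duplication
        if query and key not in seen:
            seen.add(key)
            queries.append(query)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(queries, min(k, len(queries)))


def to_draft_cases(queries):
    """Wrap sampled queries as draft test cases with empty expectations."""
    return [
        {
            "id": f"sampled-{i}",
            "query": q,
            "expected_contains": [],  # to be filled in by a reviewer
            "category": "unlabeled",
        }
        for i, q in enumerate(queries, start=1)
    ]
```

De-duplicating before sampling keeps one very common question from crowding out the rest of the sample.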
Step 1: Build a Test Set
# eval_dataset.py
import json

# Golden test set: representative queries + expected behavior
TEST_CASES = [
    {
        "id": "k8s-oomkilled-1",
        "query": "My Kubernetes pod is OOMKilled, how do I fix it?",
        "expected_contains": ["memory limit", "resources", "limits"],
        "should_not_contain": ["CPU", "network"],
        "category": "troubleshooting"
    },
    {
        "id": "terraform-state-1",
        "query": "How do I import an existing AWS resource into Terraform state?",
        "expected_contains": ["terraform import", "resource address"],
        "category": "how-to"
    },
    {
        "id": "factual-1",
        "query": "What port does Kubernetes API server run on by default?",
        "expected_contains": ["6443"],
        "category": "factual"
    }
]


def save_test_set(path: str = "eval_dataset.json"):
    with open(path, "w") as f:
        json.dump(TEST_CASES, f, indent=2)


if __name__ == "__main__":
    # Write eval_dataset.json so run_eval.py can load it
    save_test_set()

Step 2: LLM-as-Judge Evaluator
Use a fast, cheap model (Claude Haiku) to score your main model's outputs:
# evaluator.py
import anthropic
import json
import os

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def judge_response(query: str, response: str, test_case: dict) -> dict:
    """Use Claude Haiku to judge response quality."""
    # Turn the test case's keyword expectations into criteria for the judge
    criteria = []
    if test_case.get("expected_contains"):
        criteria.append(f"Must mention: {', '.join(test_case['expected_contains'])}")
    if test_case.get("should_not_contain"):
        criteria.append(f"Must NOT mention: {', '.join(test_case['should_not_contain'])}")
    criteria_text = "\n".join(f"- {c}" for c in criteria) if criteria else "No specific criteria"

    prompt = f"""Rate this AI assistant response on a scale of 1-5.

User question: {query}

AI response: {response}

Evaluation criteria:
{criteria_text}

Score meaning:
5 = Perfect: accurate, complete, directly answers the question
4 = Good: mostly correct, minor gaps
3 = Acceptable: partially answers but missing important info
2 = Poor: wrong or misleading in key areas
1 = Fail: completely wrong or harmful

Respond with JSON only:
{{"score": <1-5>, "reason": "<one sentence>", "issues": ["<issue1>", "<issue2>"]}}"""

    response_obj = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        return json.loads(response_obj.content[0].text)
    except (json.JSONDecodeError, IndexError, AttributeError):
        # Neutral fallback when the judge reply is not valid JSON
        return {"score": 3, "reason": "Parse error", "issues": []}


def deterministic_check(response: str, test_case: dict) -> dict:
    """Fast pattern-based checks without LLM cost."""
    issues = []
    score = 5
    response_lower = response.lower()
    # Each missing expected keyword costs 1 point
    for keyword in test_case.get("expected_contains", []):
        if keyword.lower() not in response_lower:
            issues.append(f"Missing expected keyword: '{keyword}'")
            score -= 1
    # Each forbidden keyword costs 2 points
    for keyword in test_case.get("should_not_contain", []):
        if keyword.lower() in response_lower:
            issues.append(f"Contains forbidden keyword: '{keyword}'")
            score -= 2
    return {"score": max(1, score), "issues": issues}

Step 3: Run Evaluations
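One wrinkle is worth handling before running the judge across the whole test set. Even when asked for "JSON only", a model can wrap its reply in a Markdown code fence or add a sentence around it, and `judge_response` would then fall into its parse-error branch and record a 3. The following is a minimal sketch of a more tolerant parser; `extract_judge_json` is a hypothetical helper, not part of the Anthropic SDK or of the scripts above.

```python
import json


def extract_judge_json(text):
    """Parse the JSON object spanning the first '{' to the last '}' in text.

    Returns the parsed dict, or None when no valid object with an
    integer score in the range 1..5 can be recovered.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    if not 1 <= score <= 5:
        return None
    return data
```

Returning `None` instead of a default score lets the caller count or skip unparseable verdicts rather than averaging a placeholder 3 into the results.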
# run_eval.py
import anthropic
import json
import os
from datetime import datetime, timezone
from evaluator import judge_response, deterministic_check

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
SYSTEM_PROMPT = "You are a DevOps assistant. Answer questions about Kubernetes, Docker, AWS, Terraform, and CI/CD."


def get_model_response(query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text


def run_evaluation(test_cases: list[dict], use_llm_judge: bool = True) -> dict:
    results = []
    total_score = 0
    for tc in test_cases:
        response = get_model_response(tc["query"])
        # Fast deterministic check first
        det_result = deterministic_check(response, tc)
        # LLM judge for quality (optional, costs money)
        if use_llm_judge:
            llm_result = judge_response(tc["query"], response, tc)
            # Blend both evaluators with equal weight
            score = (det_result["score"] + llm_result["score"]) / 2
        else:
            score = det_result["score"]
            llm_result = {}
        results.append({
            "id": tc["id"],
            "category": tc["category"],
            "score": score,
            "deterministic": det_result,
            "llm_judge": llm_result,
            "response_preview": response[:200]
        })
        total_score += score
        print(f"  {tc['id']}: {score:.1f}/5")
    avg_score = total_score / len(test_cases) if test_cases else 0
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "avg_score": round(avg_score, 2),
        "num_tests": len(test_cases),
        "results": results,
        "pass": avg_score >= 3.5  # Configurable threshold
    }


if __name__ == "__main__":
    with open("eval_dataset.json") as f:
        test_cases = json.load(f)
    print(f"Running {len(test_cases)} evaluations...")
    report = run_evaluation(test_cases)
    print(f"\nOverall score: {report['avg_score']}/5.0")
    print(f"Status: {'✅ PASS' if report['pass'] else '❌ FAIL'}")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M")
    # Write a timestamped report plus a stable "latest" copy that the CI check reads
    for path in (f"eval_report_{stamp}.json", "eval_report_latest.json"):
        with open(path, "w") as f:
            json.dump(report, f, indent=2)

Track Over Time
# Store scores in a time series to detect regressions
import sqlite3


def store_score(run_id: str, avg_score: float, model: str):
    conn = sqlite3.connect("eval_history.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scores
        (id TEXT, timestamp TEXT, score REAL, model TEXT)
    """)
    conn.execute("INSERT INTO scores VALUES (?, datetime('now'), ?, ?)",
                 (run_id, avg_score, model))
    conn.commit()
    conn.close()


def detect_regression(threshold: float = 0.3, window_days: int = 28) -> bool:
    """Alert if the latest score dropped more than threshold below the average of earlier runs."""
    conn = sqlite3.connect("eval_history.db")
    latest_row = conn.execute(
        "SELECT rowid, score FROM scores ORDER BY timestamp DESC, rowid DESC LIMIT 1"
    ).fetchone()
    if latest_row is None:
        conn.close()
        return False  # no runs recorded yet
    latest_rowid, latest = latest_row
    # The baseline excludes the latest run. On a weekly schedule a 7-day window
    # would contain no earlier run, so the default looks back 28 days.
    baseline = conn.execute(
        "SELECT AVG(score) FROM scores "
        "WHERE rowid != ? AND timestamp > datetime('now', ?)",
        (latest_rowid, f"-{window_days} days"),
    ).fetchone()[0]
    conn.close()
    if baseline is None:
        return False  # nothing to compare against yet
    return (baseline - latest) > threshold

Run as Weekly CI Job
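For the weekly job to use the score history, something has to record each run and compare it with earlier ones. Here is a minimal sketch of such an entry point, assuming a hypothetical `check_regression.py` script that runs after `run_eval.py`. The function names, the 28-day lookback, and the idea of failing the job on either a regression or a failed threshold are illustrative assumptions. GitHub-hosted runners start from a clean machine, so `eval_history.db` would also need to be persisted between runs (for example by caching or committing it); that part is not shown.

```python
import json
import sqlite3


def record_and_check(db_path, run_id, score, model, threshold=0.3, window_days=28):
    """Store one run, then compare it with the average of earlier runs.

    Returns True when the new score is more than `threshold` below the
    average of runs recorded in the last `window_days` days.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scores "
            "(id TEXT, timestamp TEXT, score REAL, model TEXT)"
        )
        # The baseline is read before the insert, so it never includes this run.
        baseline = conn.execute(
            "SELECT AVG(score) FROM scores WHERE timestamp > datetime('now', ?)",
            (f"-{window_days} days",),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO scores VALUES (?, datetime('now'), ?, ?)",
            (run_id, score, model),
        )
        conn.commit()
    finally:
        conn.close()
    return baseline is not None and (baseline - score) > threshold


def main(report_path="eval_report_latest.json", db_path="eval_history.db"):
    """Return a process exit code: 1 on regression or failed threshold, else 0."""
    with open(report_path) as f:
        report = json.load(f)
    regressed = record_and_check(
        db_path, report["timestamp"], report["avg_score"], "claude-sonnet-4-6"
    )
    return 1 if (regressed or not report["pass"]) else 0
```

In a workflow step this could be invoked by ending the script with `sys.exit(main())`, so that a regression fails the job in the same way the threshold check does.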
# .github/workflows/llm-eval.yml
on:
  schedule:
    - cron: '0 6 * * 1'  # Every Monday at 06:00 UTC
  workflow_dispatch:
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install anthropic
      - run: python run_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail if quality dropped
        run: python -c "import json; r=json.load(open('eval_report_latest.json')); exit(0 if r['pass'] else 1)"

Catch quality regressions before users notice. A weekly eval run against a golden test set gives you an early warning system that traditional monitoring completely misses.
The Anthropic API supports both the production model and the judge model — running a 50-question eval costs under $0.50 total.
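That estimate can be sanity-checked with a few lines of arithmetic. In this rough sketch the output token counts are the `max_tokens` caps from the scripts above (500 for the answer, 200 for the judge), so they are upper bounds. The input token counts and the per-million-token prices are illustrative assumptions, not quoted figures, and should be replaced with the current numbers from the provider's pricing page. `eval_cost_usd` is a hypothetical helper.

```python
def eval_cost_usd(num_cases, in_tokens, out_tokens, in_price_per_mtok, out_price_per_mtok):
    """Cost in USD of num_cases API calls with the given per-call token counts."""
    per_call = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1_000_000
    return num_cases * per_call


# Assumed values: ~100 input tokens per question, ~700 per judge prompt,
# and placeholder prices in USD per million tokens.
answer_cost = eval_cost_usd(50, in_tokens=100, out_tokens=500,
                            in_price_per_mtok=3.0, out_price_per_mtok=15.0)
judge_cost = eval_cost_usd(50, in_tokens=700, out_tokens=200,
                           in_price_per_mtok=1.0, out_price_per_mtok=5.0)
total = answer_cost + judge_cost
print(f"answers ${answer_cost:.3f} + judge ${judge_cost:.3f} = ${total:.3f}")
```

Under these assumptions most of the cost comes from the production model's output tokens, so a longer `max_tokens` on the answer side moves the total far more than a longer judge prompt does.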