LLM Evaluation in Production — Using Claude API as Judge to Detect Quality Regressions

LLMs degrade silently. Here's how to build a weekly eval pipeline using Claude Haiku as an LLM judge, SQLite for score tracking, and GitHub Actions to alert on quality drops.

Your LLM application has no error rate spike, no latency increase, no alerts firing. But the answers it gives have quietly gotten worse. Users are leaving. You have no idea why.

This is the LLM quality problem. Here's how to catch it.

Why LLMs Degrade Silently

Unlike traditional software, LLMs don't throw exceptions when their output quality drops. A hallucinated answer, a missed nuance, a wrong code snippet — these look like successful responses from a monitoring perspective.

Common degradation triggers:

The model provider updates the underlying model weights
You change your system prompt (even a small wording change counts)
Your RAG retrieval starts returning worse chunks
The user query distribution shifts (users ask different things now)
Token limits or the context window size change (which can truncate prompts or answers)
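
Two of these triggers, a changed system prompt and a changed model ID, are cheap to record alongside each eval run, which makes a later score drop easier to attribute. The following is a minimal sketch, not part of the original pipeline; the `config_fingerprint` helper and its choice of fields are illustrative assumptions.

```python
import hashlib
import json


def config_fingerprint(model, system_prompt, max_tokens):
    """Return a short, stable hash of the settings that shape model output."""
    payload = json.dumps(
        {"model": model, "system_prompt": system_prompt, "max_tokens": max_tokens},
        sort_keys=True,  # stable key order so equal settings hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Storing this value with each score means that two runs with different fingerprints were produced by different configurations, while a drop between runs with the same fingerprint points at the provider, the retrieval layer, or the queries.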

The Evaluation Stack

Production Logs → Sample Queries → Eval Pipeline → Quality Score → Alert if drops

You need three things:

A test set — representative queries with expected outputs
An evaluator — something that judges quality (LLM-as-judge or deterministic)
A baseline — the score from when things were working well
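
The first arrow in the pipeline above, from production logs to sample queries, can be sketched as follows. This assumes logs are stored as JSON Lines with a `query` field; that schema, the `sample_queries` helper, and the fixed seed are illustrative assumptions rather than anything the pipeline prescribes.

```python
import json
import random


def sample_queries(log_lines, k=50, seed=0):
    """Return up to k distinct user queries drawn at random from JSONL log lines.

    Assumes each line is a JSON object with a "query" string (hypothetical schema).
    """
    seen = set()
    unique = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole run
        query = record.get("query") if isinstance(record, dict) else None
        if not isinstance(query, str) or not query.strip():
            continue
        key = query.strip().lower()  # case-insensitive de-duplication
        if key not in seen:
            seen.add(key)
            unique.append(query.strip())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(unique, min(k, len(unique)))
```

Sampled queries still need expected behavior written for them by hand before they can join the golden test set.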

Step 1: Build a Test Set

python

# eval_dataset.py
import json
 
# Golden test set — representative queries + expected behavior
TEST_CASES = [
    {
        "id": "k8s-oomkilled-1",
        "query": "My Kubernetes pod is OOMKilled, how do I fix it?",
        "expected_contains": ["memory limit", "resources", "limits"],
        "should_not_contain": ["CPU", "network"],
        "category": "troubleshooting"
    },
    {
        "id": "terraform-state-1", 
        "query": "How do I import an existing AWS resource into Terraform state?",
        "expected_contains": ["terraform import", "resource address"],
        "category": "how-to"
    },
    {
        "id": "factual-1",
        "query": "What port does Kubernetes API server run on by default?",
        "expected_contains": ["6443"],
        "category": "factual"
    }
]
 
def save_test_set(path: str = "eval_dataset.json"):
    with open(path, "w") as f:
        json.dump(TEST_CASES, f, indent=2)
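
Because the eval runner indexes each case by `id`, `query`, and `category`, a malformed entry would only surface as a `KeyError` partway through a paid run. A small validation pass before saving can catch that earlier. This is a sketch under the assumption that those three fields are required and the two keyword lists are optional; `validate_test_cases` is a hypothetical helper.

```python
def validate_test_cases(test_cases):
    """Return a list of human-readable problems; an empty list means the set is usable."""
    problems = []
    seen_ids = set()
    for index, tc in enumerate(test_cases):
        label = tc.get("id") or f"<case #{index}>"
        for field in ("id", "query", "category"):
            if not tc.get(field):
                problems.append(f"{label}: missing required field '{field}'")
        case_id = tc.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case_id)
        for field in ("expected_contains", "should_not_contain"):
            value = tc.get(field, [])
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                problems.append(f"{label}: '{field}' must be a list of strings")
    return problems
```

Calling `validate_test_cases(TEST_CASES)` before `save_test_set()` and refusing to save when the list is non-empty keeps a broken case from reaching the weekly run.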

Step 2: LLM-as-Judge Evaluator

Use a fast, cheap model (Claude Haiku) to score your main model's outputs:

python

# evaluator.py
import anthropic
import json
import os
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
def judge_response(query: str, response: str, test_case: dict) -> dict:
    """Use Claude Haiku to judge response quality."""
    
    criteria = []
    if test_case.get("expected_contains"):
        criteria.append(f"Must mention: {', '.join(test_case['expected_contains'])}")
    if test_case.get("should_not_contain"):
        criteria.append(f"Must NOT mention: {', '.join(test_case['should_not_contain'])}")
    
    criteria_text = "\n".join(f"- {c}" for c in criteria) if criteria else "No specific criteria"
    
    prompt = f"""Rate this AI assistant response on a scale of 1-5.
 
User question: {query}
 
AI response: {response}
 
Evaluation criteria:
{criteria_text}
 
Score meaning:
5 = Perfect — accurate, complete, directly answers the question
4 = Good — mostly correct, minor gaps
3 = Acceptable — partially answers but missing important info
2 = Poor — wrong or misleading in key areas
1 = Fail — completely wrong or harmful
 
Respond with JSON only:
{{"score": <1-5>, "reason": "<one sentence>", "issues": ["<issue1>", "<issue2>"]}}"""
 
    response_obj = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    
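    try:
        return json.loads(response_obj.content[0].text)
    except (json.JSONDecodeError, IndexError, AttributeError):
        # The judge returned something other than bare JSON; fall back to a neutral score
        return {"score": 3, "reason": "Parse error", "issues": []}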
 
 
def deterministic_check(response: str, test_case: dict) -> dict:
    """Fast pattern-based checks without LLM cost."""
    issues = []
    score = 5
    
    response_lower = response.lower()
    
    for keyword in test_case.get("expected_contains", []):
        if keyword.lower() not in response_lower:
            issues.append(f"Missing expected keyword: '{keyword}'")
            score -= 1
    
    for keyword in test_case.get("should_not_contain", []):
        if keyword.lower() in response_lower:
            issues.append(f"Contains forbidden keyword: '{keyword}'")
            score -= 2
    
    return {"score": max(1, score), "issues": issues}
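
A judge model may wrap its JSON in a Markdown code fence or add a sentence around it, in which case a plain `json.loads` on the raw text fails. The sketch below shows one tolerant way to parse the judge output. The `parse_judge_output` helper is hypothetical; it returns `None` on failure so the caller can decide whether to skip the case or substitute a default.

```python
import json
import re


def parse_judge_output(text):
    """Extract {"score", "reason", "issues"} from judge text, or return None.

    Tolerates surrounding prose or a Markdown code fence around the JSON object.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)  # first "{" to last "}"
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # Reject booleans, non-numbers, and anything outside the 1-5 rubric.
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 1 <= score <= 5:
        return None
    issues = data.get("issues")
    return {
        "score": score,
        "reason": str(data.get("reason", "")),
        "issues": issues if isinstance(issues, list) else [],
    }
```

Excluding unparseable cases from the average, rather than scoring them 3, avoids pulling the overall score toward the middle when the judge misbehaves.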

Step 3: Run Evaluations

python

# run_eval.py
import anthropic
import json
import os
from datetime import datetime
from evaluator import judge_response, deterministic_check
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
SYSTEM_PROMPT = "You are a DevOps assistant. Answer questions about Kubernetes, Docker, AWS, Terraform, and CI/CD."
 
def get_model_response(query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text
 
def run_evaluation(test_cases: list[dict], use_llm_judge: bool = True) -> dict:
    results = []
    total_score = 0
    
    for tc in test_cases:
        response = get_model_response(tc["query"])
        
        # Fast deterministic check first
        det_result = deterministic_check(response, tc)
        
        # LLM judge for quality (optional, costs money)
        if use_llm_judge:
            llm_result = judge_response(tc["query"], response, tc)
            score = (det_result["score"] + llm_result["score"]) / 2
        else:
            score = det_result["score"]
            llm_result = {}
        
        results.append({
            "id": tc["id"],
            "category": tc["category"],
            "score": score,
            "deterministic": det_result,
            "llm_judge": llm_result,
            "response_preview": response[:200]
        })
        
        total_score += score
        print(f"  {tc['id']}: {score:.1f}/5")
    
    avg_score = total_score / len(test_cases) if test_cases else 0
    
    return {
        "timestamp": datetime.utcnow().isoformat(),
        "avg_score": round(avg_score, 2),
        "num_tests": len(test_cases),
        "results": results,
        "pass": avg_score >= 3.5   # Configurable threshold
    }
 
if __name__ == "__main__":
    with open("eval_dataset.json") as f:
        test_cases = json.load(f)
    
    print(f"Running {len(test_cases)} evaluations...")
    report = run_evaluation(test_cases)
    
    print(f"\nOverall score: {report['avg_score']}/5.0")
    print(f"Status: {'✅ PASS' if report['pass'] else '❌ FAIL'}")
    
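    # Write a timestamped report for history, plus a fixed-name copy that the CI gate reads
    stamp = datetime.utcnow().strftime('%Y%m%d_%H%M')
    for path in (f"eval_report_{stamp}.json", "eval_report_latest.json"):
        with open(path, "w") as f:
            json.dump(report, f, indent=2)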
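
A single overall average can hide a regression that is confined to one category, for example factual questions failing while how-to answers stay strong. Since each result already carries its `category`, a per-category breakdown is a small addition. This is a sketch; `scores_by_category` is a hypothetical helper that operates on `report["results"]`.

```python
from collections import defaultdict


def scores_by_category(results):
    """Average the per-case scores for each category found in the results list."""
    buckets = defaultdict(list)
    for item in results:
        buckets[item["category"]].append(item["score"])
    return {cat: round(sum(vals) / len(vals), 2) for cat, vals in sorted(buckets.items())}
```

Printing `scores_by_category(report["results"])` next to the overall score shows at a glance which kind of question degraded.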

Track Over Time

python

# Store scores in a time-series to detect regressions
import sqlite3
 
def store_score(run_id: str, avg_score: float, model: str):
    conn = sqlite3.connect("eval_history.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scores 
        (id TEXT, timestamp TEXT, score REAL, model TEXT)
    """)
    conn.execute("INSERT INTO scores VALUES (?, datetime('now'), ?, ?)",
                 (run_id, avg_score, model))
    conn.commit()
    conn.close()
 
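def detect_regression(threshold: float = 0.3, window_days: int = 7) -> bool:
    """Alert if the latest score fell more than threshold below the recent average."""
    conn = sqlite3.connect("eval_history.db")
    # Most recent run; rowid breaks ties between runs stored in the same second
    latest_row = conn.execute(
        "SELECT rowid, score FROM scores ORDER BY timestamp DESC, rowid DESC LIMIT 1"
    ).fetchone()
    if latest_row is None:  # no runs recorded yet
        conn.close()
        return False
    latest_id, latest = latest_row

    # Baseline excludes the latest run so it cannot dilute its own drop.
    # With a weekly schedule, pass a wider window (e.g. 28) so earlier runs are included.
    baseline = conn.execute(
        "SELECT AVG(score) FROM scores WHERE timestamp > datetime('now', ?) AND rowid != ?",
        (f"-{window_days} days", latest_id),
    ).fetchone()[0]
    conn.close()

    if baseline is None:  # no earlier runs inside the window
        return False
    return (baseline - latest) > threshold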
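
One subtlety when comparing the latest score against a recent average: if the averaging window contains the latest run itself, that run pulls the baseline toward its own value and shrinks the measured drop. The sketch below illustrates the effect with made-up scores; `drop_vs_baseline` is a hypothetical helper and the numbers are not real measurements.

```python
from statistics import mean


def drop_vs_baseline(previous_scores, latest, include_latest=False):
    """Return baseline average minus the latest score (positive means quality fell)."""
    window = previous_scores + [latest] if include_latest else previous_scores
    return round(mean(window) - latest, 3)


previous = [4.2, 4.2, 4.2]  # illustrative scores only
latest = 3.8
clean = drop_vs_baseline(previous, latest)                         # baseline excludes latest
diluted = drop_vs_baseline(previous, latest, include_latest=True)  # baseline includes latest
print(clean > 0.3, diluted > 0.3)
```

With a threshold of 0.3, the clean comparison flags the drop while the diluted one does not, which is why the baseline is best computed from earlier runs only.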

Run as Weekly CI Job

yaml

# .github/workflows/llm-eval.yml
on:
  schedule:
    - cron: '0 6 * * 1'   # Every Monday at 06:00 UTC (GitHub Actions schedules run in UTC)
  workflow_dispatch:
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install anthropic
      - run: python run_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail if quality dropped
        run: python -c "import json; r=json.load(open('eval_report_latest.json')); exit(0 if r['pass'] else 1)"
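
The inline gate above reports only pass or fail, so a failed job gives no hint about which cases dragged the score down. A slightly more informative gate could look like the following sketch. The `gate` function and the `min_case_score` cutoff are assumptions for illustration; the report fields it reads (`results`, `avg_score`, `pass`) are the ones produced by `run_eval.py`.

```python
def gate(report, min_case_score=3.0):
    """Return (exit_code, lines): exit_code is 0 on pass and 1 on fail.

    lines lists the cases scoring below min_case_score, worst first,
    followed by the overall average.
    """
    low = sorted(
        (r for r in report["results"] if r["score"] < min_case_score),
        key=lambda r: r["score"],
    )
    lines = [f"LOW  {r['id']} ({r['category']}): {r['score']:.1f}/5" for r in low]
    lines.append(f"Average: {report['avg_score']}/5.0")
    return (0 if report["pass"] else 1), lines
```

A workflow step could load the report with `json.load`, print the returned lines, and pass the exit code to `sys.exit` so the job log names the failing cases.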

Catch quality regressions before users notice. A weekly eval run against a golden test set gives you an early warning system that traditional monitoring completely misses.

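Both the production model and the judge model are served through the Anthropic API, so a single API key covers the whole pipeline. A 50-question eval run should cost under $0.50 in total, though the exact figure depends on response length and current pricing.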
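
To check the cost of a run against your own numbers, the token counts that the Messages API returns on each response (`usage.input_tokens` and `usage.output_tokens`) can be summed and multiplied by the current per-token prices. The sketch below assumes you collect those counts yourself; `estimate_cost_usd` is a hypothetical helper and the prices must be supplied from the provider's price list, since none are hard-coded here.

```python
def estimate_cost_usd(calls, price_in_per_mtok, price_out_per_mtok):
    """Estimate spend for a list of (input_tokens, output_tokens) pairs.

    Prices are in US dollars per million tokens and are supplied by the caller;
    this function contains no real rates.
    """
    total_in = sum(tokens_in for tokens_in, _ in calls)
    total_out = sum(tokens_out for _, tokens_out in calls)
    return (total_in * price_in_per_mtok + total_out * price_out_per_mtok) / 1_000_000
```

Run it once for the production model calls and once for the judge calls, each with that model's prices, and add the two results.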
