
LLM Evaluation in Production — Using Claude API as Judge to Detect Quality Regressions

LLMs degrade silently. Here's how to build a weekly eval pipeline using Claude Haiku as an LLM judge, SQLite for score tracking, and GitHub Actions to alert on quality drops.

DevOpsBoys | Jun 11, 2026 | 5 min read

Your LLM application has no error rate spike, no latency increase, no alerts firing. But the answers it gives have quietly gotten worse. Users are leaving. You have no idea why.

This is the LLM quality problem. Here's how to catch it.


Why LLMs Degrade Silently

Unlike traditional software, LLMs don't throw exceptions when their output quality drops. A hallucinated answer, a missed nuance, a wrong code snippet — these look like successful responses from a monitoring perspective.

Common degradation triggers:

  • Model provider updates the underlying model weights
  • You changed your system prompt (even a small wording change)
  • Your RAG retrieval is returning worse chunks
  • The user query distribution shifted (users ask different things now)
  • Token limits or context window changes
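
Several of these triggers are changes you make yourself. One way to make them visible later is to record a fingerprint of the prompt and model settings with every eval run, so a score drop can be matched against a config change. A minimal sketch, assuming a hypothetical `config_fingerprint` helper; the field names are illustrative and not part of the pipeline below:

```python
import hashlib
import json


def config_fingerprint(system_prompt: str, model: str, max_tokens: int) -> str:
    """Return a short, stable hash of the settings that shape model output."""
    payload = json.dumps(
        {"system_prompt": system_prompt, "model": model, "max_tokens": max_tokens},
        sort_keys=True,  # stable key order keeps the hash reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Storing this value next to each score means that a one-word prompt edit shows up as a new fingerprint, which narrows down whether a regression came from your side or from the provider's.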

The Evaluation Stack

Production Logs → Sample Queries → Eval Pipeline → Quality Score → Alert if drops

You need three things:

  1. A test set — representative queries with expected outputs
  2. An evaluator — something that judges quality (LLM-as-judge or deterministic)
  3. A baseline — the score from when things were working well
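
These three pieces combine into a single check: score the test set, compare the result with the baseline, and flag a drop beyond some tolerance. A minimal sketch with hypothetical names (`EvalSummary`, `is_regression`) and a tolerance chosen purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    avg_score: float  # mean score across the test set, on the 1-5 scale
    num_tests: int


def is_regression(current: EvalSummary, baseline: EvalSummary, tolerance: float = 0.3) -> bool:
    """True when the current average falls more than `tolerance` below the baseline."""
    return (baseline.avg_score - current.avg_score) > tolerance
```

With a baseline of 4.4, a run scoring 4.0 would be flagged under this tolerance, while a run scoring 4.2 would not.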

Step 1: Build a Test Set

python
# eval_dataset.py
import json
 
# Golden test set — representative queries + expected behavior
TEST_CASES = [
    {
        "id": "k8s-oomkilled-1",
        "query": "My Kubernetes pod is OOMKilled, how do I fix it?",
        "expected_contains": ["memory limit", "resources", "limits"],
        "should_not_contain": ["CPU", "network"],
        "category": "troubleshooting"
    },
    {
        "id": "terraform-state-1", 
        "query": "How do I import an existing AWS resource into Terraform state?",
        "expected_contains": ["terraform import", "resource address"],
        "category": "how-to"
    },
    {
        "id": "factual-1",
        "query": "What port does Kubernetes API server run on by default?",
        "expected_contains": ["6443"],
        "category": "factual"
    }
]
 
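def save_test_set(path: str = "eval_dataset.json"):
    """Write the golden test set to disk so run_eval.py can load it."""
    with open(path, "w") as f:
        json.dump(TEST_CASES, f, indent=2)
 
# Run `python eval_dataset.py` once to generate eval_dataset.json
if __name__ == "__main__":
    save_test_set()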
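
A golden set tends to grow by hand, so it may be worth checking its shape before spending API calls on it. The sketch below assumes a hypothetical `validate_test_cases` helper and treats `id`, `query`, and `category` as required because the later scripts read those keys; the exact rules are an assumption, not part of the original pipeline:

```python
REQUIRED_FIELDS = {"id", "query", "category"}


def validate_test_cases(test_cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks usable."""
    problems = []
    seen_ids = set()
    for index, tc in enumerate(test_cases):
        label = tc.get("id", f"<case #{index}>")
        missing = REQUIRED_FIELDS - set(tc)
        if missing:
            problems.append(f"{label}: missing fields {sorted(missing)}")
        if "id" in tc:
            if tc["id"] in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(tc["id"])
        # Without keyword criteria the deterministic check always returns 5
        if not tc.get("expected_contains") and not tc.get("should_not_contain"):
            problems.append(f"{label}: no keyword criteria")
    return problems
```

Calling `validate_test_cases(TEST_CASES)` on the three cases above should return an empty list.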

Step 2: LLM-as-Judge Evaluator

Use a fast, cheap model (Claude Haiku) to score your main model's outputs:

python
# evaluator.py
import anthropic
import json
import os
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
def judge_response(query: str, response: str, test_case: dict) -> dict:
    """Use Claude Haiku to judge response quality."""
    
    criteria = []
    if test_case.get("expected_contains"):
        criteria.append(f"Must mention: {', '.join(test_case['expected_contains'])}")
    if test_case.get("should_not_contain"):
        criteria.append(f"Must NOT mention: {', '.join(test_case['should_not_contain'])}")
    
    criteria_text = "\n".join(f"- {c}" for c in criteria) if criteria else "No specific criteria"
    
    prompt = f"""Rate this AI assistant response on a scale of 1-5.
 
User question: {query}
 
AI response: {response}
 
Evaluation criteria:
{criteria_text}
 
Score meaning:
5 = Perfect — accurate, complete, directly answers the question
4 = Good — mostly correct, minor gaps
3 = Acceptable — partially answers but missing important info
2 = Poor — wrong or misleading in key areas
1 = Fail — completely wrong or harmful
 
Respond with JSON only:
{{"score": <1-5>, "reason": "<one sentence>", "issues": ["<issue1>", "<issue2>"]}}"""
 
    response_obj = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    
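    try:
        return json.loads(response_obj.content[0].text)
    except (json.JSONDecodeError, IndexError, AttributeError):
        # The judge returned something that is not valid JSON. Fall back to a
        # neutral score and record the failure so it is visible in the report.
        return {"score": 3, "reason": "Parse error", "issues": ["judge output was not valid JSON"]}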
 
 
def deterministic_check(response: str, test_case: dict) -> dict:
    """Fast pattern-based checks without LLM cost."""
    issues = []
    score = 5
    
    response_lower = response.lower()
    
    for keyword in test_case.get("expected_contains", []):
        if keyword.lower() not in response_lower:
            issues.append(f"Missing expected keyword: '{keyword}'")
            score -= 1
    
    for keyword in test_case.get("should_not_contain", []):
        if keyword.lower() in response_lower:
            issues.append(f"Contains forbidden keyword: '{keyword}'")
            score -= 2
    
    return {"score": max(1, score), "issues": issues}
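
Judge models sometimes wrap their JSON in a Markdown code fence or add a sentence around it, in which case calling `json.loads` on the raw text fails. Below is a hedged sketch of a more tolerant parser. `parse_judge_output` is a hypothetical helper, not part of the Anthropic SDK, and it returns `None` instead of a default score so that parse failures can be counted separately rather than averaged in:

```python
import json
import re
from typing import Optional


def parse_judge_output(text: str) -> Optional[dict]:
    """Extract the first-to-last-brace JSON object from judge text, or None."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = parsed.get("score")
    # Reject booleans, non-numbers, and values outside the 1-5 rubric
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 1 <= score <= 5:
        return None
    return parsed
```

A caller could skip cases where this returns `None` and report how many were skipped, which avoids a run of malformed judge replies quietly pulling every score toward the middle.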

Step 3: Run Evaluations

python
# run_eval.py
import anthropic
import json
import os
from datetime import datetime
from evaluator import judge_response, deterministic_check
 
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
SYSTEM_PROMPT = "You are a DevOps assistant. Answer questions about Kubernetes, Docker, AWS, Terraform, and CI/CD."
 
def get_model_response(query: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text
 
def run_evaluation(test_cases: list[dict], use_llm_judge: bool = True) -> dict:
    results = []
    total_score = 0
    
    for tc in test_cases:
        response = get_model_response(tc["query"])
        
        # Fast deterministic check first
        det_result = deterministic_check(response, tc)
        
        # LLM judge for quality (optional, costs money)
        if use_llm_judge:
            llm_result = judge_response(tc["query"], response, tc)
            score = (det_result["score"] + llm_result["score"]) / 2
        else:
            score = det_result["score"]
            llm_result = {}
        
        results.append({
            "id": tc["id"],
            "category": tc["category"],
            "score": score,
            "deterministic": det_result,
            "llm_judge": llm_result,
            "response_preview": response[:200]
        })
        
        total_score += score
        print(f"  {tc['id']}: {score:.1f}/5")
    
    avg_score = total_score / len(test_cases) if test_cases else 0
    
    return {
        "timestamp": datetime.utcnow().isoformat(),
        "avg_score": round(avg_score, 2),
        "num_tests": len(test_cases),
        "results": results,
        "pass": avg_score >= 3.5   # Configurable threshold
    }
 
if __name__ == "__main__":
    with open("eval_dataset.json") as f:
        test_cases = json.load(f)
    
    print(f"Running {len(test_cases)} evaluations...")
    report = run_evaluation(test_cases)
    
    print(f"\nOverall score: {report['avg_score']}/5.0")
    print(f"Status: {'✅ PASS' if report['pass'] else '❌ FAIL'}")
    
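    # Keep a timestamped copy for history, plus a stable filename that the
    # CI job can read without knowing the timestamp
    timestamp = datetime.utcnow().strftime('%Y%m%d_%H%M')
    for path in (f"eval_report_{timestamp}.json", "eval_report_latest.json"):
        with open(path, "w") as f:
            json.dump(report, f, indent=2)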
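
A single average can hide a drop that is confined to one kind of question. Since every result already carries a `category`, the report can be broken down per category. A small sketch, assuming a hypothetical `scores_by_category` helper that takes the `results` list from the report:

```python
from collections import defaultdict


def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average the per-test scores within each category."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["category"]].append(result["score"])
    return {cat: round(sum(vals) / len(vals), 2) for cat, vals in buckets.items()}
```

For example, `scores_by_category(report["results"])` might show "factual" holding steady while "troubleshooting" slips, which points at where to look first.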

Track Over Time

python
# Store scores in a time-series to detect regressions
import sqlite3
 
def store_score(run_id: str, avg_score: float, model: str):
    conn = sqlite3.connect("eval_history.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scores 
        (id TEXT, timestamp TEXT, score REAL, model TEXT)
    """)
    conn.execute("INSERT INTO scores VALUES (?, datetime('now'), ?, ?)",
                 (run_id, avg_score, model))
    conn.commit()
    conn.close()
 
def detect_regression(threshold: float = 0.3, baseline_runs: int = 4) -> bool:
    """Alert if the latest score dropped more than threshold below the
    average of the runs that came before it."""
    conn = sqlite3.connect("eval_history.db")
    rows = conn.execute(
        "SELECT score FROM scores ORDER BY timestamp DESC LIMIT ?",
        (baseline_runs + 1,)
    ).fetchall()
    conn.close()
    
    # Need the latest run plus at least one earlier run to compare against
    if len(rows) < 2:
        return False
    
    latest = rows[0][0]
    # Exclude the latest run from the baseline. With a weekly job, a
    # "last 7 days" average would contain little more than the latest score.
    previous = [row[0] for row in rows[1:]]
    baseline = sum(previous) / len(previous)
    return (baseline - latest) > threshold
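
The average tells you that quality dropped, not which test case caused it. Storing one row per test case makes it possible to compare two runs directly. A sketch under assumptions: the `case_scores` table and both function names are hypothetical, and `results` is the list produced by `run_evaluation`:

```python
import sqlite3


def store_case_scores(conn: sqlite3.Connection, run_id: str, results: list[dict]) -> None:
    """Persist one row per test case for the given run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS case_scores (run_id TEXT, case_id TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO case_scores VALUES (?, ?, ?)",
        [(run_id, r["id"], r["score"]) for r in results],
    )
    conn.commit()


def regressed_cases(
    conn: sqlite3.Connection, previous_run: str, current_run: str, threshold: float = 1.0
) -> list[tuple]:
    """Return (case_id, previous_score, current_score) for cases that dropped
    by at least `threshold`, largest drop first."""
    rows = conn.execute(
        """
        SELECT cur.case_id, prev.score, cur.score
        FROM case_scores AS cur
        JOIN case_scores AS prev ON prev.case_id = cur.case_id
        WHERE cur.run_id = ? AND prev.run_id = ?
          AND (prev.score - cur.score) >= ?
        ORDER BY (prev.score - cur.score) DESC
        """,
        (current_run, previous_run, threshold),
    )
    return rows.fetchall()
```

Note that on a hosted CI runner the database file is discarded when the job ends, so the history only accumulates if the file is persisted somewhere between runs.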

Run as Weekly CI Job

yaml
# .github/workflows/llm-eval.yml
on:
  schedule:
    - cron: '0 6 * * 1'   # Every Monday at 06:00 UTC
  workflow_dispatch:
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install anthropic
      - run: python run_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Fail if quality dropped
        run: python -c "import json; r=json.load(open('eval_report_latest.json')); exit(0 if r['pass'] else 1)"
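
The inline `python -c` gate can also live in a small script, which is easier to extend and to test. A sketch, assuming a hypothetical `check_report.py` module that picks the most recently written `eval_report_*.json` in a directory rather than relying on one fixed file name:

```python
import json
from pathlib import Path
from typing import Optional


def latest_report(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified eval report, or None if there is none."""
    reports = list(Path(directory).glob("eval_report_*.json"))
    if not reports:
        return None
    return max(reports, key=lambda p: p.stat().st_mtime)


def gate(directory: str = ".") -> int:
    """Return a process exit code: 0 on pass, 1 on fail or missing report."""
    report_path = latest_report(directory)
    if report_path is None:
        print("No eval report found")
        return 1
    report = json.loads(report_path.read_text())
    print(f"{report_path.name}: avg_score={report['avg_score']} pass={report['pass']}")
    return 0 if report["pass"] else 1
```

The workflow step could then run `python -c "import sys, check_report; sys.exit(check_report.gate())"`. Treating a missing report as a failure is a design choice here: a job that produced no report should not pass silently.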

Catch quality regressions before users notice. A weekly eval run against a golden test set gives you an early warning system that traditional monitoring completely misses.

The Anthropic API serves both the production model and the judge model, so a single API key covers the whole pipeline, and running a 50-question eval costs under $0.50 total.
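
The cost of a run can be sanity-checked with simple arithmetic: tokens per call times price per token, multiplied by the number of test cases, once for each model. A sketch with a hypothetical `estimate_eval_cost` helper. The token counts and prices in the example that follows are placeholders, not quoted rates, so substitute current values from the provider's pricing page:

```python
def estimate_eval_cost(
    num_cases: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated cost in dollars for one model across the whole test set.

    Prices are expressed in dollars per million tokens.
    """
    per_case = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return num_cases * per_case
```

With placeholder values of 50 cases, 100 input tokens, 500 output tokens, and prices of 1.0 and 10.0 dollars per million tokens, the estimate comes to about $0.26 for that model. Run the same calculation for the judge and add the two.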
