Building a Human-in-the-Loop Feedback Pipeline for LLMs in Production
Automated evals catch some quality regressions, but real user feedback catches what your test set never anticipated. Here's how to build a feedback loop that actually improves prompts and routing over time.
Automated evaluation (RAGAS scores, regression test suites) tells you whether output matches patterns you anticipated. It can't tell you about failure modes you never thought to test for — the question phrasing nobody on your team imagined, the edge case in a customer's specific workflow. A human-in-the-loop feedback pipeline closes that gap by routing real signal from actual usage back into improvement decisions.
The Pipeline
LLM response shown to user
↓
Implicit signal (did they accept, edit, re-ask?) + Explicit signal (thumbs up/down)
↓
Logged with full context (prompt version, input, output, signal)
↓
Weekly: cluster negative feedback by pattern
↓
Route clusters to: prompt fix, retrieval fix, or "model limitation, needs human"
↓
Track whether fixes actually reduced that pattern's recurrence
Step 1: Capture Both Implicit and Explicit Signal
Explicit feedback (thumbs up/down) has low response rates — most users don't bother. Implicit signal fills the gap.
    from dataclasses import asdict, dataclass
    from enum import Enum
    import time

    class ImplicitSignal(Enum):
        ACCEPTED_AS_IS = "accepted"   # user used the output without changes
        EDITED = "edited"             # user modified before using
        REGENERATED = "regenerated"   # user asked for a different response
        ABANDONED = "abandoned"       # user left without using the output

    @dataclass
    class FeedbackEvent:
        request_id: str
        prompt_version: str
        user_input: str
        llm_output: str
        implicit_signal: ImplicitSignal
        explicit_rating: int | None   # 1 (bad) to 5 (good), or None if not given
        edited_output: str | None     # what the user changed it to, if edited
        timestamp: float              # Unix epoch seconds

    def log_feedback(event: FeedbackEvent):
        # Store in your warehouse; this becomes the dataset improvement decisions come from.
        # `db` stands in for your own warehouse client.
        row = asdict(event)
        # Store the enum's string value so SQL filters such as
        # implicit_signal IN ('regenerated', 'abandoned') match.
        row["implicit_signal"] = event.implicit_signal.value
        db.insert("llm_feedback", row)

    # Capturing the "regenerated" signal, the strongest implicit negative signal there is.
    # `app` is your web framework's application object.
    @app.post("/api/regenerate")
    def regenerate_response(request_id: str):
        original = get_original_request(request_id)
        log_feedback(FeedbackEvent(
            request_id=request_id,
            prompt_version=original.prompt_version,
            user_input=original.input,
            llm_output=original.output,
            implicit_signal=ImplicitSignal.REGENERATED,
            explicit_rating=None,
            edited_output=None,
            timestamp=time.time(),
        ))
        return generate_new_response(original.input)

REGENERATED and EDITED are your highest-signal-to-noise feedback: a user who clicked "try again" or significantly rewrote the output is telling you something failed, with zero survey friction.
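Not every edit is a failure signal: fixing a typo is different from rewriting the whole answer. One way to separate the two is to score how much of the output the user changed. The following is a minimal sketch using the standard library's `difflib`; the function names, the extra labels, and the 0.3 threshold are illustrative assumptions, not part of the pipeline above, and the threshold should be tuned on your own data.

```python
from difflib import SequenceMatcher

def edit_ratio(original: str, edited: str) -> float:
    """Rough dissimilarity score: 0.0 for identical text, 1.0 for nothing in common."""
    return 1.0 - SequenceMatcher(None, original, edited).ratio()

def classify_use(original: str, final_text: str, threshold: float = 0.3) -> str:
    """Label a use as 'accepted', 'light_edit', or 'heavy_edit' (hypothetical labels)."""
    if final_text == original:
        return "accepted"
    # The threshold is an assumed starting point, not a recommended value.
    if edit_ratio(original, final_text) >= threshold:
        return "heavy_edit"
    return "light_edit"
```

With a score like this, only heavy edits would need to count as negative signal, while light edits could be logged without being fed into the weekly clustering.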
Step 2: Weekly Clustering of Negative Signal
    from anthropic import Anthropic
    import json

    client = Anthropic()

    def cluster_negative_feedback(week_start: float, week_end: float) -> list[dict]:
        # Bounds are Unix epoch seconds, matching FeedbackEvent.timestamp.
        negative_events = db.query("""
            SELECT user_input, llm_output, implicit_signal, edited_output
            FROM llm_feedback
            WHERE timestamp BETWEEN %s AND %s
              AND (implicit_signal IN ('regenerated', 'abandoned') OR explicit_rating <= 2)
            LIMIT 200
        """, (week_start, week_end))
        if not negative_events:
            return []  # nothing to cluster this week

        # Truncate each output so 200 samples fit comfortably in one prompt.
        samples = "\n---\n".join(
            f"Input: {e.user_input}\nOutput: {e.llm_output[:300]}\nSignal: {e.implicit_signal}"
            for e in negative_events
        )
        prompt = f"""Here are {len(negative_events)} examples of LLM responses that
    received negative user feedback this week:

    {samples}

    Cluster these into 3-7 distinct failure patterns. For each pattern, provide:
    - pattern_name: short label
    - description: what's going wrong
    - example_count: how many of the samples fit this pattern
    - likely_cause: exactly one of "prompt issue", "retrieval/context issue", or "model limitation"
    - suggested_fix: specific, actionable
    Return as JSON: {{"patterns": [...]}}"""
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )
        # Assumes the reply is bare JSON; json.loads raises if the model adds prose or code fences.
        return json.loads(response.content[0].text)["patterns"]

Step 3: Route Each Pattern to the Right Fix
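Routing depends on exact `likely_cause` strings, and model replies do not always arrive as clean JSON, so it may be worth validating the clustering output before it reaches the router. The sketch below is one possible approach, not the article's own implementation: `parse_patterns`, `ALLOWED_CAUSES`, and `REQUIRED_KEYS` are hypothetical names, and the allowed labels are assumed to match the strings the router compares against.

```python
import json

# Assumed label set; keep it in sync with the strings the router checks.
ALLOWED_CAUSES = {"prompt issue", "retrieval/context issue", "model limitation"}
REQUIRED_KEYS = {"pattern_name", "description", "example_count",
                 "likely_cause", "suggested_fix"}

def parse_patterns(raw_text: str) -> list[dict]:
    """Extract the JSON object from a model reply and normalize each pattern."""
    # Tolerate prose or code fences around the JSON by slicing from the
    # first "{" to the last "}".
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    patterns = json.loads(raw_text[start:end + 1])["patterns"]

    cleaned = []
    for p in patterns:
        missing = REQUIRED_KEYS - set(p)
        if missing:
            raise ValueError(f"pattern missing keys: {sorted(missing)}")
        cause = str(p["likely_cause"]).strip().lower()
        # Unknown labels fall back to the escalation route instead of a ticket queue.
        if cause not in ALLOWED_CAUSES:
            cause = "model limitation"
        cleaned.append({**p, "likely_cause": cause})
    return cleaned
```

The clustering function could then return `parse_patterns(response.content[0].text)` instead of calling `json.loads` directly, so a malformed reply fails loudly in one place rather than producing misrouted tickets.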
    def route_pattern_for_action(pattern: dict) -> str:
        if pattern["likely_cause"] == "prompt issue":
            return create_prompt_improvement_ticket(pattern)
        elif pattern["likely_cause"] == "retrieval/context issue":
            return create_retrieval_investigation_ticket(pattern)
        else:
            # genuine model limitation: track it, but don't expect a prompt fix to solve it
            return flag_for_escalation_path(pattern)

    def create_prompt_improvement_ticket(pattern: dict) -> str:
        ticket_body = f"""
    **Failure pattern:** {pattern['pattern_name']}
    **Frequency:** {pattern['example_count']} instances this week
    **Description:** {pattern['description']}
    **Suggested fix:** {pattern['suggested_fix']}
    Action: Draft a new prompt version addressing this, test against the
    regression suite, canary roll out per the prompt versioning process.
    """
        return create_jira_ticket(title=f"Prompt fix: {pattern['pattern_name']}", body=ticket_body)

Step 4: Close the Loop and Verify Fixes Actually Worked
This step is the one teams skip most often, and it's the one that makes the whole pipeline worth running.
    def verify_fix_effectiveness(pattern_name: str, fix_deployed_date: str) -> dict:
        # Compare equal 14-day windows on either side of the deploy date.
        before = count_pattern_occurrences(pattern_name, end_date=fix_deployed_date, window_days=14)
        after = count_pattern_occurrences(pattern_name, start_date=fix_deployed_date, window_days=14)
        # With no occurrences before the fix there is no baseline, so report 0.
        reduction_pct = ((before - after) / before * 100) if before else 0
        return {
            "pattern": pattern_name,
            "occurrences_before": before,
            "occurrences_after": after,
            "reduction_pct": round(reduction_pct, 1),
            "fix_effective": reduction_pct > 30  # set your own bar for "this actually worked"
        }

Without this verification step, you're shipping fixes based on a hypothesis ("this should help") without ever confirming it did. With it, you build an actual feedback loop: pattern identified, fix shipped, fix measured, and if it didn't work, the pattern goes back into next week's clustering instead of being marked "resolved" on faith.
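One caveat with comparing raw counts: if traffic grows or shrinks between the two windows, the count can move even when the fix did nothing. A possible refinement, sketched here under the assumption that you can also count total requests per window, is to compare the pattern's share of traffic instead. `rate_reduction_pct` and its parameters are hypothetical names, not part of the function above.

```python
from typing import Optional

def rate_reduction_pct(before_hits: int, before_total: int,
                       after_hits: int, after_total: int) -> Optional[float]:
    """Percent reduction in the pattern's share of traffic, or None without a baseline."""
    # No traffic in either window, or no occurrences before the fix,
    # means there is nothing meaningful to compare against.
    if before_total <= 0 or after_total <= 0 or before_hits <= 0:
        return None
    before_rate = before_hits / before_total
    after_rate = after_hits / after_total
    return round((before_rate - after_rate) / before_rate * 100, 1)

# Same raw count (40) in both windows, but traffic doubled after the fix,
# so the pattern's share of requests halved.
print(rate_reduction_pct(40, 1000, 40, 2000))  # 50.0
```

In this example a raw-count comparison would report no change, while the rate-based view shows the pattern became half as common per request.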
Why This Matters More Than Pre-Launch Evals
Eval suites are built from cases your team imagined in advance. Real usage finds the cases nobody imagined — the specific phrasing your actual users use, the edge cases in their actual workflows, the failure modes that only emerge at the volume and diversity of real traffic. A human-in-the-loop pipeline is how production LLM systems keep improving after launch, instead of slowly drifting out of sync with how people actually use them.
Set up the automated eval layer this complements: LLM Evaluation and Testing in Production