RAG Pipeline Evaluation with RAGAS + LangSmith in Production
Most teams ship RAG pipelines and never know if they're actually working. RAGAS gives you automated metrics — faithfulness, answer relevancy, context precision. LangSmith gives you tracing and regression testing. Here's how to wire both together.
Here's the RAG pipeline problem nobody talks about: you can't tell if it's working.
Your users say "the AI gave me wrong info." You check the logs — the LLM call succeeded, tokens were returned, latency was fine. Everything looks green. But the answer was wrong because the retrieval pulled the wrong chunks, and the LLM hallucinated on top of them.
Standard monitoring can't catch this. You need LLM-specific evaluation — and that's what RAGAS and LangSmith are built for.
The Four RAGAS Metrics You Actually Need
RAGAS (Retrieval Augmented Generation Assessment) computes four key metrics using LLM-as-judge:
1. Faithfulness: Does the answer contain only claims supported by the retrieved context? Catches hallucination.
2. Answer Relevancy: Does the answer actually address the question? Catches vague or off-topic responses.
3. Context Precision: Are the retrieved chunks relevant to the question, and are the relevant ones ranked first? Catches retrieval quality issues.
4. Context Recall: Does the retrieved context cover the information in the ground truth answer? Catches cases where relevant content is missing from your vector DB or never surfaced by the retriever.
Together, they tell you whether the problem is in your retrieval (context precision/recall) or generation (faithfulness/answer relevancy). This distinction matters — the fix is completely different.
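To make two of these scores concrete, here is a simplified sketch of how faithfulness and context precision can be computed once each claim or chunk has a verdict. In RAGAS the verdicts come from an LLM judge; here they are hand-labeled booleans, and the function names are hypothetical. It illustrates the published definitions and is not the library's implementation.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant: list[bool]) -> float:
    """Rank-aware precision: average of precision@k over the relevant positions."""
    hits = 0
    running = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            running += hits / k  # precision@k, counted only where chunk k is relevant
    return running / hits if hits else 0.0


# Hand-labeled verdicts for one hypothetical question.
claims = [True, True, False, True]            # 3 of 4 answer claims are grounded
chunks_good_rank = [True, True, False, False]  # relevant chunks ranked first
chunks_bad_rank = [False, False, True, True]   # same chunks, ranked last

print(faithfulness_score(claims))                  # 0.75
print(context_precision_score(chunks_good_rank))   # 1.0
print(context_precision_score(chunks_bad_rank))    # about 0.417
```

The last two lines show why ordering matters: the same two relevant chunks score 1.0 when ranked first and roughly 0.42 when ranked last.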
Setup
pip install ragas langsmith langchain langchain-openai langchain-community langchain-text-splitters chromadb

# .env
OPENAI_API_KEY=your_key
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=rag-evaluation
ANTHROPIC_API_KEY=your_anthropic_key # If using Claude for evaluation

Step 1: Build a Traceable RAG Pipeline
LangSmith tracing is zero-configuration if you set the env vars above. Every LangChain call automatically gets traced.
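A missing variable typically means no traces rather than a hard failure, so a small preflight check can save a confusing debugging session. The sketch below assumes the variable names from the .env example and that the variables are already exported (Python does not read a .env file on its own; a loader such as python-dotenv is one option). `missing_env_vars` and `tracing_enabled` are hypothetical helpers.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
)


def missing_env_vars(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tracing_enabled(env=None) -> bool:
    """True when LANGCHAIN_TRACING_V2 is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("LANGCHAIN_TRACING_V2", "").lower() == "true"


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    print("Tracing enabled:", tracing_enabled())
```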
# rag_pipeline.py
from pathlib import Path

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# LangSmith auto-traces everything when env vars are set
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def load_documents(path: str) -> list[Document]:
    # Imported by evaluate_ci.py. Assumed minimal loader for .md/.txt files;
    # swap in your own loader for PDFs, HTML, and so on.
    return [
        Document(page_content=p.read_text(encoding="utf-8"), metadata={"source": str(p)})
        for p in sorted(Path(path).rglob("*"))
        if p.suffix in {".md", ".txt"}
    ]


def format_docs(docs):
    # Join retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag_chain(docs: list[Document]):
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    vectorstore = Chroma.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context.
If the context doesn't contain enough information to answer, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:""")

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain, retriever

Step 2: Build an Evaluation Dataset
RAGAS needs a dataset with four columns: the questions, the pipeline's answers, the contexts retrieved for each question, and the ground truth answers.
# eval_dataset.py
from datasets import Dataset

# Build this from your actual user queries + curated correct answers
eval_questions = [
    "What are the resource limits for the payment-service?",
    "How do I enable mTLS in Istio?",
    "What is the RTO for the database?",
    "Which team owns the auth-service?",
    "What ports does the API gateway expose?",
]

ground_truths = [
    "The payment-service has CPU limits of 500m and memory limits of 512Mi.",
    "Enable mTLS in Istio by setting PeerAuthentication mode to STRICT in the namespace.",
    "The database has an RTO of 4 hours and RPO of 1 hour per the SLA document.",
    "The auth-service is owned by the Identity Platform team (Slack: #team-identity).",
    "The API gateway exposes ports 80 (HTTP), 443 (HTTPS), and 8443 (internal mTLS).",
]


def create_eval_dataset(chain, retriever) -> Dataset:
    answers = []
    contexts = []
    for question in eval_questions:
        # Get retrieved documents (invoke() replaces the deprecated get_relevant_documents())
        retrieved_docs = retriever.invoke(question)
        context_texts = [doc.page_content for doc in retrieved_docs]
        # Get the answer
        answer = chain.invoke(question)
        answers.append(answer)
        contexts.append(context_texts)

    return Dataset.from_dict({
        "question": eval_questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

Step 3: Run RAGAS Evaluation
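Every RAGAS run spends judge-model tokens, so it is worth checking that the dataset from Step 2 is well-formed before evaluating it. The following is a minimal sketch, assuming the four column names used above; `validate_eval_columns` is a hypothetical helper, not part of RAGAS.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_eval_columns(columns: dict[str, list]) -> list[str]:
    """Return human-readable problems; an empty list means the data looks sane."""
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in columns]
    if problems:
        return problems

    # All columns must describe the same number of questions.
    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        return [f"column lengths differ: {lengths}"]

    # Empty answers or empty context lists usually point to a pipeline failure.
    for i, (answer, ctx) in enumerate(zip(columns["answer"], columns["contexts"])):
        if not str(answer).strip():
            problems.append(f"row {i}: empty answer")
        if not ctx:
            problems.append(f"row {i}: no retrieved contexts")
    return problems


sample = {
    "question": ["What is the RTO for the database?"],
    "answer": ["The RTO is 4 hours."],
    "contexts": [["The database has an RTO of 4 hours."]],
    "ground_truth": ["The database has an RTO of 4 hours and RPO of 1 hour."],
}
print(validate_eval_columns(sample))  # []
```

With a Hugging Face `Dataset`, the same check can be run on `dataset.to_dict()`, which returns the columns as plain lists.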
# evaluate.py
from datetime import datetime

from langchain_openai import ChatOpenAI
from langsmith import Client
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

from eval_dataset import create_eval_dataset, eval_questions

langsmith_client = Client()

METRIC_NAMES = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]


def run_evaluation(chain, retriever, run_name: str | None = None):
    if not run_name:
        run_name = f"eval-{datetime.now().strftime('%Y%m%d-%H%M')}"

    print("Building evaluation dataset...")
    dataset = create_eval_dataset(chain, retriever)

    print("Running RAGAS evaluation...")
    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
        llm=ChatOpenAI(model="gpt-4o"),  # Use a capable model for evaluation
    )

    # Average the per-question scores; to_pandas() has one column per metric
    df = result.to_pandas()
    scores = {name: float(df[name].mean()) for name in METRIC_NAMES}
    scores["overall"] = sum(scores[name] for name in METRIC_NAMES) / len(METRIC_NAMES)

    # Log the aggregate scores to LangSmith as a standalone run
    langsmith_client.create_run(
        name=run_name,
        run_type="chain",
        inputs={"evaluation_questions": eval_questions},
        outputs=scores,
        tags=["ragas-evaluation", "automated"],
        extra={"ragas_full_results": df.to_dict(orient="records")},
    )
    return scores


def print_report(scores: dict):
    print("\n" + "=" * 50)
    print("RAG EVALUATION REPORT")
    print("=" * 50)
    print(f"Faithfulness:      {scores['faithfulness']:.3f} {'✓' if scores['faithfulness'] > 0.8 else '✗'}")
    print(f"Answer Relevancy:  {scores['answer_relevancy']:.3f} {'✓' if scores['answer_relevancy'] > 0.8 else '✗'}")
    print(f"Context Precision: {scores['context_precision']:.3f} {'✓' if scores['context_precision'] > 0.7 else '✗'}")
    print(f"Context Recall:    {scores['context_recall']:.3f} {'✓' if scores['context_recall'] > 0.7 else '✗'}")
    print(f"Overall:           {scores['overall']:.3f}")
    print("=" * 50)

    # Diagnosis
    if scores['faithfulness'] < 0.8:
        print("⚠️ Low faithfulness: the LLM is hallucinating beyond the retrieved context")
        print("   Fix: Tighten the system prompt, reduce LLM temperature, add stricter instructions")
    if scores['context_precision'] < 0.7:
        print("⚠️ Low context precision: the retriever is pulling irrelevant chunks")
        print("   Fix: Better chunking strategy, metadata filtering, or a reranker")
    if scores['context_recall'] < 0.7:
        print("⚠️ Low context recall: relevant information is missing from the retrieved context")
        print("   Fix: Increase retrieval k, check if documents are indexed correctly, improve embeddings")
    if scores['answer_relevancy'] < 0.8:
        print("⚠️ Low answer relevancy: answers are vague or off-topic")
        print("   Fix: Better prompt engineering, check whether the questions are covered by your indexed documents")

Step 4: Gate Deployments on Eval Scores in CI/CD
The point of evaluation is to prevent regressions. Add it to your deployment pipeline:
# .github/workflows/rag-eval.yml
name: RAG Pipeline Evaluation
on:
  push:
    branches: [main]
    paths:
      - 'rag/**'
      - 'documents/**'
      - 'prompts/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ragas langsmith langchain langchain-openai langchain-community langchain-text-splitters chromadb
      - name: Run RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
          LANGCHAIN_TRACING_V2: "true"
          LANGCHAIN_PROJECT: "rag-evaluation-ci"
        run: |
          python evaluate_ci.py

# evaluate_ci.py
import os
import sys

from evaluate import run_evaluation, print_report
from rag_pipeline import build_rag_chain, load_documents

docs = load_documents("documents/")
chain, retriever = build_rag_chain(docs)

# Name the run after the commit so scores can be traced back to a change
scores = run_evaluation(chain, retriever, run_name=f"ci-{os.getenv('GITHUB_SHA', 'local')[:8]}")
print_report(scores)

# Quality gates
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.70,
}

failures = [
    metric for metric, threshold in THRESHOLDS.items()
    if scores[metric] < threshold
]

if failures:
    print(f"\nDEPLOYMENT BLOCKED: Metrics below threshold: {', '.join(failures)}")
    sys.exit(1)

print("\nAll quality gates passed. Deployment approved.")
sys.exit(0)

Reading Results in LangSmith
Once you have traces flowing, LangSmith gives you:
- Trace explorer — every RAG call as a tree: retrieval → prompt → LLM → output
- Latency breakdown — is your bottleneck in embedding, vector search, or LLM?
- Dataset runs — compare evaluation scores across deploys
- Annotation queues — route low-confidence answers to humans for review
The most useful view: Comparison mode. Evaluate your RAG pipeline before and after a change to your prompt or chunking strategy, and see the metrics side by side.
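The same before/after comparison can be done locally on two score dictionaries, for example in a pull request check. This is a minimal sketch: `compare_scores` and the `tolerance` value are assumptions, and the numbers are illustrative. LLM-judged scores vary a little between runs, so small deltas are treated as noise.

```python
def compare_scores(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, dict]:
    """Label each shared metric as improved, regressed, or unchanged."""
    report = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        if delta < -tolerance:
            status = "regressed"
        elif delta > tolerance:
            status = "improved"
        else:
            status = "unchanged"
        report[metric] = {"delta": round(delta, 3), "status": status}
    return report


# Illustrative scores for two runs of the same evaluation set.
before = {"faithfulness": 0.80, "context_precision": 0.70, "context_recall": 0.70}
after = {"faithfulness": 0.74, "context_precision": 0.81, "context_recall": 0.71}

for metric, row in compare_scores(before, after).items():
    print(f"{metric:<20} {row['delta']:+.3f}  {row['status']}")
```

Here faithfulness is flagged as regressed, context precision as improved, and context recall as unchanged.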
What Good Scores Look Like
In production systems I've evaluated:
| Stage | Faithfulness | Answer Relevancy | Context Precision | Context Recall |
|---|---|---|---|---|
| First deploy | 0.61 | 0.72 | 0.58 | 0.64 |
| After reranker | 0.74 | 0.78 | 0.81 | 0.71 |
| After prompt tuning | 0.88 | 0.85 | 0.81 | 0.71 |
The biggest single improvement was usually adding a cross-encoder reranker after initial vector retrieval. It dramatically improves context precision, which cascades into better faithfulness scores.
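The reranking step itself is small. The sketch below shows the shape of it: retrieve more candidates than you need, score each (query, chunk) pair, and keep the best few. `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a toy stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence

# A scorer takes (query, chunk) pairs and returns one relevance score per pair.
ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, docs: list[str], score_fn: ScoreFn, top_n: int = 4) -> list[str]:
    """Re-order candidate chunks by relevance score and keep the best top_n."""
    if not docs:
        return []
    pairs = [(query, doc) for doc in docs]
    scores = score_fn(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def overlap_score(pairs: list[tuple[str, str]]) -> list[int]:
    """Toy stand-in for a cross-encoder: count shared lowercase words."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


candidates = [
    "The auth-service is owned by the Identity Platform team.",
    "The payment-service has CPU limits of 500m.",
    "Memory limits for the payment-service are 512Mi.",
]
top = rerank("resource limits for the payment-service", candidates, overlap_score, top_n=2)
print(top)
```

In a real pipeline the scorer would be an actual cross-encoder. For example, the `predict` method of a sentence-transformers `CrossEncoder` takes a list of (query, passage) pairs and returns one score per pair, so it should fit the `score_fn` slot directly.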
The second biggest: explicit "cite your sources" instructions in the system prompt, which forces the model to stay grounded in retrieved context.
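One way to implement citation instructions is to number the retrieved chunks and ask for bracketed references. The prompt wording below is an assumption for illustration, not the exact prompt behind the scores above, and `format_docs_with_ids` and `build_cited_prompt` are hypothetical helpers.

```python
def format_docs_with_ids(chunks: list[str]) -> str:
    """Number each chunk so the model can cite it as [1], [2], and so on."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


CITED_PROMPT = """Answer the question using only the numbered sources below.
After every factual statement, cite the supporting source in brackets, e.g. [2].
If the sources do not contain the answer, say "I don't have enough information."

Sources:
{context}

Question: {question}
Answer:"""


def build_cited_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with numbered sources and the user question."""
    return CITED_PROMPT.format(context=format_docs_with_ids(chunks), question=question)


print(build_cited_prompt(
    "What is the RTO for the database?",
    ["The database has an RTO of 4 hours.", "The RPO is 1 hour per the SLA document."],
))
```

A side benefit of numbered sources is that uncited sentences in an answer are easy to spot during manual review.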
The Key Insight
Most teams optimize the LLM (model choice, temperature, prompt) and ignore the retrieval. But RAGAS scores consistently show the retrieval is the bottleneck — context precision below 0.7 means you're feeding the LLM garbage, and the LLM can't generate good answers from garbage regardless of how good it is.
Fix retrieval first. Evaluate second. Optimize the LLM last.