RAG Pipeline Evaluation with RAGAS + LangSmith in Production

Most teams ship RAG pipelines and never know if they're actually working. RAGAS gives you automated metrics — faithfulness, answer relevancy, context precision. LangSmith gives you tracing and regression testing. Here's how to wire both together.

DevOpsBoys · Jun 13, 2026 · 6 min read

Here's the RAG pipeline problem nobody talks about: you can't tell if it's working.

Your users say "the AI gave me wrong info." You check the logs — the LLM call succeeded, tokens were returned, latency was fine. Everything looks green. But the answer was wrong because the retrieval pulled the wrong chunks, and the LLM hallucinated on top of them.

Standard monitoring can't catch this. You need LLM-specific evaluation — and that's what RAGAS and LangSmith are built for.

The Four RAGAS Metrics You Actually Need

RAGAS (Retrieval Augmented Generation Assessment) computes four key metrics using LLM-as-judge:

1. Faithfulness — Does the answer contain only information from the retrieved context? Catches hallucination.

2. Answer Relevancy — Does the answer actually address the question? Catches vague or off-topic responses.

3. Context Precision — Are the retrieved chunks relevant to the question? Catches retrieval quality issues.

4. Context Recall — Does the retrieved context contain the information needed to answer? Catches when your vector DB is missing relevant content.

Together, they tell you whether the problem is in your retrieval (context precision/recall) or generation (faithfulness/answer relevancy). This distinction matters — the fix is completely different.
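
To make faithfulness concrete: RAGAS defines it as the fraction of claims in the answer that the retrieved context supports. The sketch below assumes the claims have already been extracted and judged (in RAGAS an LLM does both steps). `ClaimVerdict` and `faithfulness_score` are illustrative names, not part of the RAGAS API, and the verdicts are hand-labeled.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One statement extracted from the answer, plus the judge's verdict."""
    claim: str
    supported: bool  # True if the retrieved context backs this claim


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not verdicts:
        return 0.0  # assumption for this sketch: no claims scores 0
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Hand-labeled verdicts for illustration; RAGAS gets these from the judge LLM.
verdicts = [
    ClaimVerdict("The payment-service CPU limit is 500m.", True),
    ClaimVerdict("The payment-service memory limit is 512Mi.", True),
    ClaimVerdict("The payment-service runs 3 replicas.", False),  # not in context
]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

One unsupported claim out of three already pulls the score down to about 0.67, which is why a single invented detail per answer is enough to fail a 0.8 threshold.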

Setup

bash
pip install ragas langsmith langchain langchain-openai langchain-community chromadb

bash
# .env
OPENAI_API_KEY=your_key
LANGCHAIN_API_KEY=your_langsmith_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=rag-evaluation
ANTHROPIC_API_KEY=your_anthropic_key  # If using Claude for evaluation
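
Python does not read a `.env` file on its own, so the variables have to be exported in the shell or loaded by a tool such as python-dotenv. A small pre-flight check along these lines can fail fast when something is missing. This is a sketch: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and the list assumes the OpenAI-based setup above.

```python
import os
from collections.abc import Mapping

# Variables the OpenAI + LangSmith setup above relies on (assumed list).
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
print("Missing:", ", ".join(missing) if missing else "none")
```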

Step 1: Build a Traceable RAG Pipeline

LangSmith tracing needs no code changes once the env vars above are set. Every LangChain call is traced automatically.

python
# rag_pipeline.py
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
 
# LangSmith auto-traces everything when env vars are set
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 
def build_rag_chain(docs: list[Document]):
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    
    vectorstore = Chroma.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context. 
If the context doesn't contain enough information to answer, say "I don't have enough information."
 
Context:
{context}
 
Question: {question}
 
Answer:""")
    
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    
    return chain, retriever
 
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
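
The CI script later in this article imports a `load_documents` helper from `rag_pipeline`, but the module above does not define one. A minimal sketch of what it could look like follows, assuming the knowledge base is a folder of plain-text or Markdown files. The function body, the `suffixes` parameter, and the `source` metadata key are assumptions, not the author's implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path

try:
    from langchain_core.documents import Document
except ImportError:
    # Fallback so this sketch also runs without LangChain installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def load_documents(
    directory: str, suffixes: tuple[str, ...] = (".md", ".txt")
) -> list[Document]:
    """Read every matching text file under `directory` into a Document."""
    docs = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8")
            # Keep the file path so a retrieved chunk can be traced to its source.
            docs.append(Document(page_content=text, metadata={"source": str(path)}))
    return docs
```

Sorting the paths keeps the indexing order stable between runs, which makes evaluation scores easier to compare.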

Step 2: Build an Evaluation Dataset

RAGAS needs a dataset with four columns: the questions, the answers your pipeline generated, the contexts it retrieved, and ground-truth answers to compare against.

python
# eval_dataset.py
from datasets import Dataset
 
# Build this from your actual user queries + curated correct answers
eval_questions = [
    "What are the resource limits for the payment-service?",
    "How do I enable mTLS in Istio?",
    "What is the RTO for the database?",
    "Which team owns the auth-service?",
    "What ports does the API gateway expose?",
]
 
ground_truths = [
    "The payment-service has CPU limits of 500m and memory limits of 512Mi.",
    "Enable mTLS in Istio by setting PeerAuthentication mode to STRICT in the namespace.",
    "The database has an RTO of 4 hours and RPO of 1 hour per the SLA document.",
    "The auth-service is owned by the Identity Platform team (Slack: #team-identity).",
    "The API gateway exposes ports 80 (HTTP), 443 (HTTPS), and 8443 (internal mTLS).",
]
 
def create_eval_dataset(chain, retriever) -> Dataset:
    answers = []
    contexts = []
    
    for question in eval_questions:
        # Get retrieved documents (retrievers are Runnables, so invoke() works)
        retrieved_docs = retriever.invoke(question)
        context_texts = [doc.page_content for doc in retrieved_docs]
        
        # Get the answer
        answer = chain.invoke(question)
        
        answers.append(answer)
        contexts.append(context_texts)
    
    return Dataset.from_dict({
        "question": eval_questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
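
Because the four columns are built from separate lists, a mismatch (for example, a question added without its ground truth) is easy to introduce. The check below is a hedged sketch that could run on the dictionary before it is passed to `Dataset.from_dict`. `validate_eval_columns` is a hypothetical helper, not part of RAGAS or `datasets`.

```python
def validate_eval_columns(data: dict[str, list]) -> list[str]:
    """Return a list of problems; an empty list means the columns look usable."""
    required = ("question", "answer", "contexts", "ground_truth")
    problems = [f"missing column: {name}" for name in required if name not in data]
    if problems:
        return problems

    # Every column must describe the same number of evaluation rows.
    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each row's contexts must be a non-empty list of chunk strings.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
        elif not ctx:
            problems.append(f"row {i}: no contexts retrieved")
    return problems


sample = {
    "question": ["What is the RTO for the database?"],
    "answer": ["The RTO is 4 hours."],
    "contexts": [["The database has an RTO of 4 hours."]],
    "ground_truth": ["The database has an RTO of 4 hours and RPO of 1 hour."],
}
print(validate_eval_columns(sample) or "dataset columns look consistent")
```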

Step 3: Run RAGAS Evaluation

python
# evaluate.py
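from datetime import datetime

from langchain_openai import ChatOpenAI
from langsmith import Client
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Dataset builder and question list from Step 2
from eval_dataset import create_eval_dataset, eval_questions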
 
langsmith_client = Client()
 
def run_evaluation(chain, retriever, run_name: str = None):
    if not run_name:
        run_name = f"eval-{datetime.now().strftime('%Y%m%d-%H%M')}"
    
    print(f"Building evaluation dataset...")
    dataset = create_eval_dataset(chain, retriever)
    
    print(f"Running RAGAS evaluation...")
    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
        llm=ChatOpenAI(model="gpt-4o"),  # Use a capable model for evaluation
    )
    
    scores = {
        "faithfulness": result["faithfulness"],
        "answer_relevancy": result["answer_relevancy"],
        "context_precision": result["context_precision"],
        "context_recall": result["context_recall"],
        "overall": (
            result["faithfulness"] +
            result["answer_relevancy"] +
            result["context_precision"] +
            result["context_recall"]
        ) / 4
    }
    
    # Log the aggregate scores to LangSmith as a standalone run
    langsmith_client.create_run(
        name=run_name,
        run_type="chain",
        inputs={"evaluation_questions": eval_questions},
        outputs=scores,
        tags=["ragas-evaluation", "automated"],
        extra={"ragas_full_results": result.to_pandas().to_dict()}
    )
    
    return scores
 
def print_report(scores: dict):
    print("\n" + "="*50)
    print("RAG EVALUATION REPORT")
    print("="*50)
    print(f"Faithfulness:      {scores['faithfulness']:.3f}  {'✓' if scores['faithfulness'] > 0.8 else '✗'}")
    print(f"Answer Relevancy:  {scores['answer_relevancy']:.3f}  {'✓' if scores['answer_relevancy'] > 0.8 else '✗'}")
    print(f"Context Precision: {scores['context_precision']:.3f}  {'✓' if scores['context_precision'] > 0.7 else '✗'}")
    print(f"Context Recall:    {scores['context_recall']:.3f}  {'✓' if scores['context_recall'] > 0.7 else '✗'}")
    print(f"Overall:           {scores['overall']:.3f}")
    print("="*50)
    
    # Diagnosis
    if scores['faithfulness'] < 0.8:
        print("⚠️  Low faithfulness — LLM is hallucinating beyond the retrieved context")
        print("   Fix: Tighten the system prompt, reduce LLM temperature, add stricter instructions")
    
    if scores['context_precision'] < 0.7:
        print("⚠️  Low context precision — retriever is pulling irrelevant chunks")
        print("   Fix: Better chunking strategy, metadata filtering, or a reranker")
    
    if scores['context_recall'] < 0.7:
        print("⚠️  Low context recall — relevant information is missing from retrieved context")
        print("   Fix: Increase retrieval k, check if documents are indexed correctly, improve embeddings")
    
    if scores['answer_relevancy'] < 0.8:
        print("⚠️  Low answer relevancy — answers are vague or off-topic")
        print("   Fix: Better prompt engineering, check whether the questions are covered by your indexed documents")
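
One caveat about the `scores` dictionary: depending on the RAGAS version installed, indexing the result by metric name may return a single averaged float or a list of per-question scores, and failed judge calls can surface as NaN. That behavior is an assumption worth checking against your version. A defensive helper such as the hypothetical `mean_score` below handles both shapes.

```python
import math
from collections.abc import Iterable


def mean_score(value) -> float:
    """Collapse a metric value to one float, skipping None and NaN entries."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        valid = [float(v) for v in value if v is not None]
        valid = [v for v in valid if not math.isnan(v)]
        # No valid samples means there is nothing to average.
        return sum(valid) / len(valid) if valid else float("nan")
    raise TypeError(f"Unsupported metric value: {value!r}")


print(mean_score(0.82))                      # already aggregated
print(mean_score([1.0, 0.5, float("nan")]))  # per-question scores, one failed
```

In `run_evaluation`, each `result["..."]` lookup could then be wrapped as `mean_score(result["faithfulness"])`, and so on for the other metrics.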

Step 4: CI/CD Integration — Gate Deployments on Eval Scores

The point of evaluation is to prevent regressions. Add it to your deployment pipeline:

yaml
# .github/workflows/rag-eval.yml
name: RAG Pipeline Evaluation
 
on:
  push:
    branches: [main]
    paths:
      - 'rag/**'
      - 'documents/**'
      - 'prompts/**'
 
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: pip install ragas langsmith langchain langchain-openai langchain-community chromadb
    
    - name: Run RAG Evaluation
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        LANGCHAIN_TRACING_V2: "true"
        LANGCHAIN_PROJECT: "rag-evaluation-ci"
      run: |
        python evaluate_ci.py

python
# evaluate_ci.py
import os
import sys
from evaluate import run_evaluation, print_report
from rag_pipeline import build_rag_chain, load_documents
 
docs = load_documents("documents/")
chain, retriever = build_rag_chain(docs)
 
scores = run_evaluation(chain, retriever, run_name=f"ci-{os.getenv('GITHUB_SHA', 'local')[:8]}")
print_report(scores)
 
# Quality gates
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.70,
}
 
failures = [
    metric for metric, threshold in THRESHOLDS.items()
    if scores[metric] < threshold
]
 
if failures:
    print(f"\nDEPLOYMENT BLOCKED: Metrics below threshold: {', '.join(failures)}")
    sys.exit(1)
 
print("\nAll quality gates passed. Deployment approved.")
sys.exit(0)
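
Fixed thresholds catch outright failures, but a pipeline can slide from 0.88 to 0.81 and still pass. A complementary check compares the current scores against the scores from the last accepted run. The sketch below assumes the baseline is kept in a JSON file committed to the repo. The file path, the `tolerance` default, and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_baseline(path: str) -> dict[str, float]:
    """Read baseline scores from JSON; a missing file means no baseline yet."""
    file = Path(path)
    return json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell more than `tolerance`."""
    drops = {}
    for metric, base in baseline.items():
        if metric in current:
            drop = base - current[metric]
            if drop > tolerance:
                drops[metric] = round(drop, 4)
    return drops


# Illustrative numbers only.
baseline = {"faithfulness": 0.88, "context_precision": 0.81}
current = {"faithfulness": 0.83, "context_precision": 0.80}
regressions = find_regressions(current, baseline)
print(regressions)
```

The tolerance is there because LLM-as-judge scores vary a little from run to run, so a strict comparison would block deploys on noise.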

Reading Results in LangSmith

Once you have traces flowing, LangSmith gives you:

  1. Trace explorer — every RAG call as a tree: retrieval → prompt → LLM → output
  2. Latency breakdown — is your bottleneck in embedding, vector search, or LLM?
  3. Dataset runs — compare evaluation scores across deploys
  4. Annotation queues — route low-confidence answers to humans for review

The most useful view: Comparison mode. Evaluate your RAG pipeline before and after a change to your prompt or chunking strategy, and see the metrics side by side.

What Good Scores Look Like

In production systems I've evaluated:

| Stage | Faithfulness | Answer Relevancy | Context Precision | Context Recall |
|---|---|---|---|---|
| First deploy | 0.61 | 0.72 | 0.58 | 0.64 |
| After reranker | 0.74 | 0.78 | 0.81 | 0.71 |
| After prompt tuning | 0.88 | 0.85 | 0.81 | 0.71 |

The biggest single improvement was usually adding a cross-encoder reranker after initial vector retrieval. It dramatically improves context precision, which cascades into better faithfulness scores.
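
The reranking step itself is small: over-fetch candidates from the vector store, score each (query, chunk) pair, and keep the best few. The sketch below keeps the scorer pluggable so it runs without a model download. `rerank` and `overlap_scores` are illustrative names, and the word-overlap scorer is only a stand-in for a real cross-encoder.

```python
from collections.abc import Callable, Sequence

PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, docs: list[str], score_pairs: PairScorer, top_n: int = 4) -> list[str]:
    """Score every (query, doc) pair and return the top_n docs, best first."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def overlap_scores(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: counts words shared by query and document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


candidates = [
    "istio mtls strict mode",
    "payment service limits",
    "enable mtls in istio with peerauthentication",
]
top = rerank("how do i enable mtls in istio", candidates, overlap_scores, top_n=2)
print(top)
```

With sentence-transformers installed, the `predict` method of a `CrossEncoder` model accepts the same list of (query, passage) pairs and could be passed as `score_pairs` in place of the toy scorer. In that setup you would typically raise the retriever's `k` and let `top_n` decide what reaches the prompt.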

The second biggest: explicit "cite your sources" instructions in the system prompt, which forces the model to stay grounded in retrieved context.
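
A hedged sketch of what such a prompt could look like: number the retrieved chunks so the model has something concrete to cite. The wording of `CITED_PROMPT` and the helper names are assumptions for illustration, not the exact prompt behind the numbers above.

```python
def format_docs_with_ids(chunks: list[str]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


CITED_PROMPT = """Answer the question using only the numbered context below.
After every factual statement, cite the supporting chunk like [1].
If no chunk supports a statement, leave it out.
If the context doesn't contain enough information to answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""


def build_cited_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with numbered chunks and the user question."""
    return CITED_PROMPT.format(
        context=format_docs_with_ids(chunks), question=question
    )


prompt_text = build_cited_prompt(
    "What is the RTO for the database?",
    ["The database has an RTO of 4 hours.", "Backups run nightly."],
)
print(prompt_text)
```

In the chain from Step 1, `format_docs_with_ids` would replace `format_docs`, applied to each document's `page_content`.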

The Key Insight

Most teams optimize the LLM (model choice, temperature, prompt) and ignore the retrieval. But RAGAS scores consistently show the retrieval is the bottleneck — context precision below 0.7 means you're feeding the LLM garbage, and the LLM can't generate good answers from garbage regardless of how good it is.

Fix retrieval first. Evaluate second. Optimize the LLM last.
