
Build an AI-Powered CI/CD Pipeline Failure Analyzer with LangChain

Build a tool that automatically reads CI/CD failure logs, uses LangChain + Claude to diagnose the root cause, and posts a clear explanation with fix suggestions to your PR.

DevOpsBoys · May 30, 2026 · 5 min read

Pipeline failures are cryptic. A 500-line log file for a failed npm install. A Kubernetes deployment timeout with no clear cause. Junior engineers spending 2 hours debugging what an experienced engineer would spot in 5 minutes.

This tool reads your CI logs and tells you exactly what went wrong — and how to fix it.


What We're Building

GitHub Actions failure
    → Fetch logs via GitHub API
    → LangChain + Claude analyzes failure
    → Posts diagnosis as PR comment:

"Build failed because Node.js 16 is EOL and actions/setup-node@v3 
no longer supports it. Fix: change node-version to '20' in your 
workflow file."

Architecture

GitHub Actions (on workflow_run failure)
    → Python script (run inside the workflow job)
    → GitHub API (fetch logs)
    → Log processor (extract the lines around errors from long logs)
    → Claude API (diagnose)
    → GitHub API (post PR comment)

Setup

bash
pip install langchain langchain-anthropic anthropic PyGithub python-dotenv requests
bash
# .env
ANTHROPIC_API_KEY=sk-ant-...
GITHUB_TOKEN=ghp_...  # needs repo and workflow scopes

Step 1: Fetch CI Logs

python
# github_client.py
import os

import requests
from github import Github
 
g = Github(os.getenv("GITHUB_TOKEN"))
 
def get_failed_job_logs(repo_name: str, run_id: int) -> list[dict]:
    """Get logs from all failed jobs in a workflow run."""
    repo = g.get_repo(repo_name)
    run = repo.get_workflow_run(run_id)
    
    failed_jobs = []
    
    for job in run.jobs():
        if job.conclusion == "failure":
            # PyGithub resolves the short-lived download URL for this job's logs
            logs_url = job.logs_url()
            
            # Download the plain-text log. The redirect URL is assumed to be
            # pre-signed, so the GitHub token is not forwarded to the storage host.
            response = requests.get(logs_url, timeout=30)
            response.raise_for_status()
            
            failed_jobs.append({
                "job_name": job.name,
                "logs": response.text,
                "failed_steps": [
                    step.name for step in job.steps 
                    if step.conclusion == "failure"
                ]
            })
    
    return failed_jobs
 
def post_pr_comment(repo_name: str, pr_number: int, comment: str):
    """Post analysis as PR comment."""
    repo = g.get_repo(repo_name)
    pr = repo.get_pull(pr_number)
    pr.create_issue_comment(comment)
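
Raw job logs tend to carry a timestamp at the start of every line, which costs tokens without helping the diagnosis. The sketch below strips that prefix before the logs go any further. It assumes each line begins with an ISO-8601 UTC timestamp, and `strip_timestamps` is a hypothetical helper, not part of PyGithub or the original tool.

```python
import re

# Assumption: log lines look like "2026-05-30T10:15:42.1234567Z message".
# Lines that do not match the pattern are left untouched.
_TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z ?")

def strip_timestamps(logs: str) -> str:
    """Remove the leading timestamp from every log line (hypothetical helper)."""
    return "\n".join(_TIMESTAMP_PREFIX.sub("", line) for line in logs.splitlines())

sample = (
    "2026-05-30T10:15:42.1234567Z npm ERR! code EUSAGE\n"
    "2026-05-30T10:15:42.2345678Z npm ERR! lock file out of sync\n"
    "plain line without a timestamp"
)
print(strip_timestamps(sample))
```

One way to use it would be to wrap `response.text` with this helper before storing it under the `"logs"` key.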

Step 2: LangChain Log Processor

CI logs can run to 10,000+ lines. Rather than sending everything to the model, we keep only the lines around error markers:

python
# log_processor.py
import os

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
 
llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=2000
)
 
def extract_relevant_log_section(logs: str, failed_steps: list[str]) -> str:
    """Extract up to 200 of the most relevant lines around failures."""
    # failed_steps is accepted for a consistent signature but not used for filtering
    lines = logs.split('\n')
    
    # Find lines with error keywords
    error_keywords = ['ERROR', 'Error', 'FAILED', 'Failed', 'error:', 'fatal:', 
                      'npm ERR!', 'FAIL', 'Exception', 'Traceback']
    
    # Track line indices rather than line text, so overlapping windows merge
    # cleanly and repeated identical lines are not dropped
    relevant_indices = set()
    for i, line in enumerate(lines):
        if any(kw in line for kw in error_keywords):
            # Grab 10 lines before and after each error (slice end is exclusive)
            start = max(0, i - 10)
            end = min(len(lines), i + 11)
            relevant_indices.update(range(start, end))
    
    # No keyword matched: fall back to the tail of the log
    if not relevant_indices:
        return '\n'.join(lines[-200:])
    
    selected = [lines[i] for i in sorted(relevant_indices)]
    return '\n'.join(selected[:200])  # Max 200 lines
 
def analyze_failure(job_name: str, logs: str, failed_steps: list[str]) -> str:
    """Use Claude to analyze the failure."""
    
    # Extract relevant section to stay within context
    relevant_logs = extract_relevant_log_section(logs, failed_steps)
    
    system_prompt = """You are a CI/CD expert helping developers fix pipeline failures.
    
Analyze the provided CI/CD logs and:
1. Identify the ROOT CAUSE (not symptoms) in 1-2 sentences
2. List the exact fix (specific file changes, commands, or config updates)
3. Explain why this happened
4. Add any prevention tips
 
Be specific and actionable. Reference exact line numbers or error messages from the logs.
Format your response in Markdown for a GitHub PR comment."""
 
    prompt = (
        f"**Failed Job:** {job_name}\n"
        f"**Failed Steps:** {', '.join(failed_steps)}\n\n"
        f"**Relevant Log Output:**\n{relevant_logs}\n\n"
        "Diagnose this CI/CD failure and provide a fix."
    )
    
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=prompt)
    ]
    
    response = llm.invoke(messages)
    return response.content
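
The processor above caps the excerpt at 200 lines. If that cap cuts off context you need, one alternative is to split the log into overlapping line-based chunks and analyze each chunk separately. The following is a minimal sketch in plain Python, not a LangChain text splitter; `chunk_lines` and its default sizes are illustrative assumptions.

```python
def chunk_lines(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size lines.

    Consecutive chunks share `overlap` lines so that an error and the
    lines leading up to it are less likely to be separated.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    lines = text.splitlines()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the log
        if start + chunk_size >= len(lines):
            break
    return chunks

log = "\n".join(f"line {i}" for i in range(10))
for chunk in chunk_lines(log, chunk_size=4, overlap=1):
    print(repr(chunk))
```

Each chunk could then be passed to `analyze_failure` on its own, at the cost of one model call per chunk.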

Step 3: Format PR Comment

python
# comment_formatter.py
 
def format_pr_comment(analyses: list[dict]) -> str:
    """Format all job analyses into a PR comment."""
    
    comment = "## 🤖 CI Failure Analysis\n\n"
    comment += "> Automatically analyzed by DevOps AI Assistant\n\n"
    
    for analysis in analyses:
        comment += f"### ❌ {analysis['job_name']}\n\n"
        comment += analysis['diagnosis'] + "\n\n"
        comment += "---\n\n"
    
    comment += "*Analysis powered by Claude AI. Always verify before applying fixes.*"
    
    return comment
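
When several jobs fail at once, the combined comment can grow long. The sketch below trims it before posting. It assumes a comment body limit of 65,536 characters, which you should verify for your setup; `truncate_comment` and the notice text are hypothetical and not part of the original tool.

```python
MAX_COMMENT_CHARS = 65536  # assumed comment body limit; verify before relying on it
TRUNCATION_NOTICE = "\n\n*Analysis truncated to fit the comment size limit.*"

def truncate_comment(comment: str, limit: int = MAX_COMMENT_CHARS) -> str:
    """Return the comment unchanged if it fits, else cut it and append a notice."""
    if limit <= len(TRUNCATION_NOTICE):
        raise ValueError("limit is too small to hold the truncation notice")
    if len(comment) <= limit:
        return comment
    keep = limit - len(TRUNCATION_NOTICE)
    return comment[:keep] + TRUNCATION_NOTICE

long_comment = "## CI Failure Analysis\n\n" + "x" * 100_000
print(len(truncate_comment(long_comment)))
```

A natural place to call it would be on the return value of `format_pr_comment`, just before the comment is posted.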

Step 4: GitHub Actions Trigger

yaml
# .github/workflows/analyze-failures.yml
name: Analyze CI Failures
 
on:
  workflow_run:
    workflows: ["CI", "Build and Test", "Deploy"]
    types: [completed]
 
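jobs:
  analyze:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    # The default GITHUB_TOKEN may be read-only, so grant what the script needs
    permissions:
      actions: read
      contents: read
      pull-requests: write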
    
    steps:
      - uses: actions/checkout@v4
      
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - run: pip install langchain langchain-anthropic PyGithub python-dotenv requests
      
      - name: Analyze failure and comment
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_NAME: ${{ github.repository }}
          RUN_ID: ${{ github.event.workflow_run.id }}
          PR_NUMBER: ${{ github.event.workflow_run.pull_requests[0].number }}
        run: python analyze.py

Step 5: Main Script

python
# analyze.py
import os
from dotenv import load_dotenv
from github_client import get_failed_job_logs, post_pr_comment
from log_processor import analyze_failure
from comment_formatter import format_pr_comment
 
load_dotenv()
 
def main():
    repo_name = os.getenv("REPO_NAME")
    run_id = int(os.getenv("RUN_ID"))
    pr_number = os.getenv("PR_NUMBER")
    
    if not pr_number:
        print("No PR associated with this run, skipping comment")
        return
    
    print(f"Fetching logs for run {run_id}...")
    failed_jobs = get_failed_job_logs(repo_name, run_id)
    
    if not failed_jobs:
        print("No failed jobs found")
        return
    
    analyses = []
    for job in failed_jobs:
        print(f"Analyzing: {job['job_name']}")
        diagnosis = analyze_failure(
            job['job_name'],
            job['logs'],
            job['failed_steps']
        )
        analyses.append({
            "job_name": job['job_name'],
            "diagnosis": diagnosis
        })
    
    comment = format_pr_comment(analyses)
    post_pr_comment(repo_name, int(pr_number), comment)
    print("Comment posted successfully")
 
if __name__ == "__main__":
    main()
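
The main script reads `RUN_ID` and `PR_NUMBER` straight from the environment. `int(None)` raises a `TypeError`, so a missing `RUN_ID` (for example, when running the script locally) would crash before any useful message is printed, and `PR_NUMBER` is likely to arrive as an empty string when the run has no associated pull request. A small parsing helper makes both cases explicit. `parse_optional_int` is a hypothetical name used only for this sketch.

```python
import os
from typing import Optional

def parse_optional_int(value: Optional[str]) -> Optional[int]:
    """Convert an environment value to int, or return None if unset or invalid."""
    if value is None:
        return None
    value = value.strip()
    if not value:
        return None
    try:
        return int(value)
    except ValueError:
        return None

run_id = parse_optional_int(os.getenv("RUN_ID"))
pr_number = parse_optional_int(os.getenv("PR_NUMBER"))

if run_id is None:
    print("RUN_ID is missing or not a number, nothing to analyze")
elif pr_number is None:
    print("No PR associated with this run, skipping comment")
else:
    print(f"Would analyze run {run_id} and comment on PR #{pr_number}")
```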

Example Output on a Real PR

markdown
## 🤖 CI Failure Analysis
 
> Automatically analyzed by DevOps AI Assistant
 
### ❌ Build and Test
 
**Root Cause:** The `npm ci` command failed because `package-lock.json` 
is out of sync with `package.json`. A new dependency was added directly 
to `package.json` without running `npm install` to update the lock file.
 
**Fix:** Run locally and commit the updated lock file:
```bash
npm install
git add package-lock.json
git commit -m "fix: update package-lock.json"
```

**Why it happened:** `npm ci` requires exact lock file consistency and fails if there's any mismatch. This is by design for reproducible builds.

**Prevention:** Add a CI check that verifies lock file consistency:

yaml
- run: npm ci --dry-run

The tool turns an 847-line log file into a 5-line fix. Engineers spend their time fixing, not reading logs.

> Get your [Anthropic API key](https://console.anthropic.com/) to start building. Estimated token cost per analysis: ~500–1000 input tokens, or roughly $0.001–0.003 per failure analysis. Longer log excerpts and the model's response add to that figure.
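
To sanity-check the cost figure for your own logs, the arithmetic can be sketched as follows. The per-million-token prices in the example call are placeholders chosen for illustration, not quoted rates; look up current pricing for the model you use. `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of one analysis from token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000

# Placeholder prices (USD per million tokens), for illustration only
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

for input_tokens, output_tokens in [(500, 0), (1000, 0), (1000, 500)]:
    cost = estimate_cost_usd(input_tokens, output_tokens, INPUT_PRICE, OUTPUT_PRICE)
    print(f"{input_tokens} in / {output_tokens} out: ${cost:.4f}")
```

Output tokens are typically priced higher than input tokens, so a long diagnosis can dominate the total even when the log excerpt is short.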