Build an AI-Powered CI/CD Pipeline Failure Analyzer with LangChain
Build a tool that automatically reads CI/CD failure logs, uses LangChain + Claude to diagnose the root cause, and posts a clear explanation with fix suggestions to your PR.
Pipeline failures are cryptic. A 500-line log file for a failed npm install. A Kubernetes deployment timeout with no clear cause. Junior engineers spending 2 hours debugging what an experienced engineer would spot in 5 minutes.
This tool reads your CI logs and tells you exactly what went wrong — and how to fix it.
What We're Building
GitHub Actions failure
  → Fetch logs via GitHub API
  → LangChain + Claude analyzes failure
  → Posts diagnosis as PR comment:
      "Build failed because Node.js 16 is EOL and actions/setup-node@v3
      no longer supports it. Fix: change node-version to '20' in your
      workflow file."
Architecture
GitHub Actions (on workflow_run failure)
  → Python Lambda/Function
    → GitHub API (fetch logs)
    → LangChain (process + chunk long logs)
    → Claude API (diagnose)
    → GitHub API (post PR comment)
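The "chunk long logs" stage in the diagram can be illustrated in plain Python. This is a minimal sketch, not the article's implementation: `chunk_lines` and its `chunk_size` and `overlap` parameters are hypothetical names, and the idea is simply to split a log into line windows that overlap slightly so an error and its context are less likely to be cut apart.

```python
def chunk_lines(log_text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split a log into chunks of at most `chunk_size` lines.

    Consecutive chunks share `overlap` lines (hypothetical defaults).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    lines = log_text.splitlines()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + chunk_size]))
        if start + chunk_size >= len(lines):
            break  # this window already reached the end of the log
    return chunks
```

Each chunk could then be analyzed separately, or only the chunks containing error keywords could be sent to the model.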
Setup
pip install langchain langchain-anthropic anthropic PyGithub python-dotenv requests

# .env
ANTHROPIC_API_KEY=sk-ant-...
GITHUB_TOKEN=ghp_... # needs repo, workflow permissions

Step 1: Fetch CI Logs
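Raw GitHub Actions job logs prefix every line with a UTC timestamp, which adds tokens without helping the diagnosis. Below is a minimal sketch of stripping those prefixes once the logs are downloaded. `strip_timestamps` is a hypothetical helper, not part of the article's code, and the sample log line in the usage note is made up for illustration.

```python
import re

# Matches a leading ISO 8601 UTC timestamp such as "2024-05-01T12:34:56.1234567Z "
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s")


def strip_timestamps(raw_logs: str) -> str:
    """Remove the leading timestamp from each log line, leaving other lines untouched."""
    return "\n".join(_TIMESTAMP.sub("", line) for line in raw_logs.splitlines())
```

For example, `strip_timestamps("2024-05-01T12:34:56.1234567Z npm ERR! code EUSAGE")` leaves just the `npm ERR!` text.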
# github_client.py
import os

import requests
from github import Github

g = Github(os.getenv("GITHUB_TOKEN"))


def get_failed_job_logs(repo_name: str, run_id: int) -> list[dict]:
    """Get logs from all failed jobs in a workflow run."""
    repo = g.get_repo(repo_name)
    run = repo.get_workflow_run(run_id)
    failed_jobs = []
    for job in run.jobs():
        if job.conclusion == "failure":
            # logs_url() resolves to a short-lived, pre-signed download URL,
            # so the download itself needs no Authorization header
            logs_url = job.logs_url()
            response = requests.get(logs_url, timeout=30)
            response.raise_for_status()
            failed_jobs.append({
                "job_name": job.name,
                "logs": response.text,
                "failed_steps": [
                    step.name for step in job.steps
                    if step.conclusion == "failure"
                ]
            })
    return failed_jobs


def post_pr_comment(repo_name: str, pr_number: int, comment: str):
    """Post analysis as PR comment."""
    repo = g.get_repo(repo_name)
    pr = repo.get_pull(pr_number)
    pr.create_issue_comment(comment)

Step 2: LangChain Log Processor
CI logs can run to 10,000+ lines, too much to send to the model in full. Instead of passing everything, we pull out the lines around each error:
# log_processor.py
import os

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=2000,
)


def extract_relevant_log_section(logs: str, failed_steps: list[str]) -> str:
    """Extract up to 200 of the most relevant lines around failures."""
    lines = logs.split('\n')
    # Find lines with error keywords
    error_keywords = ['ERROR', 'Error', 'FAILED', 'Failed', 'error:', 'fatal:',
                      'npm ERR!', 'FAIL', 'Exception', 'Traceback']
    # Collect line indices rather than line text, so overlapping windows
    # merge cleanly without dropping legitimately repeated lines
    relevant_indices = set()
    for i, line in enumerate(lines):
        if any(kw in line for kw in error_keywords):
            # Grab 10 lines before and after each error
            start = max(0, i - 10)
            end = min(len(lines), i + 11)
            relevant_indices.update(range(start, end))
    # Restore original log order and cap the excerpt at 200 lines
    relevant_lines = [lines[i] for i in sorted(relevant_indices)]
    return '\n'.join(relevant_lines[:200])


def analyze_failure(job_name: str, logs: str, failed_steps: list[str]) -> str:
    """Use Claude to analyze the failure."""
    # Extract relevant section to stay within context
    relevant_logs = extract_relevant_log_section(logs, failed_steps)
    system_prompt = """You are a CI/CD expert helping developers fix pipeline failures.
Analyze the provided CI/CD logs and:
1. Identify the ROOT CAUSE (not symptoms) in 1-2 sentences
2. List the exact fix (specific file changes, commands, or config updates)
3. Explain why this happened
4. Add any prevention tips
Be specific and actionable. Reference exact line numbers or error messages from the logs.
Format your response in Markdown for a GitHub PR comment."""
    prompt = (
        f"**Failed Job:** {job_name}\n"
        f"**Failed Steps:** {', '.join(failed_steps)}\n\n"
        f"**Relevant Log Output:**\n{relevant_logs}\n\n"
        "Diagnose this CI/CD failure and provide a fix."
    )
    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=prompt)
    ]
    response = llm.invoke(messages)
    return response.content

Step 3: Format PR Comment
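One constraint to keep in mind when formatting: GitHub rejects issue and PR comment bodies longer than 65,536 characters, and a run with several failed jobs can produce a long combined comment. A minimal sketch of a guard follows. `truncate_comment` and `MAX_COMMENT_CHARS` are hypothetical names, not part of the article's code, and the truncation notice wording is only an example.

```python
MAX_COMMENT_CHARS = 65536  # GitHub's maximum length for a comment body


def truncate_comment(comment: str, limit: int = MAX_COMMENT_CHARS) -> str:
    """Return the comment unchanged if it fits, else cut it and append a notice."""
    notice = "\n\n*(Analysis truncated to fit GitHub's comment size limit.)*"
    if len(comment) <= limit:
        return comment
    keep = max(0, limit - len(notice))  # leave room for the notice itself
    return comment[:keep] + notice
```

Such a helper would be applied to the finished comment string just before it is posted.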
# comment_formatter.py
def format_pr_comment(analyses: list[dict]) -> str:
    """Format all job analyses into a PR comment."""
    comment = "## 🤖 CI Failure Analysis\n\n"
    comment += "> Automatically analyzed by DevOps AI Assistant\n\n"
    for analysis in analyses:
        comment += f"### ❌ {analysis['job_name']}\n\n"
        comment += analysis['diagnosis'] + "\n\n"
        comment += "---\n\n"
    comment += "*Analysis powered by Claude AI. Always verify before applying fixes.*"
    return comment

Step 4: GitHub Actions Trigger
# .github/workflows/analyze-failures.yml
name: Analyze CI Failures
on:
  workflow_run:
    workflows: ["CI", "Build and Test", "Deploy"]
    types: [completed]
jobs:
  analyze:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    # The job checks out code, reads run logs, and posts a PR comment,
    # so GITHUB_TOKEN needs these permissions
    permissions:
      contents: read
      actions: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install langchain langchain-anthropic PyGithub python-dotenv requests
      - name: Analyze failure and comment
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_NAME: ${{ github.repository }}
          RUN_ID: ${{ github.event.workflow_run.id }}
          PR_NUMBER: ${{ github.event.workflow_run.pull_requests[0].number }}
        run: python analyze.py

Step 5: Main Script
# analyze.py
import os

from dotenv import load_dotenv

# Load .env before importing the modules below: they read their API keys at import time
load_dotenv()

from github_client import get_failed_job_logs, post_pr_comment
from log_processor import analyze_failure
from comment_formatter import format_pr_comment


def main():
    repo_name = os.getenv("REPO_NAME")
    run_id = int(os.getenv("RUN_ID"))
    pr_number = os.getenv("PR_NUMBER")
    if not pr_number:
        print("No PR associated with this run, skipping comment")
        return
    print(f"Fetching logs for run {run_id}...")
    failed_jobs = get_failed_job_logs(repo_name, run_id)
    if not failed_jobs:
        print("No failed jobs found")
        return
    analyses = []
    for job in failed_jobs:
        print(f"Analyzing: {job['job_name']}")
        diagnosis = analyze_failure(
            job['job_name'],
            job['logs'],
            job['failed_steps']
        )
        analyses.append({
            "job_name": job['job_name'],
            "diagnosis": diagnosis
        })
    comment = format_pr_comment(analyses)
    post_pr_comment(repo_name, int(pr_number), comment)
    print("Comment posted successfully")


if __name__ == "__main__":
    main()

Example Output on a Real PR
## 🤖 CI Failure Analysis
> Automatically analyzed by DevOps AI Assistant
### ❌ Build and Test
**Root Cause:** The `npm ci` command failed because `package-lock.json`
is out of sync with `package.json`. A new dependency was added directly
to `package.json` without running `npm install` to update the lock file.
**Fix:** Run locally and commit the updated lock file:
```bash
npm install
git add package-lock.json
git commit -m "fix: update package-lock.json"
```
**Why it happened:** `npm ci` requires exact lock file consistency and
fails if there's any mismatch. This is by design for reproducible builds.
**Prevention:** Add a CI check that verifies lock file consistency:
    - run: npm ci --dry-run
The tool turns an 847-line log file into a 5-line fix. Engineers spend their time fixing, not reading logs.
> Get your [Anthropic API key](https://console.anthropic.com/) to start building. Estimated cost: roughly 500–1000 input tokens, or about $0.001–0.003, per failure analysis. Longer log excerpts and the generated response add to that.
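The cost arithmetic can be made explicit with a small helper. This is a sketch under stated assumptions: `estimate_cost_usd` is a hypothetical function, and the prices passed to it in the usage example ($3 per million input tokens, $15 per million output tokens) are illustrative placeholders, so check the provider's current pricing for the model you use.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of one analysis from token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Illustrative prices only (assumed, not taken from the article)
input_only = estimate_cost_usd(1000, 0, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
with_output = estimate_cost_usd(1000, 500, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"input only: ${input_only:.4f}, with 500 output tokens: ${with_output:.4f}")
```

Under these assumed prices, 1000 input tokens alone come to $0.003, the top of the article's range, and output tokens are billed on top of that.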