Build an AI Flaky Test Detector for GitHub Actions with Claude API
Build a Python tool using PyGithub and the Anthropic Claude API to detect flaky tests in GitHub Actions, analyze root causes with AI, and generate fix reports — runs as a weekly cron job.
A flaky test fails sometimes and passes other times on the same code. Flaky tests are among the most demoralizing problems in CI: they block merges, waste engineer time, and erode trust in the test suite. This project builds an AI-powered detector that reads your GitHub Actions history, identifies flaky tests automatically, and uses Claude to suggest why they flake and how to fix them.
What We Are Building
- A Python script that fetches completed workflow runs from the last N days via the GitHub API
- Log parsing that finds tests that both pass and fail across runs
- A Claude API call that analyzes flaky test patterns for likely root causes
- A markdown report with fix suggestions
- A GitHub Actions workflow that runs this weekly and uploads the report as an artifact
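To make the core idea in the list above concrete, here is a minimal sketch of the detection rule: a test counts as flaky when it has both passing and failing outcomes across runs. The `runs` data and the `flaky_names` helper are hypothetical and only illustrate the rule; the full script works from real workflow logs.

```python
def flaky_names(runs):
    """Return sorted names of tests that both passed and failed across runs."""
    seen = {}  # test name -> set of statuses observed so far
    for run in runs:
        for name, status in run.items():
            seen.setdefault(name, set()).add(status)
    return sorted(
        name for name, statuses in seen.items()
        if {"PASSED", "FAILED"} <= statuses
    )


# Hypothetical outcomes from three CI runs of the same suite
runs = [
    {"test_login": "PASSED", "test_export": "PASSED"},
    {"test_login": "FAILED", "test_export": "PASSED"},
    {"test_login": "PASSED", "test_export": "PASSED"},
]
print(flaky_names(runs))  # ['test_login']
```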
Prerequisites
pip install anthropic PyGithub python-dateutil requests

Set these environment variables:
- GITHUB_TOKEN: a personal access token with repo and actions:read scope
- ANTHROPIC_API_KEY: your Anthropic API key
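Because the script reads both variables at import time, a missing one surfaces as a bare `KeyError`. A small pre-flight check can give a clearer message. This is a sketch only; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of the script.

```python
import os

REQUIRED_VARS = ("GITHUB_TOKEN", "ANTHROPIC_API_KEY")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
print(missing_env_vars(env={"GITHUB_TOKEN": "ghp_example"}))  # ['ANTHROPIC_API_KEY']
```

In the real script you could call `missing_env_vars()` first and exit with a readable error if the list is not empty.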
Step 1: Fetch Workflow Run History
# flaky_detector.py
import os
import json
from collections import defaultdict
from datetime import datetime, timedelta
from github import Github
import anthropic
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
REPO_NAME = os.environ.get("GITHUB_REPO", "myorg/myrepo")
WORKFLOW_NAME = os.environ.get("WORKFLOW_NAME", "CI")
LOOKBACK_DAYS = int(os.environ.get("LOOKBACK_DAYS", "30"))
MIN_RUNS_TO_FLAG = 3 # test must appear in at least this many runs
def fetch_test_results(repo, workflow_name, since_days):
    """Fetch test steps and parsed per-test results from recent workflow runs."""
    since = datetime.utcnow() - timedelta(days=since_days)
    # Find the workflow
    workflows = list(repo.get_workflows())
    target = next((w for w in workflows if workflow_name.lower() in w.name.lower()), None)
    if not target:
        print(f"Workflow '{workflow_name}' not found. Available: {[w.name for w in workflows]}")
        return []
    runs = target.get_runs(status="completed", branch="main")
    results = []
    for run in runs:
        if run.created_at.replace(tzinfo=None) < since:
            break
        for job in run.jobs():
            job_tests = None  # parsed once per job, on the first test step
            for step in job.steps:
                # Detect test step by common naming
                if any(k in step.name.lower() for k in ["test", "pytest", "jest", "go test"]):
                    first_test_step = job_tests is None
                    if first_test_step:
                        # Helpers are defined in Step 2; without this the detector has no per-test data
                        job_tests = parse_test_names_from_log(get_job_log(repo, run.id, job.id))
                    results.append({
                        "run_id": run.id,
                        "run_number": run.run_number,
                        "created_at": run.created_at.isoformat(),
                        "job_name": job.name,
                        "job_id": job.id,
                        "step_name": step.name,
                        "conclusion": step.conclusion,  # "success" or "failure"
                        "commit_sha": run.head_sha,
                        # Attach results to one step per job so tests are not counted twice
                        "test_results": job_tests if first_test_step else {},
                    })
    return results

Step 2: Parse Test Logs for Individual Test Names
GitHub Actions does not expose individual test names via the API — you need to parse the job logs:
def parse_test_names_from_log(log_text):
    """Extract test names and results from common test output formats."""
    test_results = {}
    for raw_line in log_text.split("\n"):
        line = raw_line.strip()
        # GitHub Actions prefixes each log line with an ISO timestamp; drop it if present
        first, _, rest = line.partition(" ")
        if first[:4].isdigit() and "T" in first and first.endswith("Z"):
            line = rest.strip()
        # pytest summary format: "PASSED tests/test_api.py::test_connection" or "FAILED tests/... - reason"
        # (passed tests are listed in the summary only when pytest runs with -rA)
        if line.startswith(("PASSED ", "FAILED ", "ERROR ")):
            status, _, remainder = line.partition(" ")
            # Drop the " - reason" suffix that pytest appends to failures
            test_name = remainder.split(" - ", 1)[0].strip() or "unknown"
            test_results[test_name] = status
        # Go test format: "--- PASS: TestFunctionName" or "--- FAIL: TestFunctionName"
        elif line.startswith(("--- PASS:", "--- FAIL:")):
            parts = line.split(":", 1)
            status = "PASSED" if "PASS" in parts[0] else "FAILED"
            test_name = parts[1].strip().split(" ")[0] if len(parts) > 1 else "unknown"
            test_results[test_name] = status
    return test_results

def get_job_log(repo, run_id, job_id):
    """Download raw log for a specific job (run_id is kept only for the call signature)."""
    import requests
    headers = {
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.v3+json",
    }
    # Job logs are addressed by job ID alone: /repos/{owner}/{repo}/actions/jobs/{job_id}/logs
    url = f"https://api.github.com/repos/{repo.full_name}/actions/jobs/{job_id}/logs"
    response = requests.get(url, headers=headers, allow_redirects=True, timeout=30)
    return response.text if response.status_code == 200 else ""

Step 3: Identify Flaky Tests
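A pass and a fail on different commits may be a real regression followed by a fix rather than flakiness. As a stricter variant (an assumption for illustration, not part of the original script), the sketch below flags a test only when it both passed and failed on the same commit SHA. The record shape (`commit_sha`, `test_results`) and the helper name `flaky_on_same_commit` are hypothetical.

```python
from collections import defaultdict


def flaky_on_same_commit(run_records):
    """Return {test_name: [commit SHAs]} where the test passed and also did not pass."""
    # (test name, commit SHA) -> set of statuses seen for that pair
    outcomes = defaultdict(set)
    for record in run_records:
        for test_name, status in record["test_results"].items():
            outcomes[(test_name, record["commit_sha"])].add(status)

    flaky = defaultdict(list)
    for (test_name, sha), statuses in outcomes.items():
        # PASSED plus at least one other status (FAILED or ERROR) on one commit
        if "PASSED" in statuses and len(statuses) > 1:
            flaky[test_name].append(sha)
    return dict(flaky)


# Hypothetical records: two runs of commit abc123, one run of def456
records = [
    {"commit_sha": "abc123", "test_results": {"test_a": "PASSED", "test_b": "FAILED"}},
    {"commit_sha": "abc123", "test_results": {"test_a": "FAILED", "test_b": "FAILED"}},
    {"commit_sha": "def456", "test_results": {"test_a": "PASSED", "test_b": "PASSED"}},
]
print(flaky_on_same_commit(records))  # {'test_a': ['abc123']}
```

Here `test_b` is not flagged: it failed consistently on one commit and passed on the next, which looks like a regression that was fixed.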
def find_flaky_tests(all_run_results):
    """
    A test is flagged as flaky if it has both PASSED and FAILED results across
    the analyzed runs. Commits are not compared here, so a real regression that
    was later fixed is flagged too; treat the output as candidates to review.
    """
    test_history = defaultdict(lambda: {"passed": 0, "failed": 0, "run_ids": []})
    for run in all_run_results:
        for test_name, status in run.get("test_results", {}).items():
            entry = test_history[test_name]
            entry["run_ids"].append(run["run_id"])
            if status == "PASSED":
                entry["passed"] += 1
            else:
                # FAILED and ERROR both count as failures
                entry["failed"] += 1
    flaky = {}
    for test_name, history in test_history.items():
        total = history["passed"] + history["failed"]
        if total < MIN_RUNS_TO_FLAG:
            continue
        if history["passed"] > 0 and history["failed"] > 0:
            flakiness_rate = history["failed"] / total
            flaky[test_name] = {
                **history,
                "flakiness_rate": round(flakiness_rate, 2),
                "total_runs": total,
            }
    # Sort by flakiness rate descending
    return dict(sorted(flaky.items(), key=lambda x: x[1]["flakiness_rate"], reverse=True))

Step 4: Analyze Flaky Tests with Claude API
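Model replies sometimes wrap JSON in a Markdown code fence or add a sentence around it, so the response parsing in this step needs to be tolerant. Here is a stand-alone sketch of that idea. `extract_json_array` is a hypothetical helper name, and the sketch assumes the reply contains at most one top-level JSON array.

```python
import json


def extract_json_array(reply_text):
    """Parse a JSON array from a model reply, tolerating surrounding text."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span, e.g. inside a Markdown fence
        start = reply_text.find("[")
        end = reply_text.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            parsed = json.loads(reply_text[start:end + 1])
        except json.JSONDecodeError:
            return []
    # Anything other than a list (e.g. a single object) is treated as no analysis
    return parsed if isinstance(parsed, list) else []


fence = "`" * 3  # build the Markdown fence without typing it literally
reply = fence + 'json\n[{"test_name": "test_a", "likely_cause": "timing"}]\n' + fence
print(extract_json_array(reply))  # [{'test_name': 'test_a', 'likely_cause': 'timing'}]
```

Returning an empty list on any failure lets the report fall back to its "No AI analysis available." defaults instead of crashing the weekly job.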
def analyze_flaky_tests_with_claude(flaky_tests, repo_name):
    """Send flaky test data to Claude for root cause analysis."""
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    test_summary = json.dumps(
        {k: {
            "flakiness_rate": v["flakiness_rate"],
            "passed": v["passed"],
            "failed": v["failed"],
            "total_runs": v["total_runs"],
        } for k, v in list(flaky_tests.items())[:20]},  # top 20
        indent=2,
    )
    prompt = f"""You are a senior software engineer reviewing flaky tests in the CI pipeline for the repository '{repo_name}'.

Here are the flaky tests detected over the past {LOOKBACK_DAYS} days, with their pass/fail statistics:

{test_summary}

For each test, analyze the test name and failure pattern to infer likely root causes. Common flaky test causes include:
1. Timing issues (sleep/wait not long enough, race conditions)
2. Shared state between tests (global variables, database not reset)
3. External network or API dependency (timeouts, rate limits)
4. Test ordering dependency (test B requires test A to run first)
5. Resource contention (port conflicts, file locking)
6. Timezone or locale sensitivity
7. Random data generation without fixed seed

Return a JSON array where each item has:
- "test_name": the test identifier
- "likely_cause": one of [timing, shared_state, network, ordering, resource, locale, randomness, unknown]
- "explanation": 1-2 sentence explanation of why this test likely flakes
- "fix_suggestion": concrete code-level suggestion for fixing it

Return ONLY valid JSON, no markdown, no explanation outside the JSON."""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        analysis = json.loads(message.content[0].text)
        return analysis
    except json.JSONDecodeError:
        # Claude sometimes returns valid JSON wrapped in markdown; strip the wrapper
        raw = message.content[0].text
        start = raw.find("[")
        end = raw.rfind("]") + 1
        return json.loads(raw[start:end]) if start >= 0 else []

Step 5: Generate the Report
def generate_report(flaky_tests, analysis, output_path="flaky-test-report.md"):
    """Generate a markdown report combining statistics and AI analysis."""
    analysis_map = {item["test_name"]: item for item in analysis}
    lines = [
        f"# Flaky Test Report - {datetime.utcnow().strftime('%Y-%m-%d')}",
        f"\nAnalysis period: last {LOOKBACK_DAYS} days | Repo: `{REPO_NAME}`\n",
        f"**{len(flaky_tests)} flaky tests detected**\n",
        "---\n",
    ]
    for test_name, stats in flaky_tests.items():
        ai = analysis_map.get(test_name, {})
        lines.append(f"## `{test_name}`")
        lines.append(f"- Flakiness rate: **{stats['flakiness_rate']*100:.0f}%** ({stats['failed']} failed / {stats['total_runs']} runs)")
        lines.append(f"- Likely cause: **{ai.get('likely_cause', 'unknown')}**")
        lines.append(f"- Explanation: {ai.get('explanation', 'No AI analysis available.')}")
        lines.append(f"- Fix suggestion: {ai.get('fix_suggestion', 'Investigate manually.')}")
        lines.append("")
    with open(output_path, "w") as f:
        f.write("\n".join(lines))
    print(f"Report written to {output_path}")

def main():
    g = Github(GITHUB_TOKEN)
    repo = g.get_repo(REPO_NAME)
    print(f"Fetching workflow runs for '{WORKFLOW_NAME}' in {REPO_NAME}...")
    run_results = fetch_test_results(repo, WORKFLOW_NAME, LOOKBACK_DAYS)
    print(f"Found {len(run_results)} test steps across recent runs.")
    flaky = find_flaky_tests(run_results)
    if not flaky:
        print("No flaky tests detected.")
        return
    print(f"Detected {len(flaky)} flaky tests. Sending to Claude for analysis...")
    analysis = analyze_flaky_tests_with_claude(flaky, REPO_NAME)
    generate_report(flaky, analysis)

if __name__ == "__main__":
    main()

Step 6: GitHub Actions Weekly Cron Job
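Alongside the artifact upload, the report could also be shown directly on the workflow run page. GitHub Actions sets a `GITHUB_STEP_SUMMARY` environment variable that points at a file, and Markdown appended to that file is rendered as the job summary. The sketch below is an optional addition, not part of the original script, and `append_to_step_summary` is a hypothetical helper name.

```python
import os
from pathlib import Path


def append_to_step_summary(report_path, env=os.environ):
    """Append the report to the job summary file; return True if it was written."""
    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False  # variable not set, e.g. when running locally
    report = Path(report_path).read_text(encoding="utf-8")
    # Append rather than overwrite so earlier summary content is kept
    with open(summary_path, "a", encoding="utf-8") as summary:
        summary.write(report + "\n")
    return True
```

One way to wire it in would be to call `append_to_step_summary("flaky-test-report.md")` at the end of `main()`; when run locally it simply returns `False`.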
# .github/workflows/flaky-detector.yml
name: Flaky Test Detector
on:
  schedule:
    - cron: "0 9 * * 1" # Every Monday at 9 AM UTC
  workflow_dispatch: # Allow manual trigger
jobs:
  detect-flaky-tests:
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install anthropic PyGithub python-dateutil requests
      - name: Run flaky test detector
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_REPO: ${{ github.repository }}
          WORKFLOW_NAME: "CI"
          LOOKBACK_DAYS: "30"
        run: python flaky_detector.py
      - name: Upload report as artifact
        uses: actions/upload-artifact@v4
        with:
          name: flaky-test-report-${{ github.run_number }}
          path: flaky-test-report.md
          retention-days: 90

Running It Locally
export GITHUB_TOKEN=ghp_yourtoken
export ANTHROPIC_API_KEY=sk-ant-yourkey
export GITHUB_REPO=myorg/myrepo
export WORKFLOW_NAME=CI
export LOOKBACK_DAYS=14
python flaky_detector.py
cat flaky-test-report.md

Example Report Output
# Flaky Test Report - 2026-06-27
Analysis period: last 30 days | Repo: `myorg/myrepo`
**3 flaky tests detected**
## `tests/test_api.py::test_user_login`
- Flakiness rate: **43%** (13 failed / 30 runs)
- Likely cause: **network**
- Explanation: The test name suggests it connects to an auth endpoint. Intermittent
failures at 43% often indicate a test environment service that is not always
available or a connection timeout set too aggressively.
- Fix suggestion: Add retry logic with exponential backoff, or mock the auth service
in unit tests and move real endpoint testing to integration tests with longer timeouts.
The full project runs in under 2 minutes for repos with up to 50 workflow runs. To go further, add a Slack webhook notification as an extra step that posts the report summary to your team channel each Monday.
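The Slack step could look roughly like this. The sketch assumes a Slack incoming webhook whose URL is stored in a `SLACK_WEBHOOK_URL` environment variable (a name chosen here for illustration) and posts a short plain-text summary. Both helper names are hypothetical.

```python
import json
import os
import urllib.request


def build_slack_payload(flaky_tests, repo_name, top_n=5):
    """Build the JSON body for a Slack incoming webhook message."""
    lines = [f"{len(flaky_tests)} flaky tests detected in {repo_name}"]
    for name, stats in list(flaky_tests.items())[:top_n]:
        lines.append(f"- {name}: {stats['flakiness_rate'] * 100:.0f}% failure rate")
    return {"text": "\n".join(lines)}


def post_to_slack(payload, webhook_url):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical data in the shape produced by find_flaky_tests
    sample = {"tests/test_api.py::test_user_login": {"flakiness_rate": 0.43}}
    payload = build_slack_payload(sample, "myorg/myrepo")
    print(payload["text"])
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook_url:  # only send when a webhook is configured
        post_to_slack(payload, webhook_url)
```

In the workflow, the webhook URL would be stored as a repository secret and passed to the script through `env`, the same way `ANTHROPIC_API_KEY` is.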
Resources: Anthropic API docs, PyGithub docs, GitHub Actions cron syntax.