Build an AI Flaky Test Detector for GitHub Actions with Claude API

Build a Python tool using PyGithub and the Anthropic Claude API to detect flaky tests in GitHub Actions, analyze root causes with AI, and generate fix reports — runs as a weekly cron job.

DevOpsBoys · 7 min read

A flaky test fails on some runs and passes on others against the same code. Flaky tests are among the most demoralizing problems in CI: they block merges, waste engineering time, and erode trust in the test suite. This project builds an AI-powered detector that reads your GitHub Actions history, identifies flaky tests automatically, and uses Claude to suggest why they flake and how to fix them.

What We Are Building

  1. A Python script that fetches workflow runs from the last N days via the GitHub API
  2. A log parser that finds tests that both pass and fail across those runs
  3. A Claude API call that analyzes the flaky test patterns for likely root causes
  4. A markdown report with fix suggestions
  5. A GitHub Actions workflow that runs the script weekly and uploads the report as an artifact

Prerequisites

bash
pip install anthropic PyGithub python-dateutil requests

Set these environment variables:

  • GITHUB_TOKEN: a classic personal access token with the repo scope, or a fine-grained token with read-only Actions permission on the repository
  • ANTHROPIC_API_KEY: your Anthropic API key
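If either variable is unset, the script's `os.environ[...]` lookups fail with a bare `KeyError`. A minimal sketch of a friendlier startup check follows; the helper name `missing_env_vars` is hypothetical and not part of the original script.

```python
import os

# The two variables the article lists as required
REQUIRED_VARS = ("GITHUB_TOKEN", "ANTHROPIC_API_KEY")

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

One way to use it is to call `missing_env_vars()` at the top of `main()` and exit with a message that names the missing variables.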

Step 1: Fetch Workflow Run History

python
# flaky_detector.py
import os
import json
from collections import defaultdict
from datetime import datetime, timedelta
from github import Github
import anthropic
 
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
REPO_NAME = os.environ.get("GITHUB_REPO", "myorg/myrepo")
WORKFLOW_NAME = os.environ.get("WORKFLOW_NAME", "CI")
LOOKBACK_DAYS = int(os.environ.get("LOOKBACK_DAYS", "30"))
MIN_RUNS_TO_FLAG = 3  # test must appear in at least this many runs
 
def fetch_test_results(repo, workflow_name, since_days):
    """Fetch test job logs from recent workflow runs."""
    since = datetime.utcnow() - timedelta(days=since_days)
    
    # Find the workflow
    workflows = list(repo.get_workflows())
    target = next((w for w in workflows if workflow_name.lower() in w.name.lower()), None)
    if not target:
        print(f"Workflow '{workflow_name}' not found. Available: {[w.name for w in workflows]}")
        return []
    
    runs = target.get_runs(status="completed", branch="main")
    results = []
    
    for run in runs:
        if run.created_at.replace(tzinfo=None) < since:
            break
        
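        for job in run.jobs():
            # Detect test steps by common naming ("go test" is already covered by "test")
            test_steps = [
                s for s in job.steps
                if any(k in s.name.lower() for k in ["test", "pytest", "jest"])
            ]
            if not test_steps:
                continue

            # Download the job log once and parse per-test outcomes from it;
            # find_flaky_tests() reads them from the "test_results" key.
            log_text = get_job_log(repo, run.id, job.id)
            results.append({
                "run_id": run.id,
                "run_number": run.run_number,
                "created_at": run.created_at.isoformat(),
                "job_name": job.name,
                "step_name": test_steps[0].name,  # first matching step in the job
                "conclusion": job.conclusion,  # "success" or "failure"
                "commit_sha": run.head_sha,
                "test_results": parse_test_names_from_log(log_text),
            })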
    
    return results
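The cutoff comparison in Step 1 strips the timezone from `run.created_at` and uses `datetime.utcnow()`, which is deprecated as of Python 3.12. A timezone-aware variant could look like the sketch below. The helper name `within_lookback` is hypothetical, and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

def within_lookback(created_at, since_days, now=None):
    """Return True if created_at falls inside the last since_days days."""
    now = now or datetime.now(timezone.utc)
    if created_at.tzinfo is None:
        # Assumption: a naive timestamp coming from the API is in UTC
        created_at = created_at.replace(tzinfo=timezone.utc)
    return created_at >= now - timedelta(days=since_days)
```

In the loop, `if not within_lookback(run.created_at, since_days): break` would then replace the naive comparison.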

Step 2: Parse Test Logs for Individual Test Names

GitHub Actions does not expose individual test names via the API, so you need to download and parse the job logs:

python
import re

# Raw GitHub Actions log lines start with an ISO 8601 timestamp; strip it before matching
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z\s")

def parse_test_names_from_log(log_text):
    """Extract test names and results from common test output formats."""
    test_results = {}

    for raw_line in log_text.splitlines():
        line = TIMESTAMP_PREFIX.sub("", raw_line).strip()

        # pytest summary format: "PASSED tests/test_api.py::test_connection"
        # or "FAILED tests/... - AssertionError". PASSED lines only appear with `pytest -rA`.
        if line.startswith(("PASSED ", "FAILED ", "ERROR ")):
            status, _, rest = line.partition(" ")
            # Drop the " - <message>" suffix so the same test always maps to the same key
            test_name = rest.split(" - ", 1)[0].strip() or "unknown"
            test_results[test_name] = status

        # Go test format: "--- PASS: TestFunctionName (0.01s)" or "--- FAIL: TestFunctionName"
        elif line.startswith(("--- PASS:", "--- FAIL:")):
            marker, _, rest = line.partition(":")
            status = "PASSED" if "PASS" in marker else "FAILED"
            name_parts = rest.split()
            test_name = name_parts[0] if name_parts else "unknown"
            test_results[test_name] = status

    return test_results
 
def get_job_log(repo, run_id, job_id):
    """Download the raw log for a specific job (run_id is unused: the endpoint is keyed by job ID)."""
    import requests
    headers = {
        "Authorization": f"token {GITHUB_TOKEN}",
        "Accept": "application/vnd.github+json",
    }
    # Job logs live under /actions/jobs/{job_id}/logs; the API responds with a redirect to the log file
    url = f"https://api.github.com/repos/{repo.full_name}/actions/jobs/{job_id}/logs"
    response = requests.get(url, headers=headers, allow_redirects=True, timeout=30)
    return response.text if response.status_code == 200 else ""
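The parser above covers pytest's summary lines and Go's `--- PASS:` lines. Many pipelines instead run `pytest -v`, which prints the outcome after the test ID. Here is a rough sketch for that format. It assumes each raw log line starts with an ISO 8601 timestamp and that test IDs contain no spaces. The function name `parse_pytest_verbose` and the sample log are illustrative only.

```python
import re

# Assumption: raw Actions log lines start with an ISO 8601 timestamp
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s+")
# pytest -v line: "<node id> <OUTCOME> [ 50%]"
VERBOSE_LINE = re.compile(r"^(?P<name>\S+::\S+)\s+(?P<status>PASSED|FAILED|ERROR)\b")

def parse_pytest_verbose(log_text):
    """Map test IDs to PASSED/FAILED/ERROR from `pytest -v` style output."""
    results = {}
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw)
        match = VERBOSE_LINE.match(line)
        if match:
            results[match["name"]] = match["status"]
    return results

# Made-up log excerpt for illustration
sample = "\n".join([
    "2026-06-27T09:00:01.1234567Z tests/test_api.py::test_connection PASSED [ 50%]",
    "2026-06-27T09:00:02.7654321Z tests/test_api.py::test_user_login FAILED [100%]",
    "2026-06-27T09:00:03.0000000Z ===== 1 failed, 1 passed in 1.02s =====",
])
print(parse_pytest_verbose(sample))
```

Parametrized test IDs that contain spaces would be missed by this pattern and would need a looser regular expression.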

Step 3: Identify Flaky Tests

python
def find_flaky_tests(all_run_results):
    """
    A test is flagged as flaky if it appears as both PASSED and FAILED across runs.
    Commits are not compared here, so a real regression that was later fixed is
    also flagged; treat the output as candidates to review.
    """
    test_history = defaultdict(lambda: {"passed": 0, "failed": 0, "run_ids": []})
    
    for run in all_run_results:
        for test_name, status in run.get("test_results", {}).items():
            entry = test_history[test_name]
            entry["run_ids"].append(run["run_id"])
            if status == "PASSED":
                entry["passed"] += 1
            else:
                entry["failed"] += 1
    
    flaky = {}
    for test_name, history in test_history.items():
        total = history["passed"] + history["failed"]
        if total < MIN_RUNS_TO_FLAG:
            continue
        if history["passed"] > 0 and history["failed"] > 0:
            flakiness_rate = history["failed"] / total
            flaky[test_name] = {
                **history,
                "flakiness_rate": round(flakiness_rate, 2),
                "total_runs": total,
            }
    
    # Sort by flakiness rate descending
    return dict(sorted(flaky.items(), key=lambda x: x[1]["flakiness_rate"], reverse=True))
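To rule out real regressions, one option is to look only at tests that both passed and failed on the same commit. The sketch below assumes each run record carries the `commit_sha` and `test_results` keys used elsewhere in this article. The function name `find_same_commit_flakes` is hypothetical.

```python
from collections import defaultdict

def find_same_commit_flakes(all_run_results):
    """Return {test_name: [commit_sha, ...]} for tests with mixed outcomes on one commit."""
    outcomes = defaultdict(set)  # (test_name, commit_sha) -> {"PASSED", "FAILED"}
    for run in all_run_results:
        for test_name, status in run.get("test_results", {}).items():
            # Treat anything other than PASSED (FAILED, ERROR) as a failure
            normalized = "PASSED" if status == "PASSED" else "FAILED"
            outcomes[(test_name, run["commit_sha"])].add(normalized)

    flaky = defaultdict(list)
    for (test_name, sha), seen in outcomes.items():
        if len(seen) == 2:  # both outcomes observed on the same code
            flaky[test_name].append(sha)
    return dict(flaky)
```

This only finds something when the same SHA ran more than once, for example through a scheduled trigger or a manual dispatch, so it complements the rate-based detection rather than replacing it.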

Step 4: Analyze Flaky Tests with Claude API

python
def analyze_flaky_tests_with_claude(flaky_tests, repo_name):
    """Send flaky test data to Claude for root cause analysis."""
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    
    test_summary = json.dumps(
        {k: {
            "flakiness_rate": v["flakiness_rate"],
            "passed": v["passed"],
            "failed": v["failed"],
            "total_runs": v["total_runs"],
        } for k, v in list(flaky_tests.items())[:20]},  # top 20
        indent=2,
    )
    
    prompt = f"""You are a senior software engineer reviewing flaky tests in the CI pipeline for the repository '{repo_name}'.
 
Here are the flaky tests detected over the past {LOOKBACK_DAYS} days, with their pass/fail statistics:
 
{test_summary}
 
For each test, analyze the test name and failure pattern to infer likely root causes. Common flaky test causes include:
1. Timing issues (sleep/wait not long enough, race conditions)
2. Shared state between tests (global variables, database not reset)
3. External network or API dependency (timeouts, rate limits)
4. Test ordering dependency (test B requires test A to run first)
5. Resource contention (port conflicts, file locking)
6. Timezone or locale sensitivity
7. Random data generation without fixed seed
 
Return a JSON array where each item has:
- "test_name": the test identifier
- "likely_cause": one of [timing, shared_state, network, ordering, resource, locale, randomness, unknown]
- "explanation": 1-2 sentence explanation of why this test likely flakes
- "fix_suggestion": concrete code-level suggestion for fixing it
 
Return ONLY valid JSON, no markdown, no explanation outside the JSON."""
 
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    
    raw = message.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Claude sometimes wraps valid JSON in a markdown fence; extract the outermost array
        start = raw.find("[")
        end = raw.rfind("]") + 1
        if start < 0 or end <= start:
            return []
        try:
            return json.loads(raw[start:end])
        except json.JSONDecodeError:
            return []
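The prompt asks for a fixed set of `likely_cause` values, but model output is not guaranteed to follow it. A small validation pass before building the report could look like this sketch. The helper name `normalize_analysis` is hypothetical, and the allowed values are copied from the prompt above.

```python
# Cause labels requested in the prompt
ALLOWED_CAUSES = {
    "timing", "shared_state", "network", "ordering",
    "resource", "locale", "randomness", "unknown",
}

def normalize_analysis(items):
    """Keep well-formed entries and map unexpected causes to 'unknown'."""
    cleaned = []
    if not isinstance(items, list):
        return cleaned
    for item in items:
        # Skip entries that are not objects or lack a test name
        if not isinstance(item, dict) or not item.get("test_name"):
            continue
        cause = str(item.get("likely_cause", "unknown")).strip().lower()
        if cause not in ALLOWED_CAUSES:
            cause = "unknown"
        cleaned.append({**item, "likely_cause": cause})
    return cleaned
```

Calling `normalize_analysis(analysis)` before `generate_report` keeps a malformed entry from breaking the report step.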

Step 5: Generate the Report

python
def generate_report(flaky_tests, analysis, output_path="flaky-test-report.md"):
    """Generate a markdown report combining statistics and AI analysis."""
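    # Skip malformed entries so one bad item from the model does not crash the report
    analysis_map = {
        item["test_name"]: item
        for item in analysis
        if isinstance(item, dict) and "test_name" in item
    }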
    
    lines = [
        f"# Flaky Test Report — {datetime.utcnow().strftime('%Y-%m-%d')}",
        f"\nAnalysis period: last {LOOKBACK_DAYS} days | Repo: `{REPO_NAME}`\n",
        f"**{len(flaky_tests)} flaky tests detected**\n",
        "---\n",
    ]
    
    for test_name, stats in flaky_tests.items():
        ai = analysis_map.get(test_name, {})
        lines.append(f"## `{test_name}`")
        lines.append(f"- Flakiness rate: **{stats['flakiness_rate']*100:.0f}%** ({stats['failed']} failed / {stats['total_runs']} runs)")
        lines.append(f"- Likely cause: **{ai.get('likely_cause', 'unknown')}**")
        lines.append(f"- Explanation: {ai.get('explanation', 'No AI analysis available.')}")
        lines.append(f"- Fix suggestion: {ai.get('fix_suggestion', 'Investigate manually.')}")
        lines.append("")
    
    # Write as UTF-8 explicitly: the report title contains a non-ASCII dash
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    
    print(f"Report written to {output_path}")
 
def main():
    g = Github(GITHUB_TOKEN)
    repo = g.get_repo(REPO_NAME)
    
    print(f"Fetching workflow runs for '{WORKFLOW_NAME}' in {REPO_NAME}...")
    run_results = fetch_test_results(repo, WORKFLOW_NAME, LOOKBACK_DAYS)
    
    print(f"Found {len(run_results)} test steps across recent runs.")
    flaky = find_flaky_tests(run_results)
    
    if not flaky:
        print("No flaky tests detected.")
        return
    
    print(f"Detected {len(flaky)} flaky tests. Sending to Claude for analysis...")
    analysis = analyze_flaky_tests_with_claude(flaky, REPO_NAME)
    
    generate_report(flaky, analysis)
 
if __name__ == "__main__":
    main()

Step 6: GitHub Actions Weekly Cron Job

yaml
# .github/workflows/flaky-detector.yml
name: Flaky Test Detector
 
on:
  schedule:
    - cron: "0 9 * * 1"  # Every Monday at 9 AM UTC
  workflow_dispatch:      # Allow manual trigger
 
jobs:
  detect-flaky-tests:
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
 
    steps:
      - uses: actions/checkout@v4
 
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
 
      - name: Install dependencies
        run: pip install anthropic PyGithub python-dateutil requests
 
      - name: Run flaky test detector
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_REPO: ${{ github.repository }}
          WORKFLOW_NAME: "CI"
          LOOKBACK_DAYS: "30"
        run: python flaky_detector.py
 
      - name: Upload report as artifact
        uses: actions/upload-artifact@v4
        with:
          name: flaky-test-report-${{ github.run_number }}
          path: flaky-test-report.md
          retention-days: 90

Running It Locally

bash
export GITHUB_TOKEN=ghp_yourtoken
export ANTHROPIC_API_KEY=sk-ant-yourkey
export GITHUB_REPO=myorg/myrepo
export WORKFLOW_NAME=CI
export LOOKBACK_DAYS=14
 
python flaky_detector.py
cat flaky-test-report.md

Example Report Output

# Flaky Test Report — 2026-06-27

Analysis period: last 30 days | Repo: `myorg/myrepo`
**3 flaky tests detected**

## `tests/test_api.py::test_user_login`
- Flakiness rate: **43%** (13 failed / 30 runs)
- Likely cause: **network**
- Explanation: The test name suggests it connects to an auth endpoint. Intermittent
  failures at 43% often indicate a test environment service that is not always
  available or a connection timeout set too aggressively.
- Fix suggestion: Add retry logic with exponential backoff, or mock the auth service
  in unit tests and move real endpoint testing to integration tests with longer timeouts.

The full project runs in under 2 minutes for repos with up to 50 workflow runs. To go further, add a Slack webhook notification as an extra step that posts the report summary to your team channel each Monday.
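The Slack step could be sketched as below, using only the standard library and a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the 3000-character cut-off are assumptions for illustration, not part of the original project.

```python
import json
import urllib.request

def build_slack_payload(report_markdown, max_chars=3000):
    """Build an incoming-webhook payload from the start of the report."""
    summary = report_markdown[:max_chars]
    if len(report_markdown) > max_chars:
        summary += "\n... (truncated, see the workflow artifact for the full report)"
    return {"text": summary}

def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In the workflow, the webhook URL would come from a repository secret (a hypothetical `SLACK_WEBHOOK_URL`, say), and the script would call `post_to_slack` with the contents of `flaky-test-report.md`. Slack does not render markdown headings, so the message appears as mostly plain text.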


Resources: Anthropic API docs, PyGithub docs, GitHub Actions cron syntax.
