
Build a GitOps Drift Detector with LangChain + ArgoCD API

ArgoCD tells you when drift happens — but not why it matters or what to do. Build an AI agent with LangChain that detects drift, explains the risk, and suggests the right fix.

DevOpsBoys · Jun 12, 2026 · 6 min read

ArgoCD already detects drift. It compares your Git state with your cluster state and tells you when they don't match. That part works fine.

What it doesn't do: explain why that drift is dangerous, what changed, whether it was intentional, and what the operator should do next.

This is where an AI agent changes the workflow completely. Let's build a LangChain agent that polls ArgoCD for out-of-sync applications, analyzes the drift, and generates a human-readable incident summary with remediation steps.

What We're Building

ArgoCD API → Drift Detection → LangChain Agent → Analysis + Remediation Report

The agent will:

  1. Poll ArgoCD API for all applications
  2. Find out-of-sync applications
  3. Fetch the diff between desired (Git) and live (cluster) state
  4. Use an LLM to analyze the diff and classify the risk
  5. Generate a structured report with: what changed, why it's risky, recommended action
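
The report fields in step 5 map naturally onto a small data structure. Here is a minimal sketch, assuming a hypothetical `DriftFinding` class. It is not part of the agent built below, which lets the LLM write the report as free-form Markdown; it only illustrates the shape of one finding.

```python
from dataclasses import dataclass

# Hypothetical container for one finding. The field names mirror the report
# fields listed above; this is not an ArgoCD or LangChain type.
@dataclass
class DriftFinding:
    app_name: str
    risk_level: str          # CRITICAL, HIGH, MEDIUM, or LOW
    what_changed: str
    why_it_matters: str
    recommended_action: str

    def to_markdown(self) -> str:
        """Render the finding as one section of the drift report."""
        return "\n".join([
            f"### {self.app_name}",
            f"- **Risk Level:** {self.risk_level}",
            f"- **What Changed:** {self.what_changed}",
            f"- **Why It Matters:** {self.why_it_matters}",
            f"- **Recommended Action:** {self.recommended_action}",
        ])
```

Keeping findings in a typed structure like this would make it easier to filter or route them later, at the cost of asking the LLM for structured output instead of prose.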

Prerequisites

bash
pip install langchain langchain-anthropic python-dotenv requests

You need:

  • ArgoCD running with API access
  • Anthropic API key (or swap for OpenAI)
  • ArgoCD API token
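
It can save a confusing stack trace to check these settings up front. A small sketch, assuming a hypothetical `missing_settings` helper; the variable names are the ones the code in this article reads from the environment.

```python
import os
from typing import List, Mapping

# Environment variables read by the client and the LLM setup in this article.
REQUIRED_VARS = ("ARGOCD_SERVER_URL", "ARGOCD_TOKEN", "ANTHROPIC_API_KEY")

def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_settings()` at startup and exiting when the list is non-empty gives a clear error before any API call is made.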

Step 1: ArgoCD API Client

python
# argocd_client.py
import requests
import json
from dataclasses import dataclass
from typing import List, Optional
 
@dataclass
class DriftedApp:
    name: str
    namespace: str
    sync_status: str
    health_status: str
    repo_url: str
    target_revision: str
    diff: str
 
class ArgoCDClient:
    def __init__(self, server_url: str, token: str, verify_ssl: bool = True):
        self.server_url = server_url.rstrip('/')
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        }
        self.verify_ssl = verify_ssl
 
    def get_all_applications(self) -> List[dict]:
        resp = requests.get(
            f"{self.server_url}/api/v1/applications",
            headers=self.headers,
            verify=self.verify_ssl
        )
        resp.raise_for_status()
        return resp.json().get("items", [])
 
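    def get_application_diff(self, app_name: str) -> str:
        # The managed-resources endpoint returns, per resource, the live state
        # and the state ArgoCD expects after a sync, both as JSON strings.
        # Field names follow ArgoCD's ResourceDiff type; verify them against
        # the Swagger docs of your ArgoCD version.
        resp = requests.get(
            f"{self.server_url}/api/v1/applications/{app_name}/managed-resources",
            headers=self.headers,
            verify=self.verify_ssl
        )
        if resp.status_code != 200:
            return "Diff unavailable"

        # Keep only resources whose live state differs from the expected state
        changed = [
            {
                "kind": res.get("kind"),
                "name": res.get("name"),
                "desired": res.get("predictedLiveState"),
                "live": res.get("normalizedLiveState"),
            }
            for res in resp.json().get("items", [])
            if res.get("predictedLiveState") != res.get("normalizedLiveState")
        ]
        return json.dumps(changed, indent=2)[:3000]  # Limit size for the LLM prompt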
 
    def get_drifted_apps(self) -> List[DriftedApp]:
        apps = self.get_all_applications()
        drifted = []
        
        for app in apps:
            sync_status = app.get("status", {}).get("sync", {}).get("status", "")
            if sync_status == "OutOfSync":
                name = app["metadata"]["name"]
                diff = self.get_application_diff(name)
                
                drifted.append(DriftedApp(
                    name=name,
                    namespace=app["metadata"]["namespace"],
                    sync_status=sync_status,
                    health_status=app.get("status", {}).get("health", {}).get("status", "Unknown"),
                    repo_url=app.get("spec", {}).get("source", {}).get("repoURL", ""),
                    target_revision=app.get("spec", {}).get("source", {}).get("targetRevision", ""),
                    diff=diff
                ))
        
        return drifted
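
ArgoCD's API hands back manifests as JSON, and a line-oriented diff is usually easier for both people and an LLM to read than two full manifests side by side. A minimal sketch using the standard library's `difflib`, assuming a hypothetical `manifest_diff` helper that takes the desired and live manifests as JSON strings:

```python
import difflib
import json

def manifest_diff(desired_json: str, live_json: str, name: str = "resource") -> str:
    """Return a unified diff between two manifests given as JSON strings."""
    def normalize(raw: str) -> list:
        # Sorting keys keeps pure ordering differences from showing up as drift.
        return json.dumps(json.loads(raw), indent=2, sort_keys=True).splitlines()

    return "\n".join(difflib.unified_diff(
        normalize(desired_json),
        normalize(live_json),
        fromfile=f"git/{name}",
        tofile=f"live/{name}",
        lineterm="",
    ))
```

An empty string means the two manifests are equivalent after key sorting. A diff in this form is also much shorter than the full manifests, which helps when the text is truncated before being sent to the model.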

Step 2: LangChain Agent with Analysis Tools

python
# drift_analyzer.py
import os
from langchain_anthropic import ChatAnthropic
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.tools import tool
from langchain_core.prompts import ChatPromptTemplate
from argocd_client import ArgoCDClient, DriftedApp
 
# Initialize clients
argocd = ArgoCDClient(
    server_url=os.getenv("ARGOCD_SERVER_URL"),
    token=os.getenv("ARGOCD_TOKEN"),
    verify_ssl=False  # Set True in production with valid certs
)
 
llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    max_tokens=4096
)
 
@tool
def get_drifted_applications() -> str:
    """Fetch all applications from ArgoCD that are out of sync with their Git source."""
    drifted = argocd.get_drifted_apps()
    if not drifted:
        return "No drifted applications found. All applications are in sync."
    
    summary = []
    for app in drifted:
        summary.append(f"""
Application: {app.name}
Namespace: {app.namespace}
Sync Status: {app.sync_status}
Health Status: {app.health_status}
Repository: {app.repo_url}
Target Revision: {app.target_revision}
""")
    return "\n---\n".join(summary)
 
@tool
def get_application_drift_details(app_name: str) -> str:
    """
    Get the detailed diff for a specific application showing what changed 
    between the desired state (Git) and the live state (cluster).
    
    Args:
        app_name: Name of the ArgoCD application to inspect
    """
    drifted = argocd.get_drifted_apps()
    for app in drifted:
        if app.name == app_name:
            return f"""
Application: {app.name}
Health: {app.health_status}
 
Manifest Diff:
{app.diff}
"""
    return f"Application {app_name} is not drifted or does not exist."
 
@tool  
def classify_drift_risk(app_name: str, diff_content: str) -> str:
    """
    Classify the risk level of detected drift. Use this after getting drift details.
    
    Args:
        app_name: Application name
        diff_content: The diff content to analyze
    """
    # This tool triggers deeper LLM analysis
    return f"Risk classification requested for {app_name}. Analyze the diff: {diff_content[:1000]}"
 
# Define the agent prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a senior GitOps engineer with deep expertise in Kubernetes and ArgoCD.
 
Your job is to detect configuration drift, analyze what changed, assess the risk, and recommend remediation steps.
 
When analyzing drift:
1. Always start by fetching drifted applications
2. For each drifted app, get the detailed diff
3. Classify the risk: CRITICAL (security changes, replica count to 0, resource limit removal), 
   HIGH (image tag changes, environment variable changes, service port changes), 
   MEDIUM (annotation/label changes, resource adjustments), 
   LOW (metadata-only changes)
4. Determine if the change looks intentional (recent deployment) or accidental (random manual kubectl edit)
5. Provide specific remediation: sync the app, investigate before syncing, or escalate
 
Format your final report as:
## Drift Detection Report
**Timestamp:** [current time]
**Total Drifted Applications:** [count]
 
For each app:
### [App Name]
- **Risk Level:** CRITICAL/HIGH/MEDIUM/LOW
- **What Changed:** [specific changes]
- **Why It Matters:** [business/security impact]
- **Recommended Action:** [exact steps]
"""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}")
])
 
tools = [get_drifted_applications, get_application_drift_details, classify_drift_risk]
 
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10
)
 
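def run_drift_analysis() -> str:
    # The model has no clock, so pass the current time in for the report header.
    from datetime import datetime, timezone
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    result = agent_executor.invoke({
        "input": (
            f"Current time: {now}. "
            "Analyze all drifted ArgoCD applications. For each one, determine what changed, "
            "assess the risk level, and provide specific remediation steps."
        )
    })
    return result["output"]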
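
The risk rubric in the system prompt can also be approximated with plain rules, which gives a deterministic cross-check on the LLM's classification. A rough sketch, assuming a hypothetical `rule_based_risk` helper that compares two Deployment-shaped dicts. This is not how the agent above classifies risk, and the rules cover only a few cases from the rubric.

```python
def rule_based_risk(desired: dict, live: dict) -> str:
    """Classify drift between two Deployment-shaped dicts (hypothetical helper)."""
    d_spec, l_spec = desired.get("spec", {}), live.get("spec", {})

    # CRITICAL: live replicas scaled to zero while Git wants at least one.
    if l_spec.get("replicas") == 0 and d_spec.get("replicas", 1) != 0:
        return "CRITICAL"

    def containers(spec: dict) -> dict:
        pod = spec.get("template", {}).get("spec", {})
        return {c.get("name"): c for c in pod.get("containers", [])}

    d_ctrs, l_ctrs = containers(d_spec), containers(l_spec)

    for name, want in d_ctrs.items():
        have = l_ctrs.get(name, {})
        # CRITICAL: resource limits present in Git but missing in the cluster.
        if want.get("resources", {}).get("limits") and not have.get("resources", {}).get("limits"):
            return "CRITICAL"

    for name, want in d_ctrs.items():
        have = l_ctrs.get(name, {})
        # HIGH: image or environment variables differ.
        if want.get("image") != have.get("image") or want.get("env") != have.get("env"):
            return "HIGH"

    # MEDIUM: some other spec difference. LOW: only metadata differs.
    # NONE is an extra value for "no drift", not part of the prompt's rubric.
    if d_spec != l_spec:
        return "MEDIUM"
    return "LOW" if desired.get("metadata") != live.get("metadata") else "NONE"
```

If the rules say CRITICAL and the LLM says LOW, that disagreement is itself worth flagging for a human.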

Step 3: Main Runner with Notifications

python
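# main.py
import os
from datetime import datetime

import requests
from dotenv import load_dotenv

# Load .env before importing drift_analyzer, which reads ARGOCD_SERVER_URL,
# ARGOCD_TOKEN and ANTHROPIC_API_KEY at import time.
load_dotenv()

from drift_analyzer import run_drift_analysis  # noqa: E402

def send_slack_notification(report: str, webhook_url: str):
    # Truncate the report so the message stays within Slack's size limits
    payload = {
        "text": f":rotating_light: *GitOps Drift Detected*\n```{report[:2800]}```"
    }
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()  # Surface failed deliveries instead of ignoring them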
 
def save_report(report: str):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"drift_report_{timestamp}.md"
    with open(filename, "w") as f:
        f.write(report)
    print(f"Report saved: {filename}")
 
if __name__ == "__main__":
    print("Starting GitOps Drift Analysis...")
    
    report = run_drift_analysis()
    
    print("\n" + "="*60)
    print("DRIFT ANALYSIS REPORT")
    print("="*60)
    print(report)
    
    # Save report
    save_report(report)
    
    # Send to Slack if webhook configured
    slack_webhook = os.getenv("SLACK_WEBHOOK_URL")
    if slack_webhook and ("CRITICAL" in report or "HIGH" in report):
        send_slack_notification(report, slack_webhook)
        print("Alert sent to Slack")
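
A plain substring check on the report can fire on the word HIGH anywhere in the text, for example inside a "Why It Matters" sentence. A stricter sketch that reads only the `**Risk Level:**` lines of the report format defined in the system prompt; `highest_risk` and `should_alert` are hypothetical helpers, not part of the runner above.

```python
import re
from typing import Optional

LEVELS = ("CRITICAL", "HIGH", "MEDIUM", "LOW")  # most to least severe
RISK_LINE = re.compile(r"\*\*Risk Level:\*\*\s*(CRITICAL|HIGH|MEDIUM|LOW)\b")

def highest_risk(report: str) -> Optional[str]:
    """Return the most severe level found on a Risk Level line, or None."""
    found = set(RISK_LINE.findall(report))
    for level in LEVELS:
        if level in found:
            return level
    return None

def should_alert(report: str) -> bool:
    """Alert only when at least one application is rated CRITICAL or HIGH."""
    return highest_risk(report) in ("CRITICAL", "HIGH")
```

This still depends on the LLM following the requested format, so it is worth logging reports where no Risk Level line is found at all.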

Step 4: Run as a Kubernetes CronJob

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gitops-drift-detector
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: drift-detector
            image: your-registry/drift-detector:latest
            env:
            - name: ARGOCD_SERVER_URL
              value: "https://argocd.internal"
            - name: ARGOCD_TOKEN
              valueFrom:
                secretKeyRef:
                  name: drift-detector-secrets
                  key: argocd-token
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: drift-detector-secrets
                  key: anthropic-api-key
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: drift-detector-secrets
                  key: slack-webhook
          restartPolicy: OnFailure

Sample Output

markdown
## Drift Detection Report
**Timestamp:** 2026-06-12 14:23:11
**Total Drifted Applications:** 2
 
### production-api
- **Risk Level:** HIGH
- **What Changed:** Container image tag changed from `v2.1.4` to `latest` 
  in the live cluster. This was not reflected in Git.
- **Why It Matters:** Using `latest` tag in production means you cannot 
  reproduce this deployment. If the pod restarts, it may pull a different 
  image. This is a reliability and security risk.
- **Recommended Action:** 
  1. Immediately check who ran `kubectl set image` — look in audit logs
  2. Do NOT sync ArgoCD (this would revert to v2.1.4 which may be correct)
  3. Update Git to reflect the intended image tag
  4. Then sync to restore GitOps control
 
### staging-worker
- **Risk Level:** LOW  
- **What Changed:** Annotation `kubectl.kubernetes.io/last-applied-configuration` 
  updated. No functional changes detected.
- **Why It Matters:** This is a metadata-only change, likely from a manual 
  kubectl apply during testing.
- **Recommended Action:** Sync the application to restore Git state.
  Run: `argocd app sync staging-worker`

What This Agent Does That ArgoCD Can't

ArgoCD tells you "App X is OutOfSync" and shows you the raw diff. Interpreting that diff is left to you.

The agent tells you:

  • What specifically changed (image tag, replica count, environment variable)
  • Why that specific change matters (security risk, reliability issue, compliance concern)
  • Whether to sync immediately or investigate first
  • How to fix it with exact commands

For teams running 50+ ArgoCD applications, this turns an overwhelming stream of drift alerts into a short, prioritized list of actions.
