Build a GitOps Drift Detector with LangChain + ArgoCD API
ArgoCD tells you when drift happens — but not why it matters or what to do. Build an AI agent with LangChain that detects drift, explains the risk, and suggests the right fix.
ArgoCD already detects drift. It compares your Git state with your cluster state and tells you when they don't match. That part works fine.
What it doesn't do is interpret the drift. ArgoCD shows you a raw diff, but it won't tell you why that drift is dangerous, whether it was intentional, or what the operator should do next.
That interpretation gap is where an AI agent helps. Let's build a LangChain agent that polls ArgoCD for out-of-sync applications, analyzes the drift, and generates a human-readable incident summary with remediation steps.
What We're Building
ArgoCD API → Drift Detection → LangChain Agent → Analysis + Remediation Report
The agent will:
- Poll the ArgoCD API for all applications
- Find out-of-sync applications
- Fetch the diff between desired (Git) and live (cluster) state
- Use an LLM to analyze the diff and classify the risk
- Generate a structured report with: what changed, why it's risky, recommended action
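The report structure in this list is enforced only by the system prompt in Step 2, so the model can stray from it. If you want that structure to be checkable in code, one option is to model each finding explicitly and render the markdown yourself. A minimal sketch follows; `RiskLevel`, `Finding` and `render_report` are hypothetical names, not part of LangChain or ArgoCD:

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class RiskLevel(IntEnum):
    # Integer values give a severity order, so max() and sorted() work directly
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Finding:
    app_name: str
    risk: RiskLevel
    what_changed: str
    why_it_matters: str
    recommended_action: str


def render_report(findings: List[Finding], timestamp: str) -> str:
    """Render findings in the same markdown layout the agent prompt asks for."""
    lines = [
        "## Drift Detection Report",
        f"**Timestamp:** {timestamp}",
        f"**Total Drifted Applications:** {len(findings)}",
    ]
    # Most severe first, so CRITICAL items sit at the top of the report
    for item in sorted(findings, key=lambda x: x.risk, reverse=True):
        lines += [
            f"### {item.app_name}",
            f"- **Risk Level:** {item.risk.name}",
            f"- **What Changed:** {item.what_changed}",
            f"- **Why It Matters:** {item.why_it_matters}",
            f"- **Recommended Action:** {item.recommended_action}",
        ]
    return "\n".join(lines)
```

With this split, the LLM would fill in `Finding` objects (for example via structured output) and the layout would no longer depend on the model following formatting instructions.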
Prerequisites
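```bash
pip install "langchain<1.0" langchain-anthropic python-dotenv requests
```

The agent code imports `create_tool_calling_agent` and `AgentExecutor` from `langchain.agents`, which is where the 0.x releases keep them; LangChain 1.0 moved these legacy agent APIs out of that module, hence the version pin. The ArgoCD REST API is called directly with `requests`, so no ArgoCD client library is needed.

You need: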
- ArgoCD running with API access
- Anthropic API key (or swap for OpenAI)
- ArgoCD API token
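The scripts below read these credentials from environment variables (or a `.env` file via python-dotenv). Because `os.getenv` returns `None` for a missing variable, a typo would otherwise surface later as a confusing error deep inside a client. A small up-front check, sketched here with a hypothetical `load_settings` helper, fails fast instead:

```python
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("ARGOCD_SERVER_URL", "ARGOCD_TOKEN", "ANTHROPIC_API_KEY")


@dataclass(frozen=True)
class Settings:
    argocd_server_url: str
    argocd_token: str
    anthropic_api_key: str
    slack_webhook_url: Optional[str]  # optional: Slack alerts are skipped when unset


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate required variables up front and report every missing one at once."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return Settings(
        argocd_server_url=env["ARGOCD_SERVER_URL"],
        argocd_token=env["ARGOCD_TOKEN"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL") or None,
    )
```

Usage would be `settings = load_settings(os.environ)` right after `load_dotenv()`. Taking a plain mapping rather than reading `os.environ` inside the function keeps it easy to test.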
Step 1: ArgoCD API Client
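```python
# argocd_client.py
import difflib
import json
from dataclasses import dataclass
from typing import List

import requests


@dataclass
class DriftedApp:
    name: str
    namespace: str  # destination namespace the app deploys to
    sync_status: str
    health_status: str
    repo_url: str
    target_revision: str
    diff: str


class ArgoCDClient:
    def __init__(self, server_url: str, token: str, verify_ssl: bool = True, timeout: int = 30):
        self.server_url = server_url.rstrip("/")
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }
        self.verify_ssl = verify_ssl
        self.timeout = timeout  # seconds; keeps the job from hanging on a stuck API call

    def _get(self, path: str) -> requests.Response:
        return requests.get(
            f"{self.server_url}{path}",
            headers=self.headers,
            verify=self.verify_ssl,
            timeout=self.timeout,
        )

    def get_all_applications(self) -> List[dict]:
        resp = self._get("/api/v1/applications")
        resp.raise_for_status()
        return resp.json().get("items") or []  # "items" can be missing or null

    def get_application_diff(self, app_name: str, max_chars: int = 3000) -> str:
        # /manifests only returns the rendered Git manifests, not a diff.
        # /managed-resources returns desired and live state for each resource.
        resp = self._get(f"/api/v1/applications/{app_name}/managed-resources")
        if resp.status_code != 200:
            return "Diff unavailable"

        lines: List[str] = []
        for res in resp.json().get("items") or []:
            if not res.get("modified"):
                continue  # resource matches Git
            # States are JSON strings; prefer the normalized/predicted variants
            # when present, which reduce noise from defaulted fields
            desired = json.loads(res.get("predictedLiveState") or res.get("targetState") or "null")
            live = json.loads(res.get("normalizedLiveState") or res.get("liveState") or "null")
            ident = f"{res.get('kind')}/{res.get('name')}"
            lines.extend(difflib.unified_diff(
                json.dumps(desired, indent=2, sort_keys=True).splitlines(),
                json.dumps(live, indent=2, sort_keys=True).splitlines(),
                fromfile=f"git/{ident}",
                tofile=f"live/{ident}",
                lineterm="",
            ))
        # Limit size so the diff fits comfortably in the LLM context
        return "\n".join(lines)[:max_chars] or "No resource-level diff reported"

    def get_drifted_apps(self) -> List[DriftedApp]:
        drifted = []
        for app in self.get_all_applications():
            status = app.get("status", {})
            sync_status = status.get("sync", {}).get("status", "")
            if sync_status != "OutOfSync":
                continue
            spec = app.get("spec", {})
            source = spec.get("source", {})  # multi-source apps use spec.sources instead
            name = app["metadata"]["name"]
            drifted.append(DriftedApp(
                name=name,
                # metadata.namespace is where the Application CR lives (often "argocd");
                # the destination namespace is where the drifted workload runs
                namespace=spec.get("destination", {}).get("namespace", ""),
                sync_status=sync_status,
                health_status=status.get("health", {}).get("status", "Unknown"),
                repo_url=source.get("repoURL", ""),
                target_revision=source.get("targetRevision", ""),
                diff=self.get_application_diff(name),
            ))
        return drifted
```

Step 2: LangChain Agent with Analysis Tools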
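The `classify_drift_risk` tool in this step doesn't classify anything itself; it hands the diff back to the model, which applies the rubric from the system prompt. If you want a deterministic floor under that judgment, a keyword pass over the diff can flag the obvious cases before the LLM sees it. The sketch below is a hypothetical `prescreen_risk` helper. It assumes a unified diff from the Git state to the live state (so `+` lines are what the cluster has now), and its patterns are illustrative, covering only a few items from the rubric:

```python
import re
from typing import Optional

# (level, diff sign, pattern). Sign "+" = line only in live, "-" = line only in Git,
# "" = either. Patterns target JSON-rendered manifests and are illustrative only.
RULES = [
    ("CRITICAL", "+", r'"replicas":\s*0\b'),
    ("CRITICAL", "+", r'"privileged":\s*true'),
    ("CRITICAL", "-", r'"limits":'),  # limits block present in Git, gone from live
    ("HIGH", "", r'"image":'),
    ("HIGH", "", r'"(containerPort|targetPort|port)":'),
    ("HIGH", "", r'"env":'),
    ("MEDIUM", "", r'"(cpu|memory)":'),
]
SEVERITY = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def prescreen_risk(diff: str) -> Optional[str]:
    """Return the most severe rubric level matched by the diff, or None to defer to the LLM."""
    best: Optional[str] = None
    for line in diff.splitlines():
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue  # skip file headers, hunk markers and unchanged context lines
        sign, body = line[0], line[1:]
        for level, want_sign, pattern in RULES:
            if want_sign and want_sign != sign:
                continue
            if re.search(pattern, body) and (best is None or SEVERITY[level] > SEVERITY[best]):
                best = level
    return best
```

A `None` result means no rule fired, so the LLM's classification stands; a non-`None` result could be passed to the agent as a minimum risk level it must not go below.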
```python
# drift_analyzer.py
import os
from datetime import datetime, timezone

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool

from argocd_client import ArgoCDClient

# This module reads settings at import time, so load .env first
load_dotenv()

# Initialize clients (os.environ[...] fails fast if a variable is missing)
argocd = ArgoCDClient(
    server_url=os.environ["ARGOCD_SERVER_URL"],
    token=os.environ["ARGOCD_TOKEN"],
    verify_ssl=False,  # Set True in production with valid certs
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=4096,
)


@tool
def get_drifted_applications() -> str:
    """Fetch all applications from ArgoCD that are out of sync with their Git source."""
    drifted = argocd.get_drifted_apps()
    if not drifted:
        return "No drifted applications found. All applications are in sync."

    summary = []
    for app in drifted:
        summary.append(
            f"Application: {app.name}\n"
            f"Namespace: {app.namespace}\n"
            f"Sync Status: {app.sync_status}\n"
            f"Health Status: {app.health_status}\n"
            f"Repository: {app.repo_url}\n"
            f"Target Revision: {app.target_revision}"
        )
    return "\n---\n".join(summary)


@tool
def get_application_drift_details(app_name: str) -> str:
    """
    Get the detailed diff for a specific application showing what changed
    between the desired state (Git) and the live state (cluster).

    Args:
        app_name: Name of the ArgoCD application to inspect
    """
    # Look up only this app instead of re-fetching diffs for every drifted app
    for app in argocd.get_all_applications():
        if app["metadata"]["name"] != app_name:
            continue
        status = app.get("status", {})
        if status.get("sync", {}).get("status") != "OutOfSync":
            return f"Application {app_name} is in sync; no drift to inspect."
        health = status.get("health", {}).get("status", "Unknown")
        return (
            f"Application: {app_name}\n"
            f"Health: {health}\n"
            f"Manifest Diff:\n{argocd.get_application_diff(app_name)}"
        )
    return f"Application {app_name} does not exist."


@tool
def classify_drift_risk(app_name: str, diff_content: str) -> str:
    """
    Classify the risk level of detected drift. Use this after getting drift details.

    Args:
        app_name: Application name
        diff_content: The diff content to analyze
    """
    # No extra LLM call happens here: the tool echoes the (truncated) diff back,
    # and the agent's own LLM applies the risk rubric from the system prompt
    return f"Risk classification requested for {app_name}. Analyze the diff: {diff_content[:1000]}"


# Define the agent prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a senior GitOps engineer with deep expertise in Kubernetes and ArgoCD.
Your job is to detect configuration drift, analyze what changed, assess the risk, and recommend remediation steps.

When analyzing drift:
1. Always start by fetching drifted applications
2. For each drifted app, get the detailed diff
3. Classify the risk: CRITICAL (security changes, replica count to 0, resource limit removal),
   HIGH (image tag changes, environment variable changes, service port changes),
   MEDIUM (annotation/label changes, resource adjustments),
   LOW (metadata-only changes)
4. Determine if the change looks intentional (recent deployment) or accidental (random manual kubectl edit)
5. Provide specific remediation: sync the app, investigate before syncing, or escalate

Format your final report as:
## Drift Detection Report
**Timestamp:** {current_time}
**Total Drifted Applications:** [count]

For each app:
### [App Name]
- **Risk Level:** CRITICAL/HIGH/MEDIUM/LOW
- **What Changed:** [specific changes]
- **Why It Matters:** [business/security impact]
- **Recommended Action:** [exact steps]
"""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [get_drifted_applications, get_application_drift_details, classify_drift_risk]
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=10,  # each drifted app needs 2-3 tool calls; raise for larger fleets
)


def run_drift_analysis() -> str:
    result = agent_executor.invoke({
        "input": (
            "Analyze all drifted ArgoCD applications. For each one, determine what changed, "
            "assess the risk level, and provide specific remediation steps."
        ),
        # The model has no reliable clock, so pass the report timestamp in explicitly
        "current_time": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC"),
    })
    return result["output"]
```

Step 3: Main Runner with Notifications
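The runner below alerts Slack whenever the strings `CRITICAL` or `HIGH` appear anywhere in the report, so a sentence like "No CRITICAL issues found" would still page someone. A stricter option is to read only the `**Risk Level:**` lines that the system prompt asks the model to emit. A sketch with hypothetical `risk_levels` and `should_alert` helpers:

```python
import re
from typing import List

LEVELS = ("LOW", "MEDIUM", "HIGH", "CRITICAL")  # least to most severe
RISK_LINE = re.compile(r"\*\*Risk Level:\*\*\s*(CRITICAL|HIGH|MEDIUM|LOW)\b")


def risk_levels(report: str) -> List[str]:
    """Return the risk level of every finding, in report order."""
    return RISK_LINE.findall(report)


def should_alert(report: str, threshold: str = "HIGH") -> bool:
    """True if any finding is at or above the threshold severity."""
    floor = LEVELS.index(threshold)
    return any(LEVELS.index(level) >= floor for level in risk_levels(report))
```

In `main.py`, the alert condition would then read `if slack_webhook and should_alert(report):`.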
```python
# main.py
import os
from datetime import datetime

import requests
from dotenv import load_dotenv

# Load .env BEFORE importing drift_analyzer, which reads env vars at import time
load_dotenv()

from drift_analyzer import run_drift_analysis  # noqa: E402


def send_slack_notification(report: str, webhook_url: str) -> None:
    # Truncate to keep the Slack message short
    payload = {
        "text": f":rotating_light: *GitOps Drift Detected*\n```{report[:2800]}```"
    }
    resp = requests.post(webhook_url, json=payload, timeout=10)
    resp.raise_for_status()


def save_report(report: str) -> None:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"drift_report_{timestamp}.md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(report)
    print(f"Report saved: {filename}")


if __name__ == "__main__":
    print("Starting GitOps Drift Analysis...")
    report = run_drift_analysis()
    print("\n" + "=" * 60)
    print("DRIFT ANALYSIS REPORT")
    print("=" * 60)
    print(report)

    # Save report
    save_report(report)

    # Send to Slack only if a webhook is configured AND the report has a
    # CRITICAL or HIGH finding. The parentheses matter: `and` binds tighter than `or`.
    slack_webhook = os.getenv("SLACK_WEBHOOK_URL")
    if slack_webhook and ("CRITICAL" in report or "HIGH" in report):
        send_slack_notification(report, slack_webhook)
        print("Alert sent to Slack")
```

Step 4: Run as a Kubernetes CronJob
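Scheduled every 15 minutes, the job will re-post the same alert on every run until someone resolves the drift. One way to suppress repeats is to fingerprint each drifted application's diff and only alert on fingerprints not seen in the previous run. The sketch below, with hypothetical `fingerprint` and `new_drift` helpers, covers only the comparison; where the previous set lives between CronJob runs (a ConfigMap, an object store, a small database) is left as an assumption, since a Job pod's local filesystem does not survive to the next run:

```python
import hashlib
from typing import Dict, Set, Tuple


def fingerprint(app_name: str, diff: str) -> str:
    """Stable ID for 'this app drifted in exactly this way'."""
    return hashlib.sha256(f"{app_name}\n{diff}".encode("utf-8")).hexdigest()[:16]


def new_drift(current: Dict[str, str], previous: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split current drift into (apps to alert on, fingerprints to persist).

    current  -- app name -> diff text for apps that are OutOfSync right now
    previous -- fingerprints persisted by the last run (storage is up to you)
    """
    prints = {name: fingerprint(name, diff) for name, diff in current.items()}
    to_alert = {name for name, fp in prints.items() if fp not in previous}
    return to_alert, set(prints.values())
```

This only works if the diff text is stable between runs; volatile fields such as `resourceVersion` or timestamps in the diff would produce a new fingerprint every time.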
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gitops-drift-detector
  namespace: monitoring
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  concurrencyPolicy: Forbid  # skip a run if the previous analysis is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: drift-detector
              image: your-registry/drift-detector:latest  # pin a versioned tag in production
              env:
                - name: ARGOCD_SERVER_URL
                  value: "https://argocd.internal"
                - name: ARGOCD_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: drift-detector-secrets
                      key: argocd-token
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: drift-detector-secrets
                      key: anthropic-api-key
                - name: SLACK_WEBHOOK_URL
                  valueFrom:
                    secretKeyRef:
                      name: drift-detector-secrets
                      key: slack-webhook
          restartPolicy: OnFailure
```

Sample Output
```markdown
## Drift Detection Report
**Timestamp:** 2026-06-12 14:23:11
**Total Drifted Applications:** 2

### production-api
- **Risk Level:** HIGH
- **What Changed:** Container image tag changed from `v2.1.4` to `latest`
  in the live cluster. This was not reflected in Git.
- **Why It Matters:** Using the `latest` tag in production means you cannot
  reproduce this deployment. If the pod restarts, it may pull a different
  image. This is a reliability and security risk.
- **Recommended Action:**
  1. Immediately check who ran `kubectl set image`; look in the audit logs
  2. Do NOT sync ArgoCD yet (syncing would revert to `v2.1.4` before you know which tag is intended)
  3. Update Git to reflect the intended image tag
  4. Then sync to restore GitOps control

### staging-worker
- **Risk Level:** LOW
- **What Changed:** Annotation `kubectl.kubernetes.io/last-applied-configuration`
  updated. No functional changes detected.
- **Why It Matters:** This is a metadata-only change, likely from a manual
  kubectl apply during testing.
- **Recommended Action:** Sync the application to restore Git state.
  Run: `argocd app sync staging-worker`
```

What This Agent Does That ArgoCD Can't
ArgoCD tells you "App X is OutOfSync" and shows you a raw diff if you go looking. That's where it stops.
The agent tells you:
- What specifically changed (image tag, replica count, environment variable)
- Why that specific change matters (security risk, reliability issue, compliance concern)
- Whether to sync immediately or investigate first
- How to fix it with exact commands
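Whether to sync immediately or investigate first is ultimately a policy decision that the prompt delegates to the model. If you prefer to keep that policy in code and let the LLM supply only the risk level and an intentional/accidental judgment, a mapping might look like the sketch below. `Action` and `choose_action` are hypothetical names, and the thresholds are an assumption that mirrors the prompt's three options (sync, investigate before syncing, escalate), not an ArgoCD rule:

```python
from enum import Enum


class Action(str, Enum):
    SYNC = "sync"                # safe to restore Git state now
    INVESTIGATE = "investigate"  # find out who changed what before syncing
    ESCALATE = "escalate"        # page a human owner


def choose_action(risk: str, looks_intentional: bool) -> Action:
    """Map the agent's risk level and intent judgment to a remediation path."""
    if risk == "CRITICAL":
        return Action.ESCALATE
    if risk == "HIGH" or looks_intentional:
        # An intentional hotfix should be committed to Git, not reverted by a sync
        return Action.INVESTIGATE
    return Action.SYNC  # MEDIUM/LOW accidental drift: restore Git state
```

Keeping this mapping in code makes the remediation policy reviewable and testable, while the LLM stays responsible for reading the diff.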
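For teams running 50+ ArgoCD applications, that context is what can turn a flood of OutOfSync alerts into a short list of actionable ones. At that scale, also raise the agent's `max_iterations`, since each drifted application costs a few tool calls.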