Build a Kubernetes Cost Optimizer with Claude API + Prometheus Metrics
Your Kubernetes cluster is probably wasting 40-60% of its compute cost on over-provisioned resources. Build an AI-powered cost optimizer that reads Prometheus metrics and gives specific rightsizing recommendations.
Commonly cited estimates put the share of Kubernetes compute spend lost to over-provisioned resources at 40-60%. The usual story: an engineer sets a CPU request of "1 core" for a service that actually uses 0.1 cores, and it stays that way for years because nobody goes back to check.
Let's build an AI optimizer that reads your actual Prometheus metrics and generates specific, actionable rightsizing recommendations — with cost savings estimates.
What We're Building
Prometheus API → Real usage data → Claude API → Specific recommendations + savings estimate
The optimizer will:
- Query Prometheus for actual CPU and memory usage per deployment
- Compare actual usage against requested/limits
- Identify over-provisioned and under-provisioned resources
- Calculate potential cost savings
- Generate specific `kubectl` commands to implement the recommendations
Prerequisites
```bash
pip install anthropic requests python-dotenv tabulate
```

You need:
- Prometheus running in your cluster (accessible via HTTP), scraping cAdvisor (the `container_*` metrics) and kube-state-metrics (the `kube_*` metrics)
- Anthropic API key
- `kubectl` access to your cluster
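Before building anything, you might want to confirm that Prometheus answers queries and that the metric families Step 1 relies on are present. Below is a minimal stdlib-only sketch that calls the same `/api/v1/query` endpoint the client in Step 1 uses. `check_prometheus` and `parse_vector` are hypothetical helper names, and the default `up` query is just a convenient smoke test.

```python
import json
import urllib.parse
import urllib.request


def parse_vector(payload: dict) -> dict:
    """Turn a Prometheus instant-query response into {sorted-label-tuple: float}.

    Shape: {"status": "success", "data": {"resultType": "vector",
            "result": [{"metric": {...}, "value": [<unix_ts>, "<value string>"]}]}}
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus error: {payload.get('error', 'unknown')}")
    out = {}
    for sample in payload["data"]["result"]:
        labels = tuple(sorted(sample["metric"].items()))
        out[labels] = float(sample["value"][1])  # values arrive as strings
    return out


def check_prometheus(base_url: str, promql: str = "up") -> dict:
    """Run one instant query against base_url (hypothetical helper)."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_vector(json.load(resp))
```

You can run `check_prometheus(url, "kube_deployment_spec_replicas")` the same way to check that kube-state-metrics is being scraped. An empty dict there means Step 1 will find no deployments.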
Step 1: Prometheus Query Engine
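Step 1 issues only instant queries. However, the `PrometheusClient` below also defines a `query_range` method, which returns a "matrix": one series per label set, each holding `values` as `[timestamp, "value"]` pairs. If you prefer computing percentiles in Python rather than in PromQL, here is a minimal sketch. The `series_percentile` helper is hypothetical, and `statistics.quantiles` with `method="inclusive"` is only one of several reasonable percentile definitions.

```python
import statistics


def series_percentile(matrix: list, pct: int = 95) -> dict:
    """Percentile of each series in a Prometheus range-query ("matrix") result.

    Each element looks like {"metric": {...}, "values": [[ts, "1.5"], ...]}.
    Returns {(namespace, pod): value}. Series with fewer than 2 usable points are
    skipped, because older Pythons' statistics.quantiles rejects them.
    """
    out = {}
    for series in matrix:
        # Prometheus encodes values as strings; drop NaN samples (NaN != NaN)
        points = [p for p in (float(v) for _, v in series["values"]) if p == p]
        if len(points) < 2:
            continue
        # n=100 yields 99 cut points; index pct-1 is the pct-th percentile
        cuts = statistics.quantiles(points, n=100, method="inclusive")
        key = (series["metric"].get("namespace"), series["metric"].get("pod"))
        out[key] = cuts[pct - 1]
    return out
```

With `series_percentile(prom.query_range(promql, hours=24))`, every raw sample is available client-side, so you could derive avg and max from the same fetch. The cost is transferring roughly 288 points per series at the 5m step. Computing the percentile server-side with `quantile_over_time` avoids that transfer.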
```python
# prometheus_client.py
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import requests


@dataclass
class WorkloadMetrics:
    namespace: str
    deployment: str
    # Actual per-replica usage over the lookback window (cAdvisor metrics)
    avg_cpu_cores: float
    p95_cpu_cores: float
    max_cpu_cores: float
    avg_memory_bytes: float
    p95_memory_bytes: float
    max_memory_bytes: float
    # Configured per-replica values (kube-state-metrics kube_* metrics); 0.0 = not set
    cpu_request_cores: float
    cpu_limit_cores: float
    memory_request_bytes: float
    memory_limit_bytes: float
    replicas: int


# Deployment pods are named <deployment>-<pod-template-hash>-<suffix>. Stripping the
# last two segments is a naming heuristic, not an ownership lookup; a stricter join
# would go through kube_pod_owner and kube_replicaset_owner.
POD_NAME = re.compile(r"^(?P<deployment>.+)-[a-z0-9]+-[a-z0-9]+$")


class PrometheusClient:
    def __init__(self, url: str):
        self.url = url.rstrip("/")

    def query(self, promql: str) -> list:
        """Instant query; returns the raw result vector."""
        resp = requests.get(
            f"{self.url}/api/v1/query",
            params={"query": promql},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("data", {}).get("result", [])

    def query_range(self, promql: str, hours: int = 24) -> list:
        """Range query over the last `hours` hours at a 5m step; returns a matrix."""
        end = datetime.now()
        start = end - timedelta(hours=hours)
        resp = requests.get(
            f"{self.url}/api/v1/query_range",
            params={
                "query": promql,
                "start": start.timestamp(),
                "end": end.timestamp(),
                "step": "5m",
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json().get("data", {}).get("result", [])

    def _per_deployment(self, promql: str) -> dict:
        """Run a per-pod query and average it per (namespace, deployment)."""
        buckets: dict = {}
        for r in self.query(promql):
            match = POD_NAME.match(r["metric"].get("pod", ""))
            if not match:
                continue
            key = (r["metric"].get("namespace"), match.group("deployment"))
            try:
                buckets.setdefault(key, []).append(float(r["value"][1]))
            except (KeyError, IndexError, ValueError):
                continue
        # Averaging across a deployment's pods gives a per-replica figure
        return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

    def get_all_deployment_metrics(self, namespace: Optional[str] = None,
                                   window: str = "24h") -> list[WorkloadMetrics]:
        ns_filter = f', namespace="{namespace}"' if namespace else ""
        # container!="" drops cAdvisor's pod-level series; "POD" is the pause container
        sel = f'container!="", container!="POD"{ns_filter}'
        # Sum across containers so every value is per pod
        cpu = f"sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{{{sel}}}[5m]))"
        mem = f"sum by (namespace, pod) (container_memory_working_set_bytes{{{sel}}})"

        def spec(metric: str, resource: str) -> str:
            return f'sum by (namespace, pod) ({metric}{{resource="{resource}"{ns_filter}}})'

        # [window:5m] subqueries sample each pod every 5m across the window, so
        # p95 and max are taken over time rather than across pods at one instant.
        queries = {
            "avg_cpu": f"avg_over_time({cpu}[{window}:5m])",
            "p95_cpu": f"quantile_over_time(0.95, {cpu}[{window}:5m])",
            "max_cpu": f"max_over_time({cpu}[{window}:5m])",
            "avg_mem": f"avg_over_time({mem}[{window}:5m])",
            "p95_mem": f"quantile_over_time(0.95, {mem}[{window}:5m])",
            "max_mem": f"max_over_time({mem}[{window}:5m])",
            "cpu_req": spec("kube_pod_container_resource_requests", "cpu"),
            "cpu_lim": spec("kube_pod_container_resource_limits", "cpu"),
            "mem_req": spec("kube_pod_container_resource_requests", "memory"),
            "mem_lim": spec("kube_pod_container_resource_limits", "memory"),
        }
        data = {name: self._per_deployment(q) for name, q in queries.items()}

        dep_selector = f'{{namespace="{namespace}"}}' if namespace else ""
        results = []
        for dep in self.query(f"kube_deployment_spec_replicas{dep_selector}"):
            key = (dep["metric"]["namespace"], dep["metric"]["deployment"])
            if key not in data["avg_cpu"]:
                continue  # no usage samples: skip rather than invent numbers
            v = {name: values.get(key, 0.0) for name, values in data.items()}
            results.append(WorkloadMetrics(
                namespace=key[0],
                deployment=key[1],
                avg_cpu_cores=v["avg_cpu"],
                p95_cpu_cores=v["p95_cpu"],
                max_cpu_cores=v["max_cpu"],
                avg_memory_bytes=v["avg_mem"],
                p95_memory_bytes=v["p95_mem"],
                max_memory_bytes=v["max_mem"],
                cpu_request_cores=v["cpu_req"],
                cpu_limit_cores=v["cpu_lim"],
                memory_request_bytes=v["mem_req"],
                memory_limit_bytes=v["mem_lim"],
                replicas=int(float(dep["value"][1])),
            ))
        return results
```

Step 2: Cost Calculator
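The calculator below treats p95 usage plus 20% headroom as the right request and prices the gap. Per deployment, monthly savings = max(0, request − 1.2 × p95) × replicas × hourly rate × 730 hours. The sketch works through the introduction's case: a 1-core CPU request on a service using about 0.1 cores, with 0.1 taken as its p95. It assumes 4 replicas, a number chosen only for illustration, and uses the $0.048/core-hour rate from `cost_calculator.py`.

```python
# Worked example of the savings formula (CPU only); constants mirror cost_calculator.py
HEADROOM = 1.2              # size requests at p95 + 20%
COST_PER_CORE_HOUR = 0.048  # illustrative rate from the calculator
HOURS_PER_MONTH = 730


def monthly_cpu_savings(request_cores: float, p95_cores: float, replicas: int) -> float:
    """Cores requested beyond p95 + headroom, priced per month across all replicas."""
    waste_per_replica = max(0.0, request_cores - p95_cores * HEADROOM)
    return waste_per_replica * replicas * COST_PER_CORE_HOUR * HOURS_PER_MONTH


# 1-core request, 0.1-core p95, 4 replicas (replica count assumed for illustration)
savings = monthly_cpu_savings(request_cores=1.0, p95_cores=0.1, replicas=4)
print(f"${savings:.2f}/month")  # prints $123.34/month
```

The 0.88 cores of waste per replica drives the number. With 20 replicas instead of 4, the same oversizing comes to roughly $617/month.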
```python
# cost_calculator.py
import math

# Illustrative AWS-style on-demand rates; replace with your provider's pricing.
# If the per-core rate is derived from a whole instance price, adding a separate
# memory rate partly double-counts, so treat the results as estimates.
COST_PER_CORE_HOUR = 0.048  # ~$0.048/vCPU/hour (m5.xlarge average)
COST_PER_GB_HOUR = 0.006    # ~$0.006/GiB RAM/hour
HOURS_PER_MONTH = 730
HEADROOM = 1.2              # size requests at p95 + 20%


def bytes_to_gb(b: float) -> float:
    """Bytes to GiB (1024 ** 3)."""
    return b / (1024 ** 3)


def _ceil(x: float) -> int:
    # Round first so float noise (e.g. 120.00000000000001) doesn't bump the result by 1
    return math.ceil(round(x, 6))


def calculate_waste(metrics) -> dict:
    """metrics: a WorkloadMetrics from prometheus_client.py (per-replica values)."""
    # Over-provisioned CPU per replica: request minus (p95 + 20% headroom)
    cpu_waste_cores = max(0.0, metrics.cpu_request_cores - metrics.p95_cpu_cores * HEADROOM)
    # Over-provisioned memory per replica, in GiB
    mem_waste_gb = max(0.0, bytes_to_gb(metrics.memory_request_bytes)
                       - bytes_to_gb(metrics.p95_memory_bytes) * HEADROOM)

    # Monthly savings = waste × replicas × hourly rate × 730 hours/month
    cpu_savings_monthly = cpu_waste_cores * metrics.replicas * COST_PER_CORE_HOUR * HOURS_PER_MONTH
    mem_savings_monthly = mem_waste_gb * metrics.replicas * COST_PER_GB_HOUR * HOURS_PER_MONTH

    # Recommended values, rounded UP to whole millicores / MiB (never below 1m / 1Mi)
    cpu_request_m = max(1, _ceil(metrics.p95_cpu_cores * HEADROOM * 1000))
    cpu_limit_m = max(cpu_request_m, _ceil(metrics.p95_cpu_cores * 2.0 * 1000))  # 2x p95 for burst
    mem_request_mi = max(1, _ceil(metrics.p95_memory_bytes * HEADROOM / 1024 ** 2))
    mem_limit_mi = max(mem_request_mi, _ceil(metrics.max_memory_bytes * 1.5 / 1024 ** 2))

    return {
        "cpu_waste_cores": cpu_waste_cores,
        "mem_waste_gb": mem_waste_gb,
        "monthly_savings_usd": cpu_savings_monthly + mem_savings_monthly,
        "recommended_cpu_request": f"{cpu_request_m}m",
        "recommended_cpu_limit": f"{cpu_limit_m}m",
        "recommended_mem_request": f"{mem_request_mi}Mi",
        "recommended_mem_limit": f"{mem_limit_mi}Mi",
        "utilization_pct": (metrics.avg_cpu_cores / metrics.cpu_request_cores * 100)
                           if metrics.cpu_request_cores > 0 else 0.0,
    }
```

Step 3: Claude API Analysis
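The `optimizer.py` listing below sends every deployment to Claude, and on a large cluster that table can run to hundreds of rows. If that becomes a problem, one option is to trim rows before formatting them. Keep the biggest savings, and always keep workloads running close to their requests. Those are exactly the under-provisioned cases the prompt asks Claude to flag, and they usually show near-zero savings, so a plain top-N cut would drop them. In the sketch below, `Row`, `select_rows`, `top_n=25` and the 0.9 ratio are illustrative assumptions rather than part of `optimizer.py`.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    monthly_savings_usd: float
    p95_to_request: float  # p95 usage / request; above 1.0 means usage exceeds the request


def select_rows(rows: list[Row], top_n: int = 25, hot_ratio: float = 0.9) -> list[Row]:
    """Keep the top_n rows by savings, plus any other row at or above hot_ratio."""
    by_savings = sorted(rows, key=lambda r: r.monthly_savings_usd, reverse=True)
    hot_rest = [r for r in by_savings[top_n:] if r.p95_to_request >= hot_ratio]
    return by_savings[:top_n] + hot_rest
```

Deterministic trimming also keeps the prompt size, and therefore the API cost per run, roughly constant as the cluster grows.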
```python
# optimizer.py
import os
from typing import Optional

from anthropic import Anthropic
from dotenv import load_dotenv
from tabulate import tabulate

from prometheus_client import PrometheusClient
from cost_calculator import calculate_waste, bytes_to_gb

load_dotenv()  # pick up ANTHROPIC_API_KEY, PROMETHEUS_URL etc. from a local .env
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def format_metrics_for_claude(metrics_list: list, waste_data: list) -> str:
    rows = []
    for m, w in zip(metrics_list, waste_data):
        rows.append([
            f"{m.namespace}/{m.deployment}",
            m.replicas,
            f"{m.cpu_request_cores:.3f}",
            f"{m.cpu_limit_cores:.3f}",
            f"{m.avg_cpu_cores:.3f}",
            f"{m.p95_cpu_cores:.3f}",
            f"{bytes_to_gb(m.memory_request_bytes):.2f}",
            f"{bytes_to_gb(m.memory_limit_bytes):.2f}",
            f"{bytes_to_gb(m.avg_memory_bytes):.2f}",
            f"{bytes_to_gb(m.p95_memory_bytes):.2f}",
            f"{w['utilization_pct']:.1f}%",
            f"${w['monthly_savings_usd']:.2f}",
        ])
    # Limits are included so Claude can judge throttling/OOM risk; 0.000 = not set
    return tabulate(rows, headers=[
        "Deployment", "Replicas", "CPU Req", "CPU Lim", "CPU Avg", "CPU p95",
        "Mem Req GiB", "Mem Lim GiB", "Mem Avg GiB", "Mem p95 GiB",
        "CPU Util%", "Est. Savings/mo",
    ], tablefmt="pipe")


def generate_kubectl_command(m, w) -> str:
    # Without -c/--containers this applies the values to EVERY container in the pod.
    # The figures are pod-level sums, so for multi-container pods add -c <container>
    # and split them yourself. Each command triggers a rolling update.
    return f"""kubectl set resources deployment {m.deployment} \\
  -n {m.namespace} \\
  --requests=cpu={w['recommended_cpu_request']},memory={w['recommended_mem_request']} \\
  --limits=cpu={w['recommended_cpu_limit']},memory={w['recommended_mem_limit']}"""


def analyze_with_claude(metrics_list: list) -> str:
    waste_data = [calculate_waste(m) for m in metrics_list]
    total_savings = sum(w["monthly_savings_usd"] for w in waste_data)
    metrics_table = format_metrics_for_claude(metrics_list, waste_data)
    kubectl_commands = "\n\n".join(
        f"# {m.namespace}/{m.deployment} (saves ${w['monthly_savings_usd']:.2f}/mo)\n"
        f"{generate_kubectl_command(m, w)}"
        for m, w in zip(metrics_list, waste_data)
        if w["monthly_savings_usd"] > 10  # only show significant savings
    ) or "(none above $10/month)"

    prompt = f"""You are a senior FinOps engineer specializing in Kubernetes cost optimization.

Analyze the following Kubernetes workload metrics from the last 24 hours (CPU in cores, memory in GiB, per replica):

{metrics_table}

Total estimated monthly savings if all recommendations applied: ${total_savings:.2f}

Generated kubectl commands for high-impact changes:

{kubectl_commands}

Provide:
1. Executive summary (2-3 sentences on overall cluster efficiency)
2. Top 3-5 highest-impact rightsizing opportunities with specific reasoning
3. Any concerning under-provisioned workloads: CPU usage near the CPU limit (throttling risk), memory near the memory limit (OOMKill risk), or memory above its request (eviction risk under node memory pressure)
4. Recommended implementation order (what to change first, what to monitor)
5. Any patterns you notice (e.g., "staging namespace is consistently over-provisioned by 5x")

Be specific with numbers. Flag anything that looks unusual or risky.
"""

    response = client.messages.create(
        model="claude-sonnet-4-6",  # use any Claude model ID available to your account
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def run_optimization(prometheus_url: str, namespace: Optional[str] = None):
    prom = PrometheusClient(prometheus_url)
    print("Fetching metrics from Prometheus...")
    metrics = prom.get_all_deployment_metrics(namespace=namespace)
    if not metrics:
        print("No deployment metrics found. Check Prometheus connectivity and kube-state-metrics.")
        return

    print(f"Analyzing {len(metrics)} deployments...")
    analysis = analyze_with_claude(metrics)

    print("\n" + "=" * 60)
    print("KUBERNETES COST OPTIMIZATION REPORT")
    print("=" * 60)
    print(analysis)

    # Print the calculator's own total next to Claude's narrative, so the headline
    # number does not depend on the model repeating it correctly
    total = sum(calculate_waste(m)["monthly_savings_usd"] for m in metrics)
    print(f"\n{'=' * 60}")
    print(f"ESTIMATED TOTAL MONTHLY SAVINGS: ${total:.2f}")
    print("=" * 60)


if __name__ == "__main__":
    run_optimization(
        prometheus_url=os.getenv("PROMETHEUS_URL", "http://localhost:9090"),
        namespace=os.getenv("TARGET_NAMESPACE"),  # None = all namespaces
    )
```

Sample Output
============================================================
KUBERNETES COST OPTIMIZATION REPORT
============================================================
**Executive Summary**
Your cluster is running at approximately 18% CPU utilization across
all workloads — meaning 82% of requested CPU is never used. Memory
utilization is better at 45%, but several services have requests
set to 2x their actual peak usage. Total identified savings: $847/month.
**Top Rightsizing Opportunities**
1. **api-gateway (production)** — saves $312/month
CPU request: 2 cores → 0.2 cores (actual p95: 0.16 cores)
This was likely set during initial deployment and never reviewed.
With 8 replicas, this single change saves the most.
2. **worker-processor (production)** — saves $189/month
Memory request: 4GB → 512MB (actual p95: 380MB)
The 4GB request is 10x actual usage. Safe to reduce aggressively.
3. **image-resizer (staging)** — saves $94/month
CPU request: 1 core → 0.15 cores (actual p95: 0.12 cores)
Staging should mirror production sizing — right-size staging first.
**Under-Provisioned Workloads (Immediate Risk)**
⚠️ **payment-service (production)** — CPU THROTTLING RISK
CPU p95 usage (0.94 cores) is at 94% of request (1 core).
Under traffic spikes, this service will be CPU throttled.
Recommend increasing CPU limit to 2 cores minimum.
⚠️ **auth-service (production)** — OOM RISK
Memory p95 (1.85GB) is at 92% of request (2GB).
One memory spike will trigger OOMKilled. Increase to 3GB.
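The under-provisioning warnings above are Claude's reading of the table, so you may want to cross-check them deterministically. In Kubernetes, CPU throttling happens when a container reaches its CPU limit, and an OOMKill happens when it exceeds its memory limit. Memory usage above the request makes a pod a likelier eviction candidate when the node runs short of memory. The sketch below flags all three from per-replica numbers. `flag_risks` and the 0.9 threshold are assumptions, not part of the optimizer.

```python
def flag_risks(name: str, p95_cpu: float, cpu_limit: float,
               max_mem: float, mem_request: float, mem_limit: float,
               threshold: float = 0.9) -> list[str]:
    """Return warnings for one workload; a limit or request of 0 means 'not set' and is skipped.

    CPU values are in cores, memory values in bytes.
    """
    warnings = []
    if cpu_limit > 0 and p95_cpu >= threshold * cpu_limit:
        warnings.append(f"{name}: CPU throttling risk (p95 {p95_cpu:.2f} of {cpu_limit:.2f}-core limit)")
    if mem_limit > 0 and max_mem >= threshold * mem_limit:
        warnings.append(f"{name}: OOMKill risk (peak {max_mem / 2**30:.2f} of {mem_limit / 2**30:.2f} GiB limit)")
    if mem_request > 0 and max_mem > mem_request:
        warnings.append(f"{name}: eviction risk under node memory pressure (peak above request)")
    return warnings
```

The `WorkloadMetrics` fields from Step 1 map directly onto the arguments: `flag_risks(f"{m.namespace}/{m.deployment}", m.p95_cpu_cores, m.cpu_limit_cores, m.max_memory_bytes, m.memory_request_bytes, m.memory_limit_bytes)`. A p95 of 5-minute CPU averages can hide short bursts. If your Prometheus scrapes it, cAdvisor's `container_cpu_cfs_throttled_periods_total` (relative to `container_cpu_cfs_periods_total`) is a more direct throttling signal.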
Running as a Weekly CronJob
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-optimizer
  namespace: monitoring
spec:
  schedule: "0 9 * * 1"  # Every Monday at 9 AM (controller's time zone unless spec.timeZone is set)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: optimizer
              image: your-registry/cost-optimizer:latest
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus-operated.monitoring:9090"
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ai-secrets
                      key: anthropic-api-key
                - name: SLACK_WEBHOOK_URL  # optimizer.py above does not read this yet
                  valueFrom:
                    secretKeyRef:
                      name: ai-secrets
                      key: slack-webhook
          restartPolicy: OnFailure
```

The weekly cadence matters: actual resource usage drifts as teams change their workloads, while the requests they once set rarely get revisited.
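The CronJob passes `SLACK_WEBHOOK_URL` to the container, but `optimizer.py` as written only prints its report. Below is a minimal sketch of posting the report to a Slack incoming webhook, which accepts a JSON body with a `text` field. `build_payloads` and `post_report` are hypothetical helpers you could call at the end of `run_optimization`. The 3,000-character chunk size is an arbitrary choice to keep each message short, not a documented Slack limit.

```python
import json
import os
import urllib.request
from typing import Optional


def build_payloads(report: str, chunk_size: int = 3000) -> list[dict]:
    """Split a report on line boundaries into webhook payloads of at most chunk_size chars.

    A single line longer than chunk_size becomes its own (oversized) chunk.
    """
    chunks, current = [], ""
    for line in report.splitlines(keepends=True):
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return [{"text": chunk} for chunk in chunks]


def post_report(report: str, webhook_url: Optional[str] = None) -> None:
    """POST each chunk to the Slack incoming webhook; no-op if no URL is configured."""
    url = webhook_url or os.getenv("SLACK_WEBHOOK_URL")
    if not url:
        return
    for payload in build_payloads(report):
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Non-2xx responses raise urllib.error.HTTPError
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()
```

Splitting on line boundaries keeps markdown such as `**bold**` headings and numbered items intact in each message, which a fixed-width slice could cut in half.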