Build a Kubernetes Cost Optimizer with Claude API + Prometheus Metrics

Your Kubernetes cluster is probably wasting 40-60% of its compute cost on over-provisioned resources. Build an AI-powered cost optimizer that reads Prometheus metrics and gives specific rightsizing recommendations.

DevOpsBoys | Jun 13, 2026 | 7 min read

Over-provisioned resources are commonly estimated to account for 40-60% of Kubernetes compute costs. An engineer sets a CPU request of 1 core for a service that actually uses 0.1 cores, and it stays that way for years because nobody goes back to check.

Let's build an AI optimizer that reads your actual Prometheus metrics and generates specific, actionable rightsizing recommendations — with cost savings estimates.

What We're Building

Prometheus API → Real usage data → Claude API → Specific recommendations + savings estimate

The optimizer will:

  1. Query Prometheus for actual CPU and memory usage per deployment
  2. Compare actual usage against the configured requests and limits
  3. Identify over-provisioned and under-provisioned resources
  4. Calculate potential cost savings
  5. Generate specific kubectl commands to implement the recommendations
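
Steps 2 and 3 come down to comparing a usage percentile against the configured request. A minimal sketch of that comparison, assuming illustrative cutoffs (below 50% of the request counts as over-provisioned, above 90% as under-provisioned). The function name and both thresholds are hypothetical and are not part of the optimizer built below:

```python
def classify_provisioning(p95_usage: float, request: float,
                          low: float = 0.5, high: float = 0.9) -> str:
    """Label a workload by how much of its request the p95 usage consumes.

    `low` and `high` are assumed cutoffs chosen for illustration only.
    """
    if request <= 0:
        return "no-request-set"
    ratio = p95_usage / request
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "ok"


if __name__ == "__main__":
    # 0.1 cores used against a 1-core request
    print(classify_provisioning(0.1, 1.0))
    # 0.94 cores used against a 1-core request
    print(classify_provisioning(0.94, 1.0))
```

Tune the cutoffs per workload type: latency-sensitive services usually want more headroom than batch jobs.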

Prerequisites

bash
pip install anthropic requests python-dotenv tabulate

You need:

  • Prometheus running in your cluster (accessible via HTTP)
  • Anthropic API key
  • kubectl access to your cluster

Step 1: Prometheus Query Engine

python
# prometheus_client.py
import requests
from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timedelta
 
@dataclass
class WorkloadMetrics:
    namespace: str
    deployment: str
    
    # Actual usage (from Prometheus)
    avg_cpu_cores: float
    p95_cpu_cores: float
    max_cpu_cores: float
    
    avg_memory_bytes: float
    p95_memory_bytes: float
    max_memory_bytes: float
    
    # Configured (from kube_* metrics)
    cpu_request_cores: float
    cpu_limit_cores: float
    memory_request_bytes: float
    memory_limit_bytes: float
    
    replicas: int
 
class PrometheusClient:
    def __init__(self, url: str):
        self.url = url.rstrip('/')
    
    def query(self, promql: str) -> list:
        resp = requests.get(
            f"{self.url}/api/v1/query",
            params={"query": promql},
            timeout=30
        )
        resp.raise_for_status()
        data = resp.json()
        return data.get("data", {}).get("result", [])
    
    def query_range(self, promql: str, hours: int = 24) -> list:
        end = datetime.now()
        start = end - timedelta(hours=hours)
        resp = requests.get(
            f"{self.url}/api/v1/query_range",
            params={
                "query": promql,
                "start": start.timestamp(),
                "end": end.timestamp(),
                "step": "5m"
            },
            timeout=60
        )
        resp.raise_for_status()
        data = resp.json()
        return data.get("data", {}).get("result", [])
    
    def get_all_deployment_metrics(self, namespace: Optional[str] = None) -> list[WorkloadMetrics]:
        ns_filter = f', namespace="{namespace}"' if namespace else ''
        
        # Average CPU usage over last 24h
        avg_cpu = self.query(
            f'avg by (namespace, pod) (rate(container_cpu_usage_seconds_total{{container!=""{ns_filter}}}[24h]))'
        )
        
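        # p95 CPU usage per pod over the last 24h
        # (subquery: evaluate the 5m rate every 5 minutes, then take the 95th percentile over time)
        p95_cpu = self.query(
            f'quantile_over_time(0.95, (avg by (namespace, pod) (rate(container_cpu_usage_seconds_total{{container!=""{ns_filter}}}[5m])))[24h:5m])'
        )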
        
        # CPU requests configured
        cpu_requests = self.query(
            f'avg by (namespace, pod) (kube_pod_container_resource_requests{{resource="cpu"{ns_filter}}})'
        )
        
        # Memory usage
        avg_mem = self.query(
            f'avg by (namespace, pod) (container_memory_working_set_bytes{{container!=""{ns_filter}}})'
        )
        
        # Memory requests
        mem_requests = self.query(
            f'avg by (namespace, pod) (kube_pod_container_resource_requests{{resource="memory"{ns_filter}}})'
        )
        
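        # Peak and percentile series: each subquery evaluates the per-pod expression
        # every 5 minutes over the last 24h, then aggregates over time
        cpu_rate = f'avg by (namespace, pod) (rate(container_cpu_usage_seconds_total{{container!=""{ns_filter}}}[5m]))'
        mem_used = f'avg by (namespace, pod) (container_memory_working_set_bytes{{container!=""{ns_filter}}})'
        max_cpu = self.query(f'max_over_time(({cpu_rate})[24h:5m])')
        p95_mem = self.query(f'quantile_over_time(0.95, ({mem_used})[24h:5m])')
        max_mem = self.query(f'max_over_time(({mem_used})[24h:5m])')
        
        # Configured limits (kube-state-metrics)
        cpu_limits = self.query(
            f'avg by (namespace, pod) (kube_pod_container_resource_limits{{resource="cpu"{ns_filter}}})'
        )
        mem_limits = self.query(
            f'avg by (namespace, pod) (kube_pod_container_resource_limits{{resource="memory"{ns_filter}}})'
        )
        
        # Build deployment-level metrics
        # (simplified: in practice you'd join these queries by pod → deployment)
        deployments = self.query(
            f'kube_deployment_spec_replicas{{{ns_filter.lstrip(",").strip()}}}'
        )
        
        results = []
        for dep in deployments:
            ns = dep["metric"]["namespace"]
            name = dep["metric"]["deployment"]
            replicas = int(dep["value"][1])
            
            # Look up each value from its own query; the last argument is the
            # fallback used when Prometheus returned no matching series
            results.append(WorkloadMetrics(
                namespace=ns,
                deployment=name,
                avg_cpu_cores=self._find_value(avg_cpu, ns, name, 0.05),
                p95_cpu_cores=self._find_value(p95_cpu, ns, name, 0.1),
                max_cpu_cores=self._find_value(max_cpu, ns, name, 0.15),
                avg_memory_bytes=self._find_value(avg_mem, ns, name, 50_000_000),
                p95_memory_bytes=self._find_value(p95_mem, ns, name, 100_000_000),
                max_memory_bytes=self._find_value(max_mem, ns, name, 150_000_000),
                cpu_request_cores=self._find_value(cpu_requests, ns, name, 0.5),
                cpu_limit_cores=self._find_value(cpu_limits, ns, name, 1.0),
                memory_request_bytes=self._find_value(mem_requests, ns, name, 256_000_000),
                memory_limit_bytes=self._find_value(mem_limits, ns, name, 512_000_000),
                replicas=replicas
            ))
        return results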
    
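    def _find_value(self, results: list, namespace: str, deployment: str, default: float) -> float:
        # Heuristic join: Deployment pods are named <deployment>-<replicaset-hash>-<suffix>,
        # so dropping the last two dash-separated parts recovers the deployment name.
        # Series without a pod label are matched on namespace only.
        values = []
        for r in results:
            labels = r.get("metric", {})
            pod = labels.get("pod")
            if labels.get("namespace") != namespace or (pod is not None and pod.rsplit("-", 2)[0] != deployment):
                continue
            try:
                values.append(float(r["value"][1]))
            except (KeyError, IndexError, ValueError):
                pass
        # Average across the deployment's pods; fall back to the default if nothing matched
        return sum(values) / len(values) if values else default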
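
The client defines `query_range`, which returns a matrix: each series carries a `values` list of `[timestamp, "value"]` pairs. One way to get the average, p95, and max from that response is to summarize the samples client-side. A minimal sketch, assuming the nearest-rank percentile method; `summarize_series` is a hypothetical helper and is not part of the client above:

```python
import math


def summarize_series(series: dict) -> dict:
    """Compute avg, p95 (nearest-rank), and max for one query_range series."""
    samples = []
    for _timestamp, raw in series.get("values", []):
        value = float(raw)          # Prometheus encodes sample values as strings
        if not math.isnan(value):   # skip NaN samples
            samples.append(value)
    if not samples:
        return {"avg": 0.0, "p95": 0.0, "max": 0.0}
    samples.sort()
    # Nearest-rank percentile: ceil(0.95 * n), computed in integer arithmetic
    rank = (95 * len(samples) + 99) // 100
    return {
        "avg": sum(samples) / len(samples),
        "p95": samples[rank - 1],
        "max": samples[-1],
    }


if __name__ == "__main__":
    # Hypothetical series shaped like one entry of a query_range result
    example = {
        "metric": {"namespace": "production", "pod": "api-gateway-7d9f8b6c5-abcde"},
        "values": [[1700000000 + 300 * i, str(v)] for i, v in enumerate(
            [0.10, 0.12, 0.11, 0.30, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10]
        )],
    }
    print(summarize_series(example))
```

Summarizing client-side costs more bandwidth than a PromQL subquery, but one range query then yields all three statistics.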

Step 2: Cost Calculator

python
# cost_calculator.py
 
# AWS EC2 approximate costs per core/GB per hour (adjust for your cloud)
COST_PER_CORE_HOUR = 0.048   # ~$0.048/vCPU/hour (m5.xlarge average)
COST_PER_GB_HOUR = 0.006     # ~$0.006/GB RAM/hour
 
def bytes_to_gb(b: float) -> float:
    return b / (1024 ** 3)
 
def calculate_waste(metrics) -> dict:
    # CPU waste per replica per hour
    cpu_waste_cores = metrics.cpu_request_cores - (metrics.p95_cpu_cores * 1.2)  # 20% headroom
    cpu_waste_cores = max(0, cpu_waste_cores)
    
    # Memory waste per replica per hour
    mem_waste_gb = bytes_to_gb(metrics.memory_request_bytes) - (bytes_to_gb(metrics.p95_memory_bytes) * 1.2)
    mem_waste_gb = max(0, mem_waste_gb)
    
    # Monthly savings = waste per replica * replicas * hourly rate * 730 hours/month
    cpu_savings_monthly = cpu_waste_cores * metrics.replicas * COST_PER_CORE_HOUR * 730
    mem_savings_monthly = mem_waste_gb * metrics.replicas * COST_PER_GB_HOUR * 730
    
    # Recommended values (p95 + 20% headroom; CPU rounded to 3 decimals, memory truncated to whole MiB)
    recommended_cpu_request = round(metrics.p95_cpu_cores * 1.2, 3)
    recommended_cpu_limit = round(metrics.p95_cpu_cores * 2.0, 3)  # 2x p95 for burst
    recommended_mem_request_mb = int(bytes_to_gb(metrics.p95_memory_bytes) * 1.2 * 1024)
    recommended_mem_limit_mb = int(bytes_to_gb(metrics.max_memory_bytes) * 1.5 * 1024)
    
    return {
        "cpu_waste_cores": cpu_waste_cores,
        "mem_waste_gb": mem_waste_gb,
        "monthly_savings_usd": cpu_savings_monthly + mem_savings_monthly,
        "recommended_cpu_request": f"{recommended_cpu_request}",
        "recommended_cpu_limit": f"{recommended_cpu_limit}",
        "recommended_mem_request": f"{recommended_mem_request_mb}Mi",
        "recommended_mem_limit": f"{recommended_mem_limit_mb}Mi",
        "utilization_pct": (metrics.avg_cpu_cores / metrics.cpu_request_cores * 100) if metrics.cpu_request_cores > 0 else 0
    }
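
To make the arithmetic concrete, here is the CPU half of the waste formula worked through for the example from the introduction: a 1-core request against 0.1 cores of p95 usage. The replica count (3) is an assumed value for illustration, and `monthly_cpu_savings` is a hypothetical standalone helper; the hourly rate and the 730 hours/month figure match the constants above.

```python
COST_PER_CORE_HOUR = 0.048   # same illustrative rate as cost_calculator.py
HOURS_PER_MONTH = 730


def monthly_cpu_savings(request_cores: float, p95_cores: float, replicas: int,
                        headroom: float = 1.2) -> float:
    """Monthly USD recovered by shrinking the CPU request to p95 * headroom."""
    target = p95_cores * headroom
    wasted_cores = max(0.0, request_cores - target)
    return wasted_cores * replicas * COST_PER_CORE_HOUR * HOURS_PER_MONTH


if __name__ == "__main__":
    # 1.0 requested - (0.1 * 1.2) = 0.88 wasted cores per replica
    # 0.88 * 3 replicas * $0.048 * 730 h = about $92.51 per month
    savings = monthly_cpu_savings(request_cores=1.0, p95_cores=0.1, replicas=3)
    print(f"${savings:.2f}/month")
```

The `max(0.0, ...)` clamp matters: a workload already running above its request contributes zero savings instead of a negative number.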

Step 3: Claude API Analysis

python
# optimizer.py
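import os
import json
from anthropic import Anthropic
from dotenv import load_dotenv
from tabulate import tabulate
# Local module from Step 1 (not the PyPI prometheus_client package)
from prometheus_client import PrometheusClient
from cost_calculator import calculate_waste, bytes_to_gb
 
# Load ANTHROPIC_API_KEY, PROMETHEUS_URL, and TARGET_NAMESPACE from a local .env file if present
load_dotenv()
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))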
 
def format_metrics_for_claude(metrics_list: list, waste_data: list) -> str:
    rows = []
    for m, w in zip(metrics_list, waste_data):
        rows.append([
            f"{m.namespace}/{m.deployment}",
            m.replicas,
            f"{m.cpu_request_cores:.3f}",
            f"{m.avg_cpu_cores:.3f}",
            f"{m.p95_cpu_cores:.3f}",
            f"{bytes_to_gb(m.memory_request_bytes):.2f}GB",
            f"{bytes_to_gb(m.avg_memory_bytes):.2f}GB",
            f"{w['utilization_pct']:.1f}%",
            f"${w['monthly_savings_usd']:.2f}"
        ])
    
    return tabulate(rows, headers=[
        "Deployment", "Replicas", "CPU Req", "CPU Avg", "CPU p95",
        "Mem Req", "Mem Avg", "CPU Util%", "Est. Savings/mo"
    ], tablefmt="pipe")
 
def generate_kubectl_command(m, w) -> str:
    return f"""kubectl set resources deployment {m.deployment} \\
  -n {m.namespace} \\
  --requests=cpu={w['recommended_cpu_request']},memory={w['recommended_mem_request']} \\
  --limits=cpu={w['recommended_cpu_limit']},memory={w['recommended_mem_limit']}"""
 
def analyze_with_claude(metrics_list: list) -> str:
    waste_data = [calculate_waste(m) for m in metrics_list]
    total_savings = sum(w["monthly_savings_usd"] for w in waste_data)
    
    metrics_table = format_metrics_for_claude(metrics_list, waste_data)
    
    kubectl_commands = "\n\n".join([
        f"# {m.namespace}/{m.deployment} (saves ${w['monthly_savings_usd']:.2f}/mo)\n{generate_kubectl_command(m, w)}"
        for m, w in zip(metrics_list, waste_data)
        if w["monthly_savings_usd"] > 10  # Only show significant savings
    ])
    
    prompt = f"""You are a senior FinOps engineer specializing in Kubernetes cost optimization.
 
Analyze the following Kubernetes workload metrics from the last 24 hours:
 
{metrics_table}
 
Total estimated monthly savings if all recommendations applied: ${total_savings:.2f}
 
Generated kubectl commands for high-impact changes:
{kubectl_commands}
 
Provide:
1. Executive summary (2-3 sentences on overall cluster efficiency)
2. Top 3-5 highest-impact rightsizing opportunities with specific reasoning
3. Any concerning under-provisioned workloads (where actual usage is close to or exceeding requests — risk of OOM or CPU throttling)
4. Recommended implementation order (what to change first, what to monitor)
5. Any patterns you notice (e.g., "staging namespace is consistently over-provisioned by 5x")
 
Be specific with numbers. Flag anything that looks unusual or risky.
"""
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text
 
def run_optimization(prometheus_url: str, namespace: str = None):
    prom = PrometheusClient(prometheus_url)
    
    print("Fetching metrics from Prometheus...")
    metrics = prom.get_all_deployment_metrics(namespace=namespace)
    
    if not metrics:
        print("No deployment metrics found. Check Prometheus connectivity.")
        return
    
    print(f"Analyzing {len(metrics)} deployments...")
    analysis = analyze_with_claude(metrics)
    
    print("\n" + "="*60)
    print("KUBERNETES COST OPTIMIZATION REPORT")
    print("="*60)
    print(analysis)
    
    # Recompute the savings total for the summary footer
    waste_data = [calculate_waste(m) for m in metrics]
    total = sum(w["monthly_savings_usd"] for w in waste_data)
    
    print(f"\n{'='*60}")
    print(f"ESTIMATED TOTAL MONTHLY SAVINGS: ${total:.2f}")
    print(f"{'='*60}")
 
if __name__ == "__main__":
    run_optimization(
        prometheus_url=os.getenv("PROMETHEUS_URL", "http://localhost:9090"),
        namespace=os.getenv("TARGET_NAMESPACE")  # None = all namespaces
    )
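
The optimizer passes the kubectl commands to Claude but does not save them anywhere. One option is to write them to a shell script that a human reviews before running. A sketch under that assumption: `write_review_script` and the default file name are hypothetical, and the commands are assumed to be strings like those returned by `generate_kubectl_command`.

```python
import os
import stat
from typing import Iterable


def write_review_script(commands: Iterable[str], path: str = "rightsizing.sh") -> int:
    """Write kubectl commands to a shell script; return how many were written."""
    commands = [c.strip() for c in commands if c and c.strip()]
    lines = [
        "#!/usr/bin/env bash",
        "# Review every command before running; nothing here is applied automatically.",
        "set -euo pipefail",
        "",
    ]
    for command in commands:
        lines.append(command)
        lines.append("")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines))
    # Make the script executable for the owner
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return len(commands)
```

Keeping the apply step manual is deliberate here: lowering a request restarts pods, so each change deserves a look before it reaches production.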

Sample Output

============================================================
KUBERNETES COST OPTIMIZATION REPORT
============================================================

**Executive Summary**

Your cluster is running at approximately 18% CPU utilization across 
all workloads — meaning 82% of requested CPU is never used. Memory 
utilization is better at 45%, but several services have requests 
set to 2x their actual peak usage. Total identified savings: $847/month.

**Top Rightsizing Opportunities**

1. **api-gateway (production)** — saves $312/month
   CPU request: 2 cores → 0.2 cores (actual p95: 0.16 cores)
   This was likely set during initial deployment and never reviewed.
   With 8 replicas, this single change saves the most.

2. **worker-processor (production)** — saves $189/month  
   Memory request: 4GB → 512MB (actual p95: 380MB)
   The 4GB request is 10x actual usage. Safe to reduce aggressively.

3. **image-resizer (staging)** — saves $94/month
   CPU request: 1 core → 0.15 cores (actual p95: 0.12 cores)
   Staging should mirror production sizing — right-size staging first.

**Under-Provisioned Workloads (Immediate Risk)**

⚠️  **payment-service (production)** — CPU THROTTLING RISK
   CPU p95 usage (0.94 cores) is at 94% of request (1 core).
   Under traffic spikes, this service will be CPU throttled.
   Recommend increasing CPU limit to 2 cores minimum.

⚠️  **auth-service (production)** — OOM RISK  
   Memory p95 (1.85GB) is at 92% of request (2GB).
   One memory spike will trigger OOMKilled. Increase to 3GB.

Running as a Weekly CronJob

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-optimizer
  namespace: monitoring
spec:
  schedule: "0 9 * * 1"  # Every Monday at 9 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: optimizer
            image: your-registry/cost-optimizer:latest
            env:
            - name: PROMETHEUS_URL
              value: "http://prometheus-operated.monitoring:9090"
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-secrets
                  key: anthropic-api-key
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: ai-secrets
                  key: slack-webhook
          restartPolicy: OnFailure

The weekly cadence matters — resources drift over time as teams adjust workloads but forget to tune resource requests.
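
The CronJob injects SLACK_WEBHOOK_URL, but optimizer.py as written only prints the report. A minimal sketch of posting it, assuming a Slack incoming webhook (which accepts a JSON body with a `text` field). The function names and the 3,000-character cap are assumptions for illustration:

```python
import json
import urllib.request


def build_slack_payload(report: str, total_savings: float, max_chars: int = 3000) -> dict:
    """Build the JSON body for a Slack incoming webhook, truncating long reports."""
    header = f"Kubernetes cost report: estimated savings ${total_savings:.2f}/month"
    body = report if len(report) <= max_chars else report[:max_chars] + "\n[truncated]"
    return {"text": f"{header}\n{body}"}


def post_to_slack(webhook_url: str, payload: dict, timeout: int = 10) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A natural place to call this is the end of `run_optimization`, guarded by a check that `os.getenv("SLACK_WEBHOOK_URL")` is set, so local runs without the variable still just print.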

Calculate proper resource sizing before deploying: Kubernetes Resource Calculator
