
Build an AI Capacity Forecasting Tool with Prophet + Kubernetes Metrics

Reactive autoscaling fixes problems after they happen. Build a forecasting tool using Facebook's Prophet library on historical Prometheus metrics to predict capacity needs days ahead — before traffic spikes hit.

DevOpsBoys · Jun 16, 2026 · 4 min read

Karpenter and HPA scale reactively — they respond to load that's already arrived. For predictable patterns (weekday traffic, month-end batch jobs, seasonal spikes), you can do better: forecast the load and pre-scale before it hits, avoiding the cold-start lag of reactive scaling entirely.

Prophet, originally built at Meta for business forecasting, handles seasonality (daily, weekly, yearly patterns) well with minimal tuning — a good fit for infrastructure metrics that follow human usage patterns.

Setup

bash
pip install prophet pandas prometheus-api-client

Step 1: Pull Historical Metrics from Prometheus

python
# fetch_metrics.py
from prometheus_api_client import PrometheusConnect
from datetime import datetime, timedelta
import pandas as pd
 
prom = PrometheusConnect(url="http://prometheus.monitoring:9090", disable_ssl=True)
 
def fetch_cpu_history(namespace: str, days: int = 60) -> pd.DataFrame:
    query = f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m]))'
    
    end_time = datetime.now()
    start_time = end_time - timedelta(days=days)
    
    result = prom.custom_query_range(
        query=query,
        start_time=start_time,
        end_time=end_time,
        step="1h"
    )
    
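    if not result:
        raise ValueError(f"Prometheus returned no samples for namespace {namespace!r}")

    # The query aggregates with sum(), so there is a single series at result[0]
    points = result[0]["values"]
    timestamps = [float(ts) for ts, _ in points]
    values = [float(value) for _, value in points]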
    
    df = pd.DataFrame({
        "ds": pd.to_datetime(timestamps, unit="s"),  # Prophet requires this exact column name
        "y": values                                    # and this one
    })
    
    return df

In practice, Prophet needs 4-6 weeks of history to detect weekly seasonality reliably; with less, forecast quality drops noticeably. The 60-day default above leaves some margin.
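One way to check that a fetched series meets that minimum before fitting is sketched below. `check_history` is a hypothetical helper, not part of Prophet or prometheus-api-client; it assumes one sample per query step and takes the already-parsed timestamps as input.

```python
from datetime import datetime, timedelta, timezone


def check_history(timestamps: list[datetime],
                  step: timedelta = timedelta(hours=1),
                  min_weeks: int = 4) -> dict:
    """Summarize how much usable history a list of sample timestamps covers."""
    if len(timestamps) < 2:
        return {"span_days": 0.0, "missing_samples": 0, "enough_history": False}

    ordered = sorted(timestamps)
    span = ordered[-1] - ordered[0]

    # Samples we would expect if Prometheus returned one point per step
    expected = int(span / step) + 1
    missing = max(expected - len(ordered), 0)

    return {
        "span_days": round(span / timedelta(days=1), 1),
        "missing_samples": missing,
        "enough_history": span >= timedelta(weeks=min_weeks),
    }


# Example: 30 days of hourly samples with one 12-hour scrape outage
start = datetime(2026, 5, 1, tzinfo=timezone.utc)
samples = [start + timedelta(hours=h) for h in range(30 * 24)]
samples = [t for t in samples if not (100 <= (t - start) / timedelta(hours=1) < 112)]
report = check_history(samples)
print(report)
```

Prophet can fit a series with missing rows, so gaps do not need to be filled, but a large `missing_samples` count is worth investigating before trusting the forecast.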

Step 2: Build and Run the Forecast

python
import pandas as pd
from prophet import Prophet
 
def forecast_capacity(df: pd.DataFrame, forecast_days: int = 7) -> pd.DataFrame:
    # Known events that break normal patterns: sales, releases, etc.
    holidays = pd.DataFrame({
        "holiday": ["diwali_sale", "year_end_release"],
        "ds": pd.to_datetime(["2026-10-20", "2026-12-28"]),
        "lower_window": 0,
        "upper_window": 3,            # effect extends 3 days past each event date
    })
    
    model = Prophet(
        holidays=holidays,            # the documented way to register custom events
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,     # usually not enough history to trust this
        changepoint_prior_scale=0.05  # lower = smoother trend, less overfit to noise
    )
    
    model.fit(df)
    
    # Extend the hourly series forward; the result covers history plus the forecast window
    future = model.make_future_dataframe(periods=forecast_days * 24, freq="h")
    forecast = model.predict(future)
    
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]

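For capacity planning, yhat_upper matters more than yhat. yhat is Prophet's point forecast; yhat_upper is the upper bound of its uncertainty interval (80% wide by default, adjustable with interval_width). Size against the upper bound, because under-provisioning costs more than slightly over-provisioning.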
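To make the difference concrete, here is a small sketch that sizes a node pool from each value. The numbers, the `nodes_needed` helper, the 8-core node size, and the 80% target utilization are all illustrative assumptions, not values from a real forecast.

```python
import math


def nodes_needed(cpu_cores: float, cores_per_node: float,
                 target_utilization: float = 0.8) -> int:
    """Nodes required to serve cpu_cores while staying at or below target utilization."""
    return math.ceil(cpu_cores / (cores_per_node * target_utilization))


# Hypothetical forecast row for one peak hour (same column names Prophet returns)
peak_hour = {"yhat": 41.0, "yhat_upper": 52.5}
cores_per_node = 8.0  # assumption: allocatable cores on one node

by_point_forecast = nodes_needed(peak_hour["yhat"], cores_per_node)
by_upper_bound = nodes_needed(peak_hour["yhat_upper"], cores_per_node)
print(by_point_forecast, by_upper_bound)
```

In this made-up case the upper bound asks for two more nodes than the point forecast, which is the margin that absorbs forecast error.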

Step 3: Convert Forecast Into Actionable Scaling Recommendations

python
import pandas as pd

def generate_scaling_plan(forecast: pd.DataFrame, current_node_capacity_cpu: float) -> list[dict]:
    plan = []
    # Prometheus timestamps were parsed as naive UTC, so compare against naive UTC "now"
    now_utc = pd.Timestamp.now(tz="UTC").tz_localize(None)
    # .copy() so adding a column below does not write into a view of `forecast`
    future_only = forecast[forecast["ds"] > now_utc].copy()
    
    # Group by day, find peak predicted usage
    future_only["date"] = future_only["ds"].dt.date
    daily_peaks = future_only.groupby("date")["yhat_upper"].max()
    
    for date, peak_cpu in daily_peaks.items():
        peak_cpu = float(peak_cpu)
        utilization_pct = (peak_cpu / current_node_capacity_cpu) * 100
        
        if utilization_pct > 80:
            # Clamp at zero: between 80% and ~83% utilization, peak * 1.2 is still below capacity
            extra_cpu = max(peak_cpu * 1.2 - current_node_capacity_cpu, 0.0)
            plan.append({
                "date": str(date),
                "predicted_peak_cpu": round(peak_cpu, 2),
                "predicted_utilization_pct": round(utilization_pct, 1),
                "recommendation": "Pre-scale node pool before this date: predicted utilization exceeds 80%",
                "suggested_additional_capacity_cpu": round(extra_cpu, 2)
            })
    
    return plan
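The arithmetic behind each plan entry is easier to follow on single numbers. The sketch below applies the same 80% threshold and 1.2 headroom factor to three made-up daily peaks against 16 cores of capacity. `plan_entry` is a hypothetical helper written for this illustration, and it clamps the suggestion at zero, because between 80% and roughly 83% utilization the raw `peak * 1.2 - capacity` formula goes negative.

```python
from typing import Optional


def plan_entry(peak_cpu: float, capacity_cpu: float,
               threshold_pct: float = 80.0, headroom: float = 1.2) -> Optional[dict]:
    """Apply the Step 3 rule to one day's predicted peak."""
    utilization_pct = peak_cpu / capacity_cpu * 100
    if utilization_pct <= threshold_pct:
        return None  # enough room already, no recommendation

    # Extra cores needed so that capacity covers peak * headroom, never negative
    extra = max(peak_cpu * headroom - capacity_cpu, 0.0)
    return {
        "predicted_utilization_pct": round(utilization_pct, 1),
        "suggested_additional_capacity_cpu": round(extra, 2),
    }


quiet = plan_entry(peak_cpu=10.0, capacity_cpu=16.0)  # 62.5% of capacity
edge = plan_entry(peak_cpu=13.2, capacity_cpu=16.0)   # just over the threshold
busy = plan_entry(peak_cpu=14.0, capacity_cpu=16.0)   # 87.5% of capacity
print(quiet, edge, busy)
```

The quiet day produces no entry, the edge day is flagged but needs no extra cores once clamped, and the busy day asks for 0.8 additional cores.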

Step 4: Act on It — Pre-Scaling Karpenter NodePool Limits

python
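import json
import math
import subprocess
from datetime import date, timedelta
 
def pre_scale_for_forecast(plan: list[dict], days_ahead: int = 1):
    """Run this daily via cron; checks whether tomorrow needs pre-scaling."""
    tomorrow = str(date.today() + timedelta(days=days_ahead))
    
    for entry in plan:
        if entry["date"] == tomorrow:
            print(f"Pre-scaling for predicted load on {tomorrow}: "
                  f"+{entry['suggested_additional_capacity_cpu']} CPU needed")
            
            # Temporarily raise the Karpenter NodePool CPU limit ahead of the predicted spike.
            # This lifts the ceiling only; Karpenter still launches nodes when pods are pending.
            # Caution: the patch overwrites the limit even if the current one is already higher.
            cpu_limit = math.ceil(entry["predicted_peak_cpu"] * 1.3)  # whole cores, no float noise
            patch = json.dumps({"spec": {"limits": {"cpu": str(cpu_limit)}}})
            subprocess.run(
                ["kubectl", "patch", "nodepool", "default", "--type=merge", "-p", patch],
                check=True,  # fail loudly if kubectl returns an error
            )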
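Because a merge patch replaces whatever CPU limit is currently set, it can be worth building the patch in a separate step that refuses to lower an existing limit. The sketch below shows one way to do that. `build_limit_patch` is a hypothetical helper, and `current_limit_cpu` is assumed to have been read from the cluster beforehand (for example with `kubectl get nodepool default -o jsonpath='{.spec.limits.cpu}'`).

```python
import json
import math
from typing import Optional


def build_limit_patch(predicted_peak_cpu: float, current_limit_cpu: int,
                      headroom: float = 1.3) -> Optional[str]:
    """Return a JSON merge patch that raises spec.limits.cpu, or None if no raise is needed."""
    # Round up to whole cores so the quantity string stays simple
    target = math.ceil(predicted_peak_cpu * headroom)
    if target <= current_limit_cpu:
        return None  # never lower an existing limit
    return json.dumps({"spec": {"limits": {"cpu": str(target)}}})


raise_patch = build_limit_patch(predicted_peak_cpu=52.5, current_limit_cpu=64)
no_patch = build_limit_patch(predicted_peak_cpu=30.0, current_limit_cpu=64)
print(raise_patch, no_patch)
```

The returned string can be passed directly as the `-p` argument to `kubectl patch`; a `None` result means the current limit already covers the forecast.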

Validating Forecast Accuracy Before You Trust It

Don't act on forecasts you haven't validated against your own historical data first.

python
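import pandas as pd
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics
 
def validate_model(df: pd.DataFrame, model: Prophet):
    # `model` must already be fitted on `df`; cross_validation refits copies of it
    # on successive historical cutoffs and forecasts the following 7 days each time
    df_cv = cross_validation(model, initial="30 days", period="7 days", horizon="7 days")
    metrics = performance_metrics(df_cv)
    
    print(metrics[["horizon", "mape", "coverage"]].tail())
    # mape: mean absolute percentage error as a fraction (0.20 = 20%); lower is better
    # coverage: fraction of actual values that fell within the uncertainty interval;
    #           should be close to the interval width (0.8 by default)
    return metrics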

If MAPE is above 20-25% on your data, the seasonality patterns in your traffic may be too irregular for Prophet's defaults — try tuning changepoint_prior_scale, or consider that your workload might not have predictable enough patterns for this approach to add value over reactive autoscaling alone.
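For readers unfamiliar with the two metrics, the sketch below computes them by hand on four made-up points. These are the plain textbook definitions, not Prophet's implementation, which additionally aggregates over a rolling window of horizons. Both values come out as fractions, so the 20-25% guideline corresponds to 0.20-0.25.

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, as a fraction (0.10 means 10%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def coverage(actual: list[float], lower: list[float], upper: list[float]) -> float:
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Made-up observations and forecasts for four hours
actual = [10.0, 20.0, 40.0, 50.0]
predicted = [11.0, 18.0, 40.0, 45.0]
lower = [9.0, 15.0, 35.0, 40.0]
upper = [13.0, 19.0, 45.0, 49.0]

print(round(mape(actual, predicted), 3))
print(coverage(actual, lower, upper))
```

Here three of the four forecasts are off by 10% and one is exact, giving a MAPE of 7.5%, while only two of the four actual values fall inside their intervals, a coverage of 0.5 that would signal intervals too narrow for an 80% target.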

Where This Pays Off Most

Forecasting-based pre-scaling is most valuable for:

  • Batch/ETL jobs with known recurring schedules (month-end reports, daily reconciliation)
  • E-commerce traffic with predictable daily/weekly cycles and known sale events
  • B2B SaaS with strong business-hours seasonality (near-zero load nights/weekends)

It adds little value for workloads with genuinely random, event-driven spikes (breaking news traffic, viral social posts) — for those, fast reactive scaling (Karpenter's sub-minute provisioning) matters more than forecasting.
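A rough way to gauge which camp a workload falls into, before investing in a forecasting pipeline, is to measure how strongly the series correlates with itself one day or one week earlier. The sketch below is a generic autocorrelation check, not something Prophet provides; the `autocorrelation` helper and the synthetic series are assumptions made for illustration.

```python
import math


def autocorrelation(values: list[float], lag: int) -> float:
    """Pearson correlation between the series and itself shifted by `lag` samples."""
    x, y = values[:-lag], values[lag:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic hourly CPU series: a clean 24-hour cycle over two weeks
hours = range(14 * 24)
daily_cycle = [10 + 5 * math.sin(2 * math.pi * h / 24) for h in hours]

# Close to 1.0 for this idealized series; real traffic will score lower
print(round(autocorrelation(daily_cycle, lag=24), 3))
```

On real hourly data, lags of 24 and 168 are the natural ones to try. Values near zero at both suggest there is little daily or weekly structure for a seasonal model to learn.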

Compare this against reactive scaling: Karpenter vs Cluster Autoscaler
