
Build an AI Helm Values Validator with Claude API

Helm's schema validation catches type errors but misses logical mistakes — a memory limit lower than the request, a missing resource block, an image tag that's 'latest' in production. Build a smarter validator with Claude API.

DevOpsBoys · Jun 17, 2026 · 4 min read

helm lint and JSON Schema validation catch structural problems — wrong types, missing required fields. They don't catch logical problems: a values.yaml that's internally consistent and schema-valid but still wrong, like setting resources.limits.memory lower than resources.requests.memory, or shipping image.tag: latest to a production values file.

Let's build a validator that catches the logical mistakes a schema can't express.

Why JSON Schema Isn't Enough

yaml
# values-production.yaml — passes helm lint, passes JSON Schema validation, 
# still has 3 real problems
replicaCount: 1                    # no HA for "production"
image:
  repository: myapp
  tag: latest                       # never use "latest" in production; pin a version
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"                 # LOWER than request: Kubernetes rejects this spec as invalid
    cpu: "1000m"

A JSON Schema can enforce that resources.limits.memory exists and is a string matching a memory format. It can't easily express "must be numerically greater than or equal to requests.memory" without significant custom schema tooling, and it definitely can't flag "replicaCount of 1 is unusual for a file named values-production.yaml."
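The limit-versus-request comparison is also cheap to check deterministically before any model call. The following is a minimal sketch and not part of the original validator: `parse_memory` and `memory_limit_below_request` are hypothetical helper names, and the suffix table covers only the common Kubernetes memory suffixes.

```python
import re

# Multipliers for common Kubernetes memory suffixes (decimal and binary).
# Assumption: rarer suffixes (P, E, Pi, Ei, m) are out of scope for this sketch.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "512Mi" or "1G" to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", str(quantity).strip())
    if not match or match.group(2) not in _SUFFIXES:
        raise ValueError(f"Unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * _SUFFIXES[match.group(2)])

def memory_limit_below_request(resources: dict) -> bool:
    """Return True when both values are set and the limit is smaller than the request."""
    request = (resources.get("requests") or {}).get("memory")
    limit = (resources.get("limits") or {}).get("memory")
    if request is None or limit is None:
        return False
    return parse_memory(limit) < parse_memory(request)

# The resources block from the example values file above.
example = {
    "requests": {"memory": "512Mi", "cpu": "500m"},
    "limits": {"memory": "256Mi", "cpu": "1000m"},
}
print(memory_limit_below_request(example))  # True
```

A check like this handles the one rule that is purely numeric; the judgment calls, such as whether one replica is acceptable for a given file, are where the model adds value.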

Step 1: Render the Full Values Context

python
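import subprocess
import yaml

def _deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, approximating Helm's values merge."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def get_resolved_values(chart_path: str, values_file: str) -> dict:
    """Approximate the merged values (chart defaults + overrides) that Helm sees."""
    # `helm show values` prints only the chart's defaults and does not accept -f,
    # so the override file is merged in Python instead.
    result = subprocess.run(
        ["helm", "show", "values", chart_path],
        capture_output=True, text=True, check=True
    )
    defaults = yaml.safe_load(result.stdout) or {}
    with open(values_file) as f:
        overrides = yaml.safe_load(f) or {}
    return _deep_merge(defaults, overrides)

def get_rendered_manifests(chart_path: str, values_file: str, release_name: str) -> str:
    result = subprocess.run(
        ["helm", "template", release_name, chart_path, "-f", values_file],
        capture_output=True, text=True, check=True
    )
    return result.stdout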

Sending Claude the resolved values is better than sending the raw values.yaml alone — many real mistakes only become visible once defaults and overrides are merged.

Step 2: Build the Validation Prompt

python
from anthropic import Anthropic
import json
 
client = Anthropic()
 
def validate_helm_values(values_file_name: str, resolved_values: dict, rendered_manifests: str) -> dict:
    prompt = f"""You are a senior platform engineer reviewing a Helm values file 
named "{values_file_name}" before it's applied.
 
RESOLVED VALUES (after merging defaults + overrides):
{json.dumps(resolved_values, indent=2)}
 
RENDERED MANIFESTS (what Helm will actually apply):
{rendered_manifests[:6000]}
 
Check for these categories of problems:
1. Resource sanity: limits lower than requests, missing limits entirely, 
   suspiciously low/high values for the apparent workload type
2. Production readiness mismatches: if the filename suggests "production" or 
   "prod", flag low replica counts, "latest" image tags, missing PodDisruptionBudget, 
   missing liveness/readiness probes, missing anti-affinity rules
3. Security: privileged containers, missing securityContext, hostNetwork/hostPID 
   enabled without clear justification, secrets passed as plain env vars
4. Common typos/logic errors: a value that looks like it was meant to reference 
   another field but is hardcoded, inconsistent naming patterns suggesting copy-paste 
   errors from another environment's values
 
For each issue found, return:
- severity: critical, warning, or info
- field: the specific YAML path
- issue: what's wrong
- suggestion: the specific fix
 
Return as JSON: {{"issues": [...]}}. If nothing is wrong, return {{"issues": []}}."""
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    
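    # The model may wrap its JSON in a Markdown fence or surrounding prose,
    # so parse only the outermost {...} span.
    text = response.content[0].text
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object in model response: {text[:200]!r}")
    return json.loads(text[start:end + 1])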
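
The prompt above cuts the rendered manifests at 6,000 characters, which can split a resource in half or drop the Deployment entirely if it renders late. One alternative, sketched here under assumptions, is to keep whole YAML documents and put workload kinds first. `select_manifests`, the kind priority list, and the character budget are illustrative choices, not part of the original script.

```python
import re

# Hypothetical priority order: kinds most relevant to the prompt's checks come first.
PRIORITY_KINDS = ["Deployment", "StatefulSet", "DaemonSet", "PodDisruptionBudget", "Service"]
SEPARATOR = "\n---\n"

def _kind_of(document: str) -> str:
    """Read the top-level `kind:` field with a regex instead of a full YAML parse."""
    match = re.search(r"^kind:\s*(\S+)", document, flags=re.MULTILINE)
    return match.group(1) if match else ""

def select_manifests(rendered: str, budget: int = 6000) -> str:
    """Keep whole YAML documents, highest-priority kinds first, within a character budget."""
    # `helm template` separates resources with a line containing only "---".
    pieces = re.split(r"^---[ \t]*$", rendered, flags=re.MULTILINE)
    documents = [piece.strip() for piece in pieces if piece.strip()]

    def rank(document: str) -> int:
        kind = _kind_of(document)
        return PRIORITY_KINDS.index(kind) if kind in PRIORITY_KINDS else len(PRIORITY_KINDS)

    selected, used = [], 0
    for document in sorted(documents, key=rank):  # stable sort keeps render order on ties
        cost = len(document) + len(SEPARATOR)
        if used + cost > budget:
            continue  # skip documents that do not fit; later, smaller ones still can
        selected.append(document)
        used += cost
    return SEPARATOR.join(selected)
```

With this in place, the prompt would interpolate `select_manifests(rendered_manifests)` where it currently slices the string.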

Step 3: Wire It Into CI as a Pre-Merge Gate

yaml
# .github/workflows/helm-validate.yml
name: Validate Helm Values
on:
  pull_request:
    paths:
      - 'helm/**/values*.yaml'
 
jobs:
  ai-validate:
    runs-on: ubuntu-latest
    steps:
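      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Python dependencies
        run: pip install anthropic pyyaml
      - name: Run AI values validator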
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python scripts/validate_helm_values.py \
            --chart ./helm/myapp \
            --values ./helm/myapp/values-production.yaml \
            --fail-on critical

python
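# scripts/validate_helm_values.py
# Assumes get_resolved_values, get_rendered_manifests, and validate_helm_values
# from Steps 1 and 2 are defined earlier in this same file.
import sys
import argparse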
 
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--chart", required=True)
    parser.add_argument("--values", required=True)
    parser.add_argument("--fail-on", default="critical", choices=["critical", "warning"])
    args = parser.parse_args()
    
    resolved = get_resolved_values(args.chart, args.values)
    manifests = get_rendered_manifests(args.chart, args.values, "ci-validation")
    result = validate_helm_values(args.values, resolved, manifests)
    
    severity_order = {"critical": 0, "warning": 1, "info": 2}
    fail_threshold = severity_order[args.fail_on]
    
    should_fail = False
    for issue in result["issues"]:
        marker = "🔴" if issue["severity"] == "critical" else "🟡" if issue["severity"] == "warning" else "ℹ️"
        print(f"{marker} [{issue['severity'].upper()}] {issue['field']}: {issue['issue']}")
        print(f"   → {issue['suggestion']}")
        
        if severity_order.get(issue["severity"], 2) <= fail_threshold:
            should_fail = True
    
    if should_fail:
        print(f"\nValidation failed — issues at or above '{args.fail_on}' severity found.")
        sys.exit(1)
    
    print("\nNo blocking issues found.")
 
if __name__ == "__main__":
    main()
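
The loop in `main()` indexes `issue["severity"]`, `issue["field"]`, and the other keys directly, so a response that omits a key would crash the gate, and an invented severity label would be treated as `info` and pass silently. A defensive normalization pass is one option. This is a sketch: `normalize_issues` is a hypothetical helper, and mapping unknown severities to `warning` is an assumption about the desired policy.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}
REQUIRED_KEYS = ("severity", "field", "issue", "suggestion")

def normalize_issues(result) -> list[dict]:
    """Return issues with every required key present and a known severity."""
    issues = (result.get("issues") or []) if isinstance(result, dict) else []
    normalized = []
    for raw in issues:
        if not isinstance(raw, dict):
            continue  # ignore entries that are not JSON objects
        # Missing keys become empty strings so later formatting cannot raise KeyError.
        issue = {key: str(raw.get(key, "")).strip() for key in REQUIRED_KEYS}
        issue["severity"] = issue["severity"].lower()
        if issue["severity"] not in ALLOWED_SEVERITIES:
            issue["severity"] = "warning"  # assumed policy for unknown labels
        normalized.append(issue)
    return normalized
```

The loop would then iterate over `normalize_issues(result)` instead of `result["issues"]`.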

Sample Output on the Broken Example Above

🔴 [CRITICAL] resources.limits.memory: Limit (256Mi) is lower than request (512Mi) 
   — this pod will be OOMKilled almost immediately on startup since Kubernetes 
   uses the request for scheduling but enforces the limit as a hard ceiling.
   → Set limits.memory to at least 512Mi, ideally with headroom (e.g. 768Mi)

🟡 [WARNING] image.tag: "latest" used in a file named values-production.yaml
   → Pin to a specific version tag or SHA for production deployments — "latest" 
     makes rollbacks unpredictable since the tag's target can silently change

🟡 [WARNING] replicaCount: Set to 1 in a production values file
   → Consider at least 2-3 replicas for high availability, combined with a 
     PodDisruptionBudget to prevent all replicas being evicted simultaneously

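This is the kind of mistake that passes helm lint and schema validation and still causes a 2 AM page. The validator earns its keep on logical correctness, not syntax, which is the gap schema-based tools structurally can't close. Treat the model's explanations as advisory, though: the memory finding above is correct, but its stated consequence is not, because Kubernetes rejects a container whose memory limit is below its request as invalid instead of starting it and OOMKilling it.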

Generate Helm charts with AI too: Build a Helm Chart Generator with AI

