Build an AI Helm Values Validator with Claude API
Helm's schema validation catches type errors but misses logical mistakes: a memory limit lower than the request, a missing resource block, an image tag that's 'latest' in production. Build a smarter validator with the Claude API.
helm lint and JSON Schema validation catch structural problems: wrong types, missing required fields. They don't catch logical problems: a values.yaml that's internally consistent and schema-valid but still wrong, like setting resources.limits.memory lower than resources.requests.memory, or shipping image.tag: latest in a production values file.
Let's build a validator that catches the logical mistakes a schema can't express.
Why JSON Schema Isn't Enough
# values-production.yaml: passes helm lint, passes JSON Schema validation,
# still has 3 real problems
replicaCount: 1          # no HA for "production"
image:
  repository: myapp
  tag: latest            # never ship "latest" to production
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"      # LOWER than the request: Kubernetes rejects this pod spec
    cpu: "1000m"

A JSON Schema can enforce that resources.limits.memory exists and is a string matching a memory format. It can't easily express "must be numerically greater than or equal to requests.memory" without significant custom schema tooling, and it definitely can't flag "replicaCount of 1 is unusual for a file named values-production.yaml."
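The numeric comparison itself is easy once the quantities are parsed into bytes. Here's a minimal sketch, assuming only the common decimal and binary suffixes; `parse_memory` and `limit_below_request` are hypothetical helper names, not part of Helm or any Kubernetes library:

```python
import re

# Multipliers for common Kubernetes memory suffixes (a subset, for illustration).
UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "512Mi" into a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if not match or match.group(2) not in UNITS:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])

def limit_below_request(resources: dict) -> bool:
    """Return True when limits.memory is numerically lower than requests.memory."""
    request = resources.get("requests", {}).get("memory")
    limit = resources.get("limits", {}).get("memory")
    if request is None or limit is None:
        return False  # nothing to compare
    return parse_memory(limit) < parse_memory(request)

broken = {"requests": {"memory": "512Mi"}, "limits": {"memory": "256Mi"}}
print(limit_below_request(broken))  # True
```

A rule like this only covers the failure modes you thought of in advance. The contextual checks, such as a replica count that is wrong only because the file is meant for production, are where a model-based reviewer is more useful.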
Step 1: Render the Full Values Context
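import subprocess
import yaml

def deep_merge(base: dict, override: dict) -> dict:
    """Overlay override on base: nested maps merge, everything else is replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def get_resolved_values(chart_path: str, values_file: str) -> dict:
    """Get the merged values (chart defaults + overrides), approximating what Helm sees."""
    # `helm show values` prints only the chart defaults and takes no -f flag,
    # so the override file is merged on top in Python.
    result = subprocess.run(
        ["helm", "show", "values", chart_path],
        capture_output=True, text=True, check=True
    )
    defaults = yaml.safe_load(result.stdout) or {}
    with open(values_file) as f:
        overrides = yaml.safe_load(f) or {}
    return deep_merge(defaults, overrides)

def get_rendered_manifests(chart_path: str, values_file: str, release_name: str) -> str:
    """Render the chart with the override file applied."""
    result = subprocess.run(
        ["helm", "template", release_name, chart_path, "-f", values_file],
        capture_output=True, text=True, check=True
    )
    return result.stdout

Sending Claude the resolved values is better than sending the raw values.yaml alone: many real mistakes only become visible once defaults and overrides are merged.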
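To see why the merged view matters, picture an override file that raises only the memory request while the chart's default limit stays put. The sketch below uses made-up values and a hypothetical `merge_values` helper that imitates Helm's map-by-map override behaviour; it stands in for whatever merge the validator actually performs:

```python
def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults (nested maps merge, other values replace)."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical chart defaults and production override, for illustration only.
chart_defaults = {"resources": {"requests": {"memory": "128Mi"}, "limits": {"memory": "256Mi"}}}
prod_overrides = {"resources": {"requests": {"memory": "512Mi"}}}

resolved = merge_values(chart_defaults, prod_overrides)
print(resolved["resources"])
# {'requests': {'memory': '512Mi'}, 'limits': {'memory': '256Mi'}}
```

Neither file looks wrong on its own; the request-above-limit conflict exists only in `resolved`.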
Step 2: Build the Validation Prompt
from anthropic import Anthropic
import json

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def validate_helm_values(values_file_name: str, resolved_values: dict, rendered_manifests: str) -> dict:
    """Ask Claude to review the merged values and rendered manifests; returns {"issues": [...]}."""
    # Only the first 6000 characters of the rendered manifests go into the prompt.
    prompt = f"""You are a senior platform engineer reviewing a Helm values file
named "{values_file_name}" before it's applied.

RESOLVED VALUES (after merging defaults + overrides):
{json.dumps(resolved_values, indent=2)}

RENDERED MANIFESTS (what Helm will actually apply):
{rendered_manifests[:6000]}

Check for these categories of problems:
1. Resource sanity: limits lower than requests, missing limits entirely,
   suspiciously low/high values for the apparent workload type
2. Production readiness mismatches: if the filename suggests "production" or
   "prod", flag low replica counts, "latest" image tags, missing PodDisruptionBudget,
   missing liveness/readiness probes, missing anti-affinity rules
3. Security: privileged containers, missing securityContext, hostNetwork/hostPID
   enabled without clear justification, secrets passed as plain env vars
4. Common typos/logic errors: a value that looks like it was meant to reference
   another field but is hardcoded, inconsistent naming patterns suggesting copy-paste
   errors from another environment's values

For each issue found, return:
- severity: critical, warning, or info
- field: the specific YAML path
- issue: what's wrong
- suggestion: the specific fix

Return as JSON: {{"issues": [...]}}. If nothing is wrong, return {{"issues": []}}."""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.content[0].text)

Step 3: Wire It Into CI as a Pre-Merge Gate
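A merge gate shouldn't crash just because the model wrapped its JSON in a Markdown fence or added a sentence around it. Before relying on the reply in CI, it may be worth parsing it defensively. A minimal sketch, where `parse_issues` is a hypothetical helper that takes the reply text:

```python
import json
import re

REQUIRED_FIELDS = {"severity", "field", "issue", "suggestion"}

def parse_issues(reply_text: str) -> list:
    """Extract the "issues" list from a model reply, tolerating fences and stray prose."""
    # Take the outermost {...} span so anything around the JSON object is ignored.
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    payload = json.loads(match.group(0))
    issues = payload.get("issues")
    if not isinstance(issues, list):
        raise ValueError('reply JSON has no "issues" list')
    # Fail loudly on malformed entries rather than silently dropping a finding.
    for item in issues:
        if not (isinstance(item, dict) and REQUIRED_FIELDS.issubset(item)):
            raise ValueError(f"malformed issue entry: {item!r}")
    return issues

fence = "`" * 3  # built at runtime to keep a literal fence out of this listing
reply = fence + 'json\n{"issues": [{"severity": "warning", "field": "image.tag", "issue": "uses latest", "suggestion": "pin a version"}]}\n' + fence
print(parse_issues(reply)[0]["field"])  # image.tag
```

Raising on a malformed reply makes the CI job fail visibly, which is usually the safer default for a gate than treating an unreadable answer as "no issues".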
# .github/workflows/helm-validate.yml
name: Validate Helm Values
on:
  pull_request:
    paths:
      - 'helm/**/values*.yaml'
jobs:
  ai-validate:
    runs-on: ubuntu-latest   # helm is assumed to be available on the runner's PATH
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install validator dependencies
        run: pip install anthropic pyyaml
      - name: Run AI values validator
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python scripts/validate_helm_values.py \
            --chart ./helm/myapp \
            --values ./helm/myapp/values-production.yaml \
            --fail-on critical

# scripts/validate_helm_values.py
import sys
import argparse

# get_resolved_values and get_rendered_manifests (Step 1) and
# validate_helm_values (Step 2) are assumed to live in this same file.

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--chart", required=True)
    parser.add_argument("--values", required=True)
    parser.add_argument("--fail-on", default="critical", choices=["critical", "warning"])
    args = parser.parse_args()

    resolved = get_resolved_values(args.chart, args.values)
    manifests = get_rendered_manifests(args.chart, args.values, "ci-validation")
    result = validate_helm_values(args.values, resolved, manifests)

    # Lower number = more severe; anything at or above the threshold fails the job.
    severity_order = {"critical": 0, "warning": 1, "info": 2}
    fail_threshold = severity_order[args.fail_on]
    markers = {"critical": "🔴", "warning": "🟡"}
    should_fail = False

    for issue in result["issues"]:
        marker = markers.get(issue["severity"], "ℹ️")
        print(f"{marker} [{issue['severity'].upper()}] {issue['field']}: {issue['issue']}")
        print(f"   → {issue['suggestion']}")
        # Unknown severities are treated as "info" and never block.
        if severity_order.get(issue["severity"], 2) <= fail_threshold:
            should_fail = True

    if should_fail:
        print(f"\nValidation failed: issues at or above '{args.fail_on}' severity found.")
        sys.exit(1)
    print("\nNo blocking issues found.")

if __name__ == "__main__":
    main()

Sample Output on the Broken Example Above
🔴 [CRITICAL] resources.limits.memory: Limit (256Mi) is lower than request (512Mi);
   Kubernetes rejects a container whose memory request exceeds its limit,
   so these pods will never be created.
   → Set limits.memory to at least 512Mi, ideally with headroom (e.g. 768Mi)
🟡 [WARNING] image.tag: "latest" used in a file named values-production.yaml
   → Pin to a specific version tag or SHA for production deployments; "latest"
     makes rollbacks unpredictable since the tag's target can silently change
🟡 [WARNING] replicaCount: Set to 1 in a production values file
   → Consider at least 2-3 replicas for high availability, combined with a
     PodDisruptionBudget to prevent all replicas being evicted simultaneously
This catches exactly the kind of mistake that passes every automated lint check and still causes a 2 AM page. The validator earns its keep on logical correctness, not syntax, which is the gap schema-based tools structurally can't close.
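In a pull-request gate, the same findings can also show up as annotations instead of only as log lines. A sketch using GitHub Actions workflow commands (`::error`, `::warning`, `::notice`); `to_annotation` and `escape` are hypothetical helpers, and the severity-to-level mapping is an assumption rather than part of the script above:

```python
# Map the validator's severities onto GitHub Actions annotation levels.
LEVELS = {"critical": "error", "warning": "warning", "info": "notice"}

def escape(text: str, is_property: bool = False) -> str:
    """Percent-encode characters that would break a single-line workflow command."""
    text = text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")
    if is_property:
        # Property values additionally need ':' and ',' encoded.
        text = text.replace(":", "%3A").replace(",", "%2C")
    return text

def to_annotation(issue: dict, values_file: str) -> str:
    """Format one validator issue as a workflow command attached to the values file."""
    level = LEVELS.get(issue["severity"], "notice")
    message = escape(f"{issue['issue']} Suggestion: {issue['suggestion']}")
    return (
        f"::{level} file={escape(values_file, True)},"
        f"title={escape(issue['field'], True)}::{message}"
    )

example = {
    "severity": "critical",
    "field": "resources.limits.memory",
    "issue": "Limit (256Mi) is lower than request (512Mi).",
    "suggestion": "Set limits.memory to at least 512Mi.",
}
print(to_annotation(example, "helm/myapp/values-production.yaml"))
```

Printing these lines from the validation step is enough for GitHub to pick them up; the exit code still decides whether the job blocks the merge.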
Generate Helm charts with AI too: Build a Helm Chart Generator with AI