Kubernetes OOMKilled: How I Fixed Out of Memory Errors in Production
OOMKilled crashes killing your pods? Here's the real cause, how to diagnose it fast, and the exact steps to fix it without breaking production.
It was 2 AM. Slack was blowing up. Pods were crashing every 3 minutes with OOMKilled. No logs, no warning — just a dead pod and an angry on-call engineer.
If you've been there, this post is for you. I'll walk through exactly what OOMKilled means, why it happens, how to diagnose it, and how to fix it permanently.
What Does OOMKilled Actually Mean?
OOMKilled stands for Out of Memory Killed. When a container exceeds its memory limit, the Linux kernel's OOM (Out Of Memory) Killer steps in and terminates the process. Kubernetes then reports this as OOMKilled in the pod status.
This is NOT a Kubernetes bug. It's working exactly as designed. You told Kubernetes "this container should use no more than X memory" — and when the container broke that promise, it got killed.
The real problem is usually one of three things:
- Your memory limit is too low for what the app actually needs
- Your app has a memory leak
- A sudden traffic spike consumed more memory than expected
Understanding which one is your issue determines the fix.
Step 1 — Confirm It's OOMKilled
First, verify what's actually happening:
```
kubectl get pods -n your-namespace
```
You'll see something like:
```
NAME                 READY   STATUS      RESTARTS   AGE
api-deployment-xyz   0/1     OOMKilled   5          12m
```
Get the detailed status:
```
kubectl describe pod api-deployment-xyz -n your-namespace
```
Look for this in the output:
```
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
  Started:   Thu, 12 Mar 2026 01:45:00 +0000
  Finished:  Thu, 12 Mar 2026 01:48:23 +0000
```
Exit code 137 confirms OOMKilled (128 + 9, where 9 is SIGKILL).
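The 128 + signal convention applies to any exit code above 128, so you can decode other termination signals the same way. A minimal sketch (the `decode_exit` helper name is my own, not a kubectl feature):

```shell
# Decode a container exit code: codes above 128 mean the process was
# killed by signal (code - 128); lower codes are the app's own exit status.
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application exited with code $code"
  fi
}

decode_exit 137   # signal 9  -> SIGKILL, the OOM killer's signature
decode_exit 143   # signal 15 -> SIGTERM, a normal pod shutdown
decode_exit 1     # a plain application error, not a kill
```

If you see 143 instead of 137, the pod was terminated gracefully, and OOM is not your problem.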
Step 2 — Check Current Memory Limits
```
kubectl get pod api-deployment-xyz -n your-namespace -o jsonpath='{.spec.containers[*].resources}'
```
Or look at the deployment:
```
kubectl get deployment api-deployment -n your-namespace -o yaml | grep -A 8 resources
```
You're looking for something like:
```yaml
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"
```
Now check how much memory the app was actually using before it got killed:
```
kubectl top pod api-deployment-xyz -n your-namespace
```
If the pod is already dead, check recent metrics:
```
kubectl top pod --sort-by=memory -n your-namespace
```
Step 3 — Look at Historical Memory Usage
This is where most engineers skip a step and make things worse. Before you bump the limit, understand the pattern of memory usage.
If you have Prometheus and Grafana set up (you should), query:
```
container_memory_working_set_bytes{pod=~"api-deployment.*", namespace="your-namespace"}
```
This shows you working set memory over time. Look for:
- Gradual increase over hours → memory leak
- Sudden spike at specific time → traffic event or batch job
- Consistent near-limit usage → limit is just too low
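You can also ask Prometheus for the slope of the working set directly instead of eyeballing the graph. A sketch using the standard PromQL `deriv()` function (the label values are placeholders from the query above):

```
# Sustained positive slope over hours suggests a leak; a flat line that
# sits near the limit suggests the limit is simply too low.
deriv(container_memory_working_set_bytes{pod=~"api-deployment.*", namespace="your-namespace"}[1h])
```

A value that stays positive across many one-hour windows, even when traffic is flat, is the classic leak signature.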
Step 4 — Fix Based on Root Cause
Case A: Limit is just too low
If your app consistently uses 200Mi but your limit is 256Mi, you're cutting it too close. Increase the limit with breathing room:
```yaml
# deployment.yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```
Apply it:
```
kubectl apply -f deployment.yaml
```
Rule of thumb: Set limits at 2x your average memory usage. This handles traffic spikes without wasting resources.
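The 2x rule is easy to script if you're updating many deployments at once. A minimal sketch (the `suggest_limits` helper is mine; it takes the observed average usage in Mi and rounds the limit up to a conventional power-of-two size):

```shell
# Rule of thumb as a function: request = observed average usage,
# limit = 2x the average, rounded up to the next power of two
# so values stay conventional (256Mi, 512Mi, 1024Mi, ...).
suggest_limits() {
  avg=$1
  limit=$(( avg * 2 ))
  rounded=1
  while [ "$rounded" -lt "$limit" ]; do
    rounded=$(( rounded * 2 ))
  done
  echo "requests.memory=${avg}Mi limits.memory=${rounded}Mi"
}

suggest_limits 200   # requests.memory=200Mi limits.memory=512Mi
```

Feed it the average you read off Grafana in Step 3, and paste the result into the deployment manifest.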
Case B: Memory leak in the application
If memory grows steadily over time and never drops, you have a leak. The container restart is actually masking it — you need to fix the code.
As a temporary workaround while the code fix is being developed, you can configure a liveness probe that restarts the pod when memory reaches a threshold. But this is a band-aid, not a solution.
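One way to sketch that workaround is an exec liveness probe that checks the container's own cgroup memory file. This assumes a cgroup v2 node, a shell in the image, and a 512Mi limit; the 90% threshold (483183820 bytes) and timings are illustrative, so verify the path and numbers in your environment:

```yaml
# Band-aid only: restart the container once cgroup memory crosses
# ~90% of a 512Mi limit, before the kernel OOM killer does it abruptly.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "[ $(cat /sys/fs/cgroup/memory.current) -lt 483183820 ]"
  periodSeconds: 30
  failureThreshold: 2
```

This at least turns a hard SIGKILL into a controlled restart, but the leak is still there.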
For Node.js apps, check the V8 heap size:
```
kubectl exec -it api-deployment-xyz -- node -e "console.log(process.memoryUsage())"
```
For Java apps, check for GC pressure:
```
kubectl exec -it api-deployment-xyz -- jstat -gc 1 1000
```
Case C: Traffic spike
If OOMKilled happens at predictable high-traffic times, you need to either:
- Increase limits to handle peak traffic
- Add Horizontal Pod Autoscaling so more pods share the load:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-deployment-hpa
  namespace: your-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
Apply it:
```
kubectl apply -f hpa.yaml
```
Now when memory crosses 70% of the request value, Kubernetes scales out instead of killing pods.
Step 5 — Set Proper Resource Requests (Most Skipped Step)
Here's a mistake I see constantly: people set limits correctly but leave requests too low.
Requests tell the scheduler how much memory to reserve for the pod. If your request is 64Mi but your app needs 256Mi, the scheduler might pack too many pods onto a node, and they all end up fighting for memory.
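To see why this bites, run the numbers for a hypothetical node (the figures here are illustrative, not from the incident above):

```shell
# A node with 4Gi allocatable memory, pods that request 64Mi
# but actually use 256Mi at steady state:
allocatable=4096   # Mi, what the node offers
request=64         # Mi, what the scheduler reserves per pod
actual=256         # Mi, what each pod really uses

echo "scheduler will place: $(( allocatable / request )) pods"   # 64 pods
echo "node can really hold: $(( allocatable / actual )) pods"    # 16 pods
```

The scheduler happily packs 4x more pods than the node can sustain, and the OOM killer cleans up the difference.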
Set requests to your average usage, and limits to your peak + buffer:
```yaml
resources:
  requests:
    memory: "256Mi"   # average usage
  limits:
    memory: "512Mi"   # peak + buffer
```
Step 6 — Use LimitRange to Prevent Future Issues
If your team keeps forgetting to set resource limits, use a LimitRange to enforce defaults at the namespace level:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: your-namespace
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "250m"
    type: Container
```
Apply it:
```
kubectl apply -f limitrange.yaml
```
Now every container that doesn't specify limits will get these defaults automatically.
Quick Diagnostic Cheatsheet
| Symptom | Likely Cause | Fix |
|---|---|---|
| Crash every few hours | Memory leak | Fix code, add leak detection |
| Crash at peak traffic | Limit too low | Increase limits + add HPA |
| Crash immediately on start | Limit far too low | Check JVM/runtime startup memory |
| Multiple pods crashing | Node memory pressure | Check node capacity, add nodes |
Monitoring Alerts to Set Up
Once fixed, add these Prometheus alerts so you catch it before it crashes:
```yaml
- alert: PodMemoryNearLimit
  expr: |
    (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod memory above 85% of limit"
    description: "{{ $labels.pod }} in {{ $labels.namespace }} is at {{ $value | humanizePercentage }} memory"
```
This fires when any pod reaches 85% of its memory limit — giving you time to act before it OOMKills.
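It's also worth alerting on OOM kills that already happened, so a 2 AM restart doesn't go unnoticed. A sketch that assumes kube-state-metrics is installed (it exposes the last-terminated-reason metric used here):

```yaml
# Requires kube-state-metrics; fires whenever a container's most
# recent termination was an OOM kill.
- alert: PodOOMKilled
  expr: |
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  labels:
    severity: critical
  annotations:
    summary: "Container was OOMKilled"
    description: "{{ $labels.container }} in {{ $labels.pod }} ({{ $labels.namespace }}) was OOMKilled"
```

The first alert is the early warning; this one is the audit trail.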
Wrapping Up
OOMKilled sounds scary but it's one of the more fixable Kubernetes issues once you understand the cause. The key steps:
- Confirm with kubectl describe and exit code 137
- Check current limits vs actual usage
- Look at the memory pattern (leak vs spike vs too-low limit)
- Fix with the right approach for your cause
- Add HPA for traffic-driven spikes
- Set LimitRange to prevent future config gaps
Want to go deeper on Kubernetes resource management and troubleshooting? The KodeKloud Kubernetes course is the best hands-on resource I've found — they have dedicated labs for exactly these production scenarios.