
Kubernetes OOMKilled: How I Fixed Out of Memory Errors in Production

OOMKilled crashes killing your pods? Here's the real cause, how to diagnose it fast, and the exact steps to fix it without breaking production.

DevOpsBoys · Mar 12, 2026 · 5 min read

It was 2 AM. Slack was blowing up. Pods were crashing every 3 minutes with OOMKilled. No logs, no warning — just a dead pod and an angry on-call engineer.

If you've been there, this post is for you. I'll walk through exactly what OOMKilled means, why it happens, how to diagnose it, and how to fix it permanently.

What Does OOMKilled Actually Mean?

OOMKilled stands for Out of Memory Killed. When a container exceeds its memory limit, the Linux kernel's OOM (Out Of Memory) Killer steps in and terminates the process. Kubernetes then reports this as OOMKilled in the pod status.

This is NOT a Kubernetes bug. It's working exactly as designed. You told Kubernetes "this container should use no more than X memory" — and when the container broke that promise, it got killed.

The real problem is usually one of three things:

  • Your memory limit is too low for what the app actually needs
  • Your app has a memory leak
  • A sudden traffic spike consumed more memory than expected

Understanding which one is your issue determines the fix.

Step 1 — Confirm It's OOMKilled

First, verify what's actually happening:

bash
kubectl get pods -n your-namespace

You'll see something like:

NAME                    READY   STATUS      RESTARTS   AGE
api-deployment-xyz      0/1     OOMKilled   5          12m

Get the detailed status:

bash
kubectl describe pod api-deployment-xyz -n your-namespace

Look for this in the output:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Thu, 12 Mar 2026 01:45:00 +0000
  Finished:     Thu, 12 Mar 2026 01:48:23 +0000

Exit code 137 confirms OOMKilled (128 + 9, where 9 is SIGKILL).
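You can reproduce that arithmetic locally without Kubernetes: SIGKILL any process and the shell reports the same status.

```shell
# Kill a background process with SIGKILL and read its exit status:
# the shell encodes it as 128 + signal number.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid" && status=0 || status=$?
echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)
```

Any exit code above 128 means the process was signal-terminated; 137 specifically means SIGKILL, which is how the OOM killer terminates processes.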

Step 2 — Check Current Memory Limits

bash
kubectl get pod api-deployment-xyz -n your-namespace -o jsonpath='{.spec.containers[*].resources}'

Or look at the deployment:

bash
kubectl get deployment api-deployment -n your-namespace -o yaml | grep -A 8 resources

You're looking for something like:

yaml
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"

Now check how much memory the app was actually using before it got killed:

bash
kubectl top pod api-deployment-xyz -n your-namespace

If the pod is already dead, check recent metrics:

bash
kubectl top pod --sort-by=memory -n your-namespace

Step 3 — Look at Historical Memory Usage

This is where most engineers skip a step and make things worse. Before you bump the limit, understand the pattern of memory usage.

If you have Prometheus + Grafana set up (you should), query:

promql
container_memory_working_set_bytes{pod=~"api-deployment.*", namespace="your-namespace"}

This shows you working set memory over time. Look for:

  • Gradual increase over hours → memory leak
  • Sudden spike at specific time → traffic event or batch job
  • Consistent near-limit usage → limit is just too low
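As a rough sketch of the first pattern: a leak shows up as samples that only ever climb, hour after hour. The values below are invented for illustration.

```shell
# Toy heuristic for the "gradual increase" pattern: if every sample is
# higher than the last, suspect a leak. Sample values (Mi) are made up.
samples="210 225 241 260 278"
prev=0
monotonic=yes
for s in $samples; do
  [ "$s" -gt "$prev" ] || monotonic=no
  prev=$s
done
echo "monotonic growth: $monotonic"
```

A spike or a too-low limit, by contrast, shows flat usage with sudden jumps or a plateau just under the limit.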

Step 4 — Fix Based on Root Cause

Case A: Limit is just too low

If your app consistently uses 200Mi but your limit is 256Mi, you're cutting it too close. Increase the limit with breathing room:

yaml
# deployment.yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Apply it:

bash
kubectl apply -f deployment.yaml

Rule of thumb: Set limits at 2x your average memory usage. This handles traffic spikes without wasting resources.
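The arithmetic behind that rule, using the 200Mi average from Case A:

```shell
# Apply the 2x rule to an observed 200Mi average (Case A above).
avg_mi=200
request_mi=$avg_mi            # request = average usage
limit_mi=$(( avg_mi * 2 ))    # limit = 2x average
echo "requests: ${request_mi}Mi, limits: ${limit_mi}Mi"
```

Rounding the computed limit up to a familiar size (512Mi in the deployment snippet above) is a common convention.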

Case B: Memory leak in the application

If memory grows steadily over time and never drops, you have a leak. The container restart is actually masking it — you need to fix the code.

As a temporary workaround while the code fix is being developed, you can configure a liveness probe that restarts the container when memory reaches a threshold. But this is a band-aid, not a solution.
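One way to sketch that workaround is an exec liveness probe running a script like the one below. The cgroup v2 path and the 400Mi threshold are assumptions, not a built-in Kubernetes feature; adjust both for your runtime, and keep the threshold below the container's limit so the probe fires before the OOM killer does.

```shell
# Hypothetical exec-probe helper: exit non-zero once usage crosses a
# threshold so kubelet restarts the container before the OOM killer does.
mem_under_limit() {
  usage=$(cat "$1")        # $1: file holding current usage in bytes
  [ "$usage" -le "$2" ]    # $2: threshold in bytes
}

# Inside the pod this would typically be invoked as:
#   mem_under_limit /sys/fs/cgroup/memory.current $((400 * 1024 * 1024))
```

On cgroup v1 nodes the usage file is `/sys/fs/cgroup/memory/memory.usage_in_bytes` instead.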

For Node.js apps, check the V8 heap size:

bash
kubectl exec -it api-deployment-xyz -n your-namespace -- node -e "console.log(process.memoryUsage())"

For Java apps, check for GC pressure (1 is the JVM's PID, usually PID 1 inside a container; 1000 is the sampling interval in milliseconds):

bash
kubectl exec -it api-deployment-xyz -n your-namespace -- jstat -gc 1 1000

Case C: Traffic spike

If OOMKilled happens at predictable high-traffic times, you need to either:

  1. Increase limits to handle peak traffic
  2. Add Horizontal Pod Autoscaling so more pods share the load:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-deployment-hpa
  namespace: your-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Apply it:

bash
kubectl apply -f hpa.yaml

Now when average memory utilization crosses 70% of the request value, Kubernetes adds replicas to spread the load instead of letting individual pods climb toward their limits.
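To make the trigger concrete, here is the scale-out point for the 256Mi request used earlier (integer shell math, so the value is approximate):

```shell
# Where the example HPA starts scaling: 70% of the per-pod memory request.
request_mi=256
target_pct=70
trigger_mi=$(( request_mi * target_pct / 100 ))
echo "scale-out begins around ${trigger_mi}Mi average per pod"
```

Note that utilization is measured against the request, not the limit, which is one more reason requests need to be realistic.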

Step 5 — Set Proper Resource Requests (Most Skipped Step)

Here's a mistake I see constantly: people set limits correctly but leave requests too low.

Requests tell the scheduler how much memory to reserve for the pod. If your request is 64Mi but your app needs 256Mi, the scheduler might pack too many pods onto a node, and they all end up fighting for memory.
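A quick sketch of why that goes wrong, with an assumed 4Gi-allocatable node, a 64Mi request, and 256Mi of actual usage per pod:

```shell
# The scheduler packs by request, not by actual usage.
node_mi=4096       # allocatable memory on the node (assumed)
request_mi=64      # what each pod asks for
actual_mi=256      # what each pod really uses
scheduled=$(( node_mi / request_mi ))    # pods the scheduler will place
needed_mi=$(( scheduled * actual_mi ))   # memory they collectively want
echo "$scheduled pods scheduled, needing ${needed_mi}Mi on a ${node_mi}Mi node"
```

The node is oversubscribed 4x, so the kernel starts killing processes as soon as real usage catches up with the requests' optimism.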

Set requests to your average usage, and limits to your peak + buffer:

yaml
resources:
  requests:
    memory: "256Mi"   # average usage
  limits:
    memory: "512Mi"   # ~2x average, headroom for peaks

Step 6 — Use LimitRange to Prevent Future Issues

If your team keeps forgetting to set resource limits, use a LimitRange to enforce defaults at the namespace level:

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: your-namespace
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "250m"
    type: Container

Apply it:

bash
kubectl apply -f limitrange.yaml

Now every container that doesn't specify limits will get these defaults automatically.

Quick Diagnostic Cheatsheet

Symptom                      Likely Cause           Fix
Crash every few hours        Memory leak            Fix code, add leak detection
Crash at peak traffic        Limit too low          Increase limits + add HPA
Crash immediately on start   Limit far too low      Check JVM/runtime startup memory
Multiple pods crashing       Node memory pressure   Check node capacity, add nodes

Monitoring Alerts to Set Up

Once fixed, add these Prometheus alerts so you catch it before it crashes:

yaml
- alert: PodMemoryNearLimit
  expr: |
    (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod memory above 85% of limit"
    description: "{{ $labels.pod }} in {{ $labels.namespace }} is at {{ $value | humanizePercentage }} memory"

This fires when any pod reaches 85% of its memory limit, giving you time to act before the pod is OOMKilled.

Wrapping Up

OOMKilled sounds scary but it's one of the more fixable Kubernetes issues once you understand the cause. The key steps:

  1. Confirm with kubectl describe and exit code 137
  2. Check current limits vs actual usage
  3. Look at the memory pattern (leak vs spike vs too-low limit)
  4. Fix with the right approach for your cause
  5. Add HPA for traffic-driven spikes
  6. Set LimitRange to prevent future config gaps

Want to go deeper on Kubernetes resource management and troubleshooting? The KodeKloud Kubernetes course is the best hands-on resource I've found — they have dedicated labs for exactly these production scenarios.
