
Kubernetes OOMKilled: How I Fixed Out of Memory Errors in Production

OOMKilled crashes killing your pods? Here's the real cause, how to diagnose it fast, and the exact steps to fix it without breaking production.

DevOpsBoys · Mar 12, 2026 · 5 min read

It was 2 AM. Slack was blowing up. Pods were crashing every 3 minutes with OOMKilled. No logs, no warning — just a dead pod and an angry on-call engineer.

If you've been there, this post is for you. I'll walk through exactly what OOMKilled means, why it happens, how to diagnose it, and how to fix it permanently.

What Does OOMKilled Actually Mean?

OOMKilled stands for Out of Memory Killed. When a container exceeds its memory limit, the Linux kernel's OOM (Out Of Memory) Killer steps in and terminates the process. Kubernetes then reports this as OOMKilled in the pod status.

This is NOT a Kubernetes bug. It's working exactly as designed. You told Kubernetes "this container should use no more than X memory" — and when the container broke that promise, it got killed.

The real problem is usually one of three things:

  • Your memory limit is too low for what the app actually needs
  • Your app has a memory leak
  • A sudden traffic spike consumed more memory than expected

Understanding which one is your issue determines the fix.

Step 1 — Confirm It's OOMKilled

First, verify what's actually happening:

bash
kubectl get pods -n your-namespace

You'll see something like:

NAME                    READY   STATUS      RESTARTS   AGE
api-deployment-xyz      0/1     OOMKilled   5          12m

Get the detailed status:

bash
kubectl describe pod api-deployment-xyz -n your-namespace

Look for this in the output:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Thu, 12 Mar 2026 01:45:00 +0000
  Finished:     Thu, 12 Mar 2026 01:48:23 +0000

Exit code 137 confirms OOMKilled (128 + 9, where 9 is SIGKILL).
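You can reproduce that arithmetic locally without Kubernetes: SIGKILL any process and the shell reports the same status.

```shell
# Kill a background process with SIGKILL and read its exit status:
# the shell encodes it as 128 + signal number.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid" && status=0 || status=$?
echo "exit status: $status"   # 137 = 128 + 9 (SIGKILL)
```

Any exit code above 128 means the process was signal-terminated; 137 specifically means SIGKILL, which is how the OOM killer terminates processes.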

Step 2 — Check Current Memory Limits

bash
kubectl get pod api-deployment-xyz -n your-namespace -o jsonpath='{.spec.containers[*].resources}'

Or look at the deployment:

bash
kubectl get deployment api-deployment -n your-namespace -o yaml | grep -A 8 resources

You're looking for something like:

yaml
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"

Now check how much memory the app was actually using before it got killed:

bash
kubectl top pod api-deployment-xyz -n your-namespace

If the pod is already dead, check recent metrics:

bash
kubectl top pod --sort-by=memory -n your-namespace

Step 3 — Look at Historical Memory Usage

This is where most engineers skip a step and make things worse. Before you bump the limit, understand the pattern of memory usage.

If you have Prometheus + Grafana set up (you should), query:

promql
container_memory_working_set_bytes{pod=~"api-deployment.*", namespace="your-namespace"}

This shows you working set memory over time. Look for:

  • Gradual increase over hours → memory leak
  • Sudden spike at specific time → traffic event or batch job
  • Consistent near-limit usage → limit is just too low
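As a rough sketch of the first pattern: a leak shows up as samples that only ever climb, hour after hour. The values below are invented for illustration.

```shell
# Toy heuristic for the "gradual increase" pattern: if every sample is
# higher than the last, suspect a leak. Sample values (Mi) are made up.
samples="210 225 241 260 278"
prev=0
monotonic=yes
for s in $samples; do
  [ "$s" -gt "$prev" ] || monotonic=no
  prev=$s
done
echo "monotonic growth: $monotonic"
```

A spike or a too-low limit, by contrast, shows flat usage with sudden jumps or a plateau just under the limit.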

Step 4 — Fix Based on Root Cause

Case A: Limit is just too low

If your app consistently uses 200Mi but your limit is 256Mi, you're cutting it too close. Increase the limit with breathing room:

yaml
# deployment.yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Apply it:

bash
kubectl apply -f deployment.yaml

Rule of thumb: Set limits at 2x your average memory usage. This handles traffic spikes without wasting resources.
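The arithmetic behind that rule, using the 200Mi average from Case A:

```shell
# Apply the 2x rule to an observed 200Mi average (Case A above).
avg_mi=200
request_mi=$avg_mi            # request = average usage
limit_mi=$(( avg_mi * 2 ))    # limit = 2x average
echo "requests: ${request_mi}Mi, limits: ${limit_mi}Mi"
```

Rounding the computed limit up to a familiar size (512Mi in the deployment snippet above) is a common convention.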

Case B: Memory leak in the application

If memory grows steadily over time and never drops, you have a leak. The container restart is actually masking it — you need to fix the code.

As a temporary workaround while the code fix is being developed, you can configure a liveness probe that restarts the container when memory reaches a threshold. But this is a band-aid, not a solution.
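One way to sketch that workaround is an exec liveness probe running a script like the one below. The cgroup v2 path and the 400Mi threshold are assumptions, not a built-in Kubernetes feature; adjust both for your runtime, and keep the threshold below the container's limit so the probe fires before the OOM killer does.

```shell
# Hypothetical exec-probe helper: exit non-zero once usage crosses a
# threshold so kubelet restarts the container before the OOM killer does.
mem_under_limit() {
  usage=$(cat "$1")        # $1: file holding current usage in bytes
  [ "$usage" -le "$2" ]    # $2: threshold in bytes
}

# Inside the pod this would typically be invoked as:
#   mem_under_limit /sys/fs/cgroup/memory.current $((400 * 1024 * 1024))
```

On cgroup v1 nodes the usage file is `/sys/fs/cgroup/memory/memory.usage_in_bytes` instead.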

For Node.js apps, check the V8 heap size:

bash
kubectl exec -it api-deployment-xyz -n your-namespace -- node -e "console.log(process.memoryUsage())"

For Java apps, check for GC pressure (1 is the JVM's PID, usually PID 1 inside a container; 1000 is the sampling interval in milliseconds):

bash
kubectl exec -it api-deployment-xyz -n your-namespace -- jstat -gc 1 1000

Case C: Traffic spike

If OOMKilled happens at predictable high-traffic times, you need to either:

  1. Increase limits to handle peak traffic
  2. Add Horizontal Pod Autoscaling so more pods share the load:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-deployment-hpa
  namespace: your-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Apply it:

bash
kubectl apply -f hpa.yaml

Now when average memory utilization crosses 70% of the request value, Kubernetes adds replicas to spread the load instead of letting individual pods climb toward their limits.
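To make the trigger concrete, here is the scale-out point for the 256Mi request used earlier (integer shell math, so the value is approximate):

```shell
# Where the example HPA starts scaling: 70% of the per-pod memory request.
request_mi=256
target_pct=70
trigger_mi=$(( request_mi * target_pct / 100 ))
echo "scale-out begins around ${trigger_mi}Mi average per pod"
```

Note that utilization is measured against the request, not the limit, which is one more reason requests need to be realistic.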

Step 5 — Set Proper Resource Requests (Most Skipped Step)

Here's a mistake I see constantly: people set limits correctly but leave requests too low.

Requests tell the scheduler how much memory to reserve for the pod. If your request is 64Mi but your app needs 256Mi, the scheduler might pack too many pods onto a node, and they all end up fighting for memory.
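A quick sketch of why that goes wrong, with an assumed 4Gi-allocatable node, a 64Mi request, and 256Mi of actual usage per pod:

```shell
# The scheduler packs by request, not by actual usage.
node_mi=4096       # allocatable memory on the node (assumed)
request_mi=64      # what each pod asks for
actual_mi=256      # what each pod really uses
scheduled=$(( node_mi / request_mi ))    # pods the scheduler will place
needed_mi=$(( scheduled * actual_mi ))   # memory they collectively want
echo "$scheduled pods scheduled, needing ${needed_mi}Mi on a ${node_mi}Mi node"
```

The node is oversubscribed 4x, so the kernel starts killing processes as soon as real usage catches up with the requests' optimism.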

Set requests to your average usage, and limits to your peak + buffer:

yaml
resources:
  requests:
    memory: "256Mi"   # average usage
  limits:
    memory: "512Mi"   # ~2x average, headroom for peaks

Step 6 — Use LimitRange to Prevent Future Issues

If your team keeps forgetting to set resource limits, use a LimitRange to enforce defaults at the namespace level:

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: your-namespace
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "250m"
    type: Container

Apply it:

bash
kubectl apply -f limitrange.yaml

Now every container that doesn't specify limits will get these defaults automatically.

Quick Diagnostic Cheatsheet

Symptom                      Likely Cause           Fix
Crash every few hours        Memory leak            Fix code, add leak detection
Crash at peak traffic        Limit too low          Increase limits + add HPA
Crash immediately on start   Limit far too low      Check JVM/runtime startup memory
Multiple pods crashing       Node memory pressure   Check node capacity, add nodes

Monitoring Alerts to Set Up

Once fixed, add these Prometheus alerts so you catch it before it crashes:

yaml
- alert: PodMemoryNearLimit
  expr: |
    (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod memory above 85% of limit"
    description: "{{ $labels.pod }} in {{ $labels.namespace }} is at {{ $value | humanizePercentage }} memory"

This fires when any pod reaches 85% of its memory limit, giving you time to act before the pod is OOMKilled.

Wrapping Up

OOMKilled sounds scary but it's one of the more fixable Kubernetes issues once you understand the cause. The key steps:

  1. Confirm with kubectl describe and exit code 137
  2. Check current limits vs actual usage
  3. Look at the memory pattern (leak vs spike vs too-low limit)
  4. Fix with the right approach for your cause
  5. Add HPA for traffic-driven spikes
  6. Set LimitRange to prevent future config gaps

Want to go deeper on Kubernetes resource management and troubleshooting? The KodeKloud Kubernetes course is the best hands-on resource I've found — they have dedicated labs for exactly these production scenarios.
