
Kubernetes Job / CronJob Not Completing — Causes and Fixes (2026)

Kubernetes Job stuck in Running, CronJob never triggering, or pods completing but Job shows Failed? Here are the real causes and fixes for each scenario.

DevOpsBoys · Apr 16, 2026 · 5 min read

Jobs and CronJobs behave differently from regular Deployments. When they break, the error messages are often misleading. Here's every common failure mode and how to fix it.

Quick Diagnosis

bash
# Check Job status
kubectl get jobs -n my-namespace
kubectl describe job my-job -n my-namespace
 
# Check pods created by the Job
kubectl get pods -n my-namespace --selector=job-name=my-job
 
# Check CronJob status
kubectl get cronjobs -n my-namespace
kubectl describe cronjob my-cronjob -n my-namespace
 
# Recent jobs created by the CronJob (they're named <cronjob-name>-<timestamp>;
# there is no cronjob=... label by default, so filter by name)
kubectl get jobs -n my-namespace | grep my-cronjob

Cause 1: Pod Exits with Non-Zero Code → Job Marked Failed

A Job succeeds only when its pod exits with code 0. Any other exit code = failure.

Symptom:

bash
kubectl get jobs
# NAME      COMPLETIONS   DURATION   AGE
# my-job    0/1           5m         5m   ← never completes

Check pod logs:

bash
kubectl logs -n my-namespace -l job-name=my-job
# (add --previous only if restartPolicy: OnFailure restarted the container in place)
# Error: connection refused to postgres:5432

Fix: Find the real error in the logs. Common causes:

  • App bug (unhandled exception)
  • Missing environment variable
  • Database/dependency not ready

Also make sure your entrypoint propagates failures instead of masking them (appending && exit 0 is redundant at best):

yaml
# set -e aborts on the first failing command, so the pod exits non-zero on error
command: ["sh", "-c", "set -e; python run_job.py"]
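A subtle variant of this: a multi-command `sh -c` script reports the exit code of its *last* command, so an earlier failure can be silently masked. Easy to reproduce locally with plain sh, no cluster needed:

```shell
# Without set -e, the last command's status wins and earlier failures are masked:
sh -c 'false; true' && echo "exit 0: Job would look successful"

# With set -e, the first failure aborts the script with its own exit code:
sh -c 'set -e; false; true' || echo "exit $?: Job correctly fails"
```

This is exactly the behavior the Job controller sees: only the container's final exit code counts.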

Cause 2: backoffLimit Exhausted — Job Keeps Retrying

By default backoffLimit is 6: the Job retries the pod up to 6 times before marking itself Failed.

bash
kubectl describe job my-job
# Warning  BackoffLimitExceeded  Job has reached the specified backoff limit
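Those retries back off exponentially (10s, 20s, 40s, ..., capped at six minutes, per the Kubernetes docs), so an exhausted limit takes a while to surface. A rough sum of the default six retries:

```shell
# Retry delay doubles from 10s each attempt, capped at 360s (6 minutes)
total=0
d=10
for attempt in 1 2 3 4 5 6; do
  if [ "$d" -gt 360 ]; then d=360; fi
  total=$((total + d))
  d=$((d * 2))
done
echo "six retries wait roughly ${total}s before BackoffLimitExceeded"
```

So expect 10+ minutes of waiting (on top of the pods' own runtime) before a default Job finally reports Failed.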

Fix 1: Raise backoffLimit above the default of 6 for transient failures:

yaml
spec:
  backoffLimit: 10    # more retries than the default of 6
  template:
    spec:
      restartPolicy: Never   # IMPORTANT: Never or OnFailure, not Always

Fix 2: Fix the underlying error — retrying a broken job wastes resources.

Fix 3: For jobs that should retry on failure but not indefinitely:

yaml
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 300   # kill after 5 minutes regardless

Cause 3: Wrong restartPolicy

Jobs require restartPolicy: Never or restartPolicy: OnFailure. Using Always (the Deployment default) is rejected at apply time:

bash
kubectl apply -f job.yaml
# Error: Job.batch "my-job" is invalid: 
# spec.template.spec.restartPolicy: Unsupported value: "Always"

Fix:

yaml
spec:
  template:
    spec:
      restartPolicy: Never      # pod won't restart — Job creates new pod
      # OR
      restartPolicy: OnFailure  # pod restarts in-place (cheaper)

Use Never when you need to inspect failed pod logs. Use OnFailure for simple retry behavior.


Cause 4: CronJob Suspended

bash
kubectl get cronjob my-cronjob
# NAME         SCHEDULE    SUSPEND   ACTIVE
# my-cronjob   */5 * * * *   True      0      ← suspended!

Fix:

bash
kubectl patch cronjob my-cronjob -p '{"spec":{"suspend":false}}'

Cause 5: CronJob Timezone Issues

CronJob schedules are evaluated by the kube-controller-manager, which runs in UTC on almost every cluster, not in your local timezone.

Symptom: Job triggers at wrong time.

bash
# Quick sanity check: cluster components and containers almost always run in UTC
kubectl run tz-check --rm -it --restart=Never --image=busybox -- date

Fix: Set explicit timezone (Kubernetes 1.27+):

yaml
spec:
  schedule: "0 9 * * *"
  timeZone: "Asia/Kolkata"     # IST; fires at 9 AM IST

For older clusters, convert manually: 9 AM IST = 3:30 AM UTC → schedule: "30 3 * * *"
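For one-off conversions, GNU date (coreutils) can do the arithmetic for you; this assumes a GNU userland (macOS ships BSD date, which uses different flags):

```shell
# What is 9 AM Asia/Kolkata in UTC? (IST = UTC+5:30, no DST)
date -u -d 'TZ="Asia/Kolkata" 2026-01-01 09:00' +%H:%M
# → 03:30, i.e. schedule: "30 3 * * *"
```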


Cause 6: startingDeadlineSeconds Missed

If a CronJob misses its scheduled time (cluster was down, CronJob was suspended), it uses startingDeadlineSeconds to decide whether to catch up.

yaml
spec:
  schedule: "0 * * * *"
  startingDeadlineSeconds: 300    # only start if within 5 minutes of scheduled time
  # if cluster was down for 2 hours, jobs for those hours are SKIPPED

Symptom: Jobs missing after cluster downtime.

Fix: Set startingDeadlineSeconds based on your tolerance:

  • unset (default): the controller counts every missed schedule since the last run; past 100 missed runs it gives up with "too many missed start times" and starts nothing
  • 300: skip the run if it can't be started within 5 minutes of its scheduled time
  • avoid values below ~10 seconds: the controller only checks every 10s, so a tiny deadline can cause every run to be skipped
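The skip decision itself is just an age check against the deadline. A sketch with illustrative numbers (variable names are mine, not the controller's):

```shell
# How the startingDeadlineSeconds check plays out (illustrative epoch seconds)
scheduled=1000                   # second the run was due
now=1400                         # controller notices 400s later
deadline=300                     # startingDeadlineSeconds

if [ $((now - scheduled)) -le "$deadline" ]; then
  decision="start job"
else
  decision="skip run"            # 400s late exceeds the 300s deadline
fi
echo "$decision"
```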

Cause 7: concurrencyPolicy — Jobs Overlapping or Being Skipped

yaml
spec:
  concurrencyPolicy: Forbid    # skip new run if previous is still running
  # Allow   → run multiple jobs concurrently (default)
  # Replace → kill old job, start new one

Symptom with Forbid: scheduled runs are silently skipped while the previous job is still active; once enough missed runs pile up you may see:

bash
kubectl describe cronjob my-cronjob
# Warning: Cannot determine if job needs to be started. Too many missed start times.

Fix: Use Replace if old job hangs:

yaml
spec:
  concurrencyPolicy: Replace
  activeDeadlineSeconds: 600   # kill job after 10 minutes

Cause 8: Job Completes but Pods Are Deleted Too Fast

By default, completed Job pods are kept for debugging. But if ttlSecondsAfterFinished is set too low:

yaml
spec:
  ttlSecondsAfterFinished: 10   # pods deleted 10s after job finishes

You run kubectl logs and get "pod not found."

Fix: Increase TTL or remove it:

yaml
spec:
  ttlSecondsAfterFinished: 3600   # keep pods for 1 hour after completion

Cause 9: Parallel Jobs — Not Enough Completions

For parallel jobs, completions and parallelism must be set correctly:

yaml
spec:
  completions: 10      # need 10 successful pods total
  parallelism: 3       # run 3 at a time
  backoffLimit: 5

Symptom: Job stuck at "5/10" completions — some pods failing, consuming backoff retries.

Fix: Check which pods are failing: kubectl get pods -l job-name=my-job and look for Error/OOMKilled pods.
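If you are on Kubernetes 1.31+ (where Job podFailurePolicy went GA), you can also stop infrastructure disruptions from eating your backoff retries. A sketch; it requires restartPolicy: Never, and the container name and exit code below are hypothetical:

```yaml
spec:
  completions: 10
  parallelism: 3
  backoffLimit: 5
  podFailurePolicy:
    rules:
    - action: Ignore             # node drain/preemption doesn't count against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob            # treat exit code 42 as unrecoverable: fail fast, no retries
      onExitCodes:
        containerName: worker    # hypothetical container name
        operator: In
        values: [42]
```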


Full Working Job Example

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600       # fail after 10 min
  ttlSecondsAfterFinished: 3600    # keep pods 1 hour
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: my-app:v1.2
        command: ["python", "manage.py", "migrate"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: DATABASE_URL
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

Full Working CronJob Example

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
spec:
  schedule: "0 0 * * *"
  timeZone: "Asia/Kolkata"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: my-cleanup:v1.0   # pin a tag; :latest makes runs unreproducible
            command: ["python", "cleanup.py"]

Debug Checklist

bash
# Job not completing?
kubectl describe job <name>           # check Events and Status
kubectl logs -l job-name=<name>       # app errors
kubectl get pods -l job-name=<name>   # pod states
 
# CronJob not triggering?
kubectl describe cronjob <name>       # check SUSPEND, last schedule
kubectl get events --field-selector reason=FailedCreate
 
# Check CronJob history (jobs are named <cronjob>-<timestamp>)
kubectl get jobs | grep <name>
