Kubernetes CronJobs: The Schedule That Survives a Cluster Restart

A Kubernetes CronJob is a controller that creates a Job on a recurring schedule, the same way cron creates processes on a Linux box. The controller watches the schedule, and at each scheduled time it creates a new Job that runs a pod with the developer’s workload. The Job runs the pod to completion, the controller records the result, and the cycle repeats at the next scheduled time. The CronJob is the part of Kubernetes that turns “run this on a schedule” into a first-class resource.

The reason “kubernetes cron job” is its own question and not just “cron in k8s” is that the CronJob is not just a cron line. The CronJob is a controller with its own spec (the schedule, the concurrency policy, the history limits, and, inside the Job template, the restart policy), its own status (the last schedule time, the last successful time, the active jobs), and its own failure modes (missed runs, overlapping jobs, suspended schedules, time zones). The spec is the part the developer should know before they ship the first CronJob.

The short version
The five fields in a CronJob spec that matter
The three concurrency policies that decide what happens on overlap
The two restart policies that decide what happens on failure
The way time zones change what “every hour” means in k8s
The six mistakes that quietly break a CronJob in production
The four patterns that cover 80% of real CronJobs
FAQ

The short version

A CronJob is a YAML manifest with a schedule (a cron expression), a jobTemplate (the Job spec the controller will create), and a few policy fields (concurrency, history, restart, suspend). The controller reads the spec, watches the schedule, and creates a Job at each scheduled time. The Job runs a pod with the developer’s workload. The pod completes (or fails), the controller records the result, and the cycle repeats. The CronJob is the right answer for any scheduled workload that needs to run on a cluster, and the right answer for the workloads that have outgrown crontab.

The five fields in a CronJob spec that matter

A CronJob spec is a small block of YAML, and five fields decide most of what the controller does. The other fields are useful but secondary.

schedule. The cron expression. The format is the standard 5-field cron (minute, hour, day of month, month, day of week). Unless the timeZone field is set, the expression is evaluated in the local time zone of the kube-controller-manager, which is UTC on most clusters but is not guaranteed. The expression can also be a list (0,30 * * * * for minutes 0 and 30 of every hour) or a step (*/15 * * * * for every 15 minutes). The schedule is the heart of the spec.

concurrencyPolicy. The policy that decides what happens when a new Job is scheduled while the previous one is still running. The values are Allow (run the new Job in parallel with the old one), Forbid (skip the new Job if the old one is still running), and Replace (kill the old Job and start the new one). The default is Allow, which is almost never what the developer wants in production.

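startingDeadlineSeconds. The deadline for starting a Job. If the controller cannot create the Job within this many seconds of its scheduled time (because the controller was down or the API server was unavailable), the run counts as missed and is skipped. A pod that is created but stays unschedulable is a separate problem that this field does not cover. The default is no deadline. Without one, the controller starts a single Job for the most recent missed time when it recovers, however late that is, rather than replaying every missed run. It also counts every schedule missed since the last run, and the Kubernetes documentation describes a limit of 100 missed schedules, past which the controller logs an error and, on some versions, does not start the Job.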

successfulJobsHistoryLimit and failedJobsHistoryLimit. The number of completed and failed Jobs to keep in the cluster. The default is 3 for successful and 1 for failed, which is too low for most production setups. The fields are the lever for the cluster’s storage cost, and the developer should set them based on how long the team needs to look back at a Job’s logs.

The five fields are the floor. The other fields (suspend and timeZone in the spec, active and lastScheduleTime in the status) are useful, but the five are the ones the developer has to get right.

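apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-db
spec:
  schedule: "0 2 * * *"              # every day at 02:00
  concurrencyPolicy: Forbid          # skip a run while the previous one is still active
  startingDeadlineSeconds: 300       # give up on a run that is more than 5 minutes late
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: my-backup:latest    # placeholder; pin a tag or digest in production
            # command (not args) so the shell runs even if the image sets an ENTRYPOINT
            command: ["/bin/sh", "-c", "pg_dump \"$DATABASE_URL\" > /backup/db.sql"]
            # DATABASE_URL and a volume mounted at /backup must be supplied separately
          restartPolicy: OnFailure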

The three concurrency policies that decide what happens on overlap

A CronJob that runs every 5 minutes is a CronJob that is going to overlap with itself if any individual run takes more than 5 minutes. The concurrency policy is the field that decides what happens on the overlap. The three policies are:

Allow (default). The new Job starts even if the old one is still running. The two Jobs run in parallel, each with its own pod, each holding its own resources. The policy is fine for a job that is read-only (a metrics scrape, a cache warm, a health check) and dangerous for a job that writes (a database backup, a data migration, a payment reconciliation). The default is the wrong answer for most production workloads.

Forbid. The new Job is skipped if the old one is still running. The skipped run counts as a missed schedule. When the old Job finishes, startingDeadlineSeconds still applies, so the skipped run may start late if it is still inside the deadline; otherwise the controller waits for the next scheduled time. The policy is the right answer for a job that writes to a shared resource (a database, a file system, a third-party API) and that should not run concurrently. The policy is the right answer for most production workloads.

Replace. The old Job is killed (the pod is deleted) and the new Job starts. The policy is the right answer for a job that has a hard deadline (a “must finish by X” requirement) and where the developer is willing to lose a running job to make sure the new one runs on time. The policy is rarely the right answer, because killing a running job usually wastes more time than skipping the new run.

The three policies map to three different operational intents. The developer should pick the policy that matches the workload, not the default.
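As a rough sketch of the Forbid case, the manifest below schedules a run every 5 minutes and simulates a 7-minute workload, so the run due at minute 5 should be skipped while the first one is still going. The name overlap-demo, the busybox image, and the sleep command are illustrative assumptions, not part of any real workload.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: overlap-demo               # hypothetical name
spec:
  schedule: "*/5 * * * *"          # a run is due every 5 minutes
  concurrencyPolicy: Forbid        # swap in Allow or Replace to compare behavior
  startingDeadlineSeconds: 60      # a held-back run is dropped once it is 60s late
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: slow-task
            image: busybox:1.36
            # Simulated 7-minute workload, longer than the 5-minute interval
            command: ["sh", "-c", "echo started; sleep 420; echo done"]
          restartPolicy: Never
```

Changing only the concurrencyPolicy line and watching the Jobs with kubectl get jobs --watch is a quick way to see the three policies side by side.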

The two restart policies that decide what happens on failure

The restart policy is set on the pod template inside the Job spec, and it decides what happens to the container when it exits. A pod accepts three values (Always, OnFailure, Never), but a Job accepts only two of them:

OnFailure (the right answer for most CronJobs). The container is restarted in place if it exits with a non-zero status. The restart is recorded in the pod’s status, and the Job’s backoffLimit (default 6) determines how many retries are allowed before the Job is marked as failed. The policy is the right answer for a job that should be retried on transient failures (a network blip, a database hiccup, a third-party timeout) but should not be retried forever.

Never. The container is not restarted in place. Instead, the Job controller creates a replacement pod for each failure until backoffLimit is exhausted, so a failed run is still retried unless backoffLimit is 0. The failed pods are left behind, which makes their logs easier to inspect. Combined with backoffLimit: 0, the policy is the right answer for a job that should not be retried (a data migration, a one-shot script, a job where the failure is not transient) and where the developer wants to know about the failure immediately.

Always (not valid for CronJobs). The container is restarted regardless of the exit status. The policy is the right answer for a long-running process (a web server, a worker), but the API server rejects it in a Job template, because a pod that restarts on success never runs to completion.

The implicit “the Job runs to completion” model. A Job is a one-shot workload. The pod runs the container, the container exits, the Job is done. The restart policy decides what happens if the container exits before the work is done. The Job is not a long-running process, and the developer should not use a CronJob to run a long-running process.

The restart policy is not the only consideration. The developer should also set backoffLimit on the Job spec to cap the number of retries, and activeDeadlineSeconds to cap the total runtime of the Job. The combination of OnFailure plus a backoff limit is the typical production setup.
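A minimal sketch of the retry caps described above, assuming a placeholder workload that always fails so the limits are easy to observe. The name retry-demo, the busybox image, and the specific values (3 retries, 10 minutes) are illustrative assumptions rather than recommendations.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: retry-demo                 # hypothetical name
spec:
  schedule: "0 * * * *"            # once an hour
  jobTemplate:
    spec:
      backoffLimit: 3              # mark the Job failed after 3 retries
      activeDeadlineSeconds: 600   # and after 10 minutes in total, whichever comes first
      template:
        spec:
          containers:
          - name: flaky-task
            image: busybox:1.36
            # Placeholder that always exits non-zero to exercise the retry path
            command: ["sh", "-c", "echo attempting work; exit 1"]
          restartPolicy: OnFailure # retry in the same pod
```

Note that backoffLimit and activeDeadlineSeconds sit on the Job spec, while restartPolicy sits one level deeper on the pod spec.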

The way time zones change what “every hour” means in k8s

The CronJob schedule is evaluated in the local time zone of the kube-controller-manager, which is UTC on most clusters. The developer can override the time zone with the timeZone field, which accepts a tz database name (e.g. America/Los_Angeles, Europe/London, Asia/Tokyo). The override is the right answer for a job that needs to run at a specific local time, regardless of the cluster’s time zone.

The usual default is UTC. On such a cluster, a CronJob that runs 0 9 * * * will run at 09:00 UTC, every day. For a team in America/Los_Angeles, that is 01:00 PST (or 02:00 PDT during daylight saving). The default is fine for a job that does not have a local-time requirement, and the developer should set the time zone explicitly for a job that does.

The timeZone field. The field accepts a tz database name, and Kubernetes uses the standard tz database. The field is set on the CronJob spec, and the controller uses it to evaluate the schedule. The field has been stable since Kubernetes 1.27 (it was a beta feature, enabled by default, from 1.25), so it is available on managed services that run a recent Kubernetes version.

Daylight saving. A CronJob that runs 0 2 * * * in America/Los_Angeles will run at 02:00 PST or 02:00 PDT, depending on the time of year. On the day of a DST jump, the local day is 23 or 25 hours long, so an hourly job runs 23 or 25 times that day. A daily job scheduled inside the transition window has a different problem: in spring, 02:00 does not exist in local time, so the run can be skipped or shifted, and in autumn the hour before 02:00 occurs twice. The fix is the same as for cron: move the job outside the DST window, or run the job in UTC and convert at the application layer.

The time zone is the part that quietly breaks a schedule that was supposed to be predictable. The developer should set the time zone explicitly, and should know what the cluster’s default is.
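A minimal sketch of an explicit time zone, assuming a cluster recent enough to support the timeZone field. The name morning-report and the placeholder command are illustrative assumptions.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-report             # hypothetical name
spec:
  schedule: "0 9 * * *"            # 09:00 in the zone named below
  timeZone: "America/Los_Angeles"  # 17:00 UTC under PST, 16:00 UTC under PDT
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox:1.36
            # Placeholder workload; prints the container's clock for comparison
            command: ["sh", "-c", "date"]
          restartPolicy: OnFailure
```

The schedule stays readable as local time, and 09:00 sits well clear of the 01:00 to 03:00 window where DST transitions happen in this zone.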

The six mistakes that quietly break a CronJob in production

A short, opinionated list of mistakes that have actually broken real CronJobs in production. None of them are dramatic. They are the boring ones.

The default concurrencyPolicy: Allow. A CronJob that runs every 5 minutes and takes 7 minutes per run is a CronJob that has 2 pods running for 2 of every 5 minutes, and more if contention makes each run slower. The pods are using cluster resources, holding database connections, and writing to shared storage. The fix is concurrencyPolicy: Forbid, which skips the new run if the old one is still going.

The default successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 1. A CronJob that runs every 5 minutes is a CronJob that produces 288 Jobs per day. The default limits keep 3 successful and 1 failed Job, which is fine for the cluster’s storage but terrible for the team’s ability to look at last week’s logs. The fix is to raise the limits to a number that matches the team’s debugging workflow (e.g. 100 successful, 100 failed).

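The startingDeadlineSeconds is not set. A cluster that is overloaded or that has just recovered from a control-plane outage is a cluster that may not be able to start a new Job for several minutes. Without startingDeadlineSeconds, the controller has no cutoff: when it recovers, it starts the most recent missed run no matter how late it is, and it counts every schedule missed since the last run (the Kubernetes documentation describes a limit of 100, past which the Job may not start at all). The fix is to set startingDeadlineSeconds to a value that matches how late a run can start and still be useful. The value should stay well above 10 seconds, because the controller only checks the schedule roughly every 10 seconds.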

The CronJob is not idempotent. A CronJob that runs twice in the same window (because of Allow, because of a manual trigger, because of a deploy) is a CronJob that will produce duplicate data. The fix is to make the job idempotent — the job should check whether the work has already been done before doing it, or should use a unique key that prevents duplicates.

The image is not pinned. A CronJob that uses image: my-app:latest is a CronJob that will run a different image every time the image is rebuilt. The fix is to pin the image to a specific tag (my-app:v1.2.3) or to a specific digest (my-app@sha256:abc123...). The pinned image is the part that makes the CronJob reproducible.

The restartPolicy is not set on the pod template. A pod’s restartPolicy defaults to Always, which a Job does not accept, so a Job template without an explicit restartPolicy is rejected when the manifest is applied. The fix is to set restartPolicy: OnFailure (or Never) on the pod template inside the Job template.

The four patterns that cover 80% of real CronJobs

A short, opinionated list of patterns that show up in real Kubernetes clusters. The patterns are not the only ones, but they are the ones the developer should learn first.

The database backup pattern. A CronJob that runs pg_dump (or mysqldump, or mongodump) every night, writes the dump to an S3 bucket, and sends a Slack notification on success or failure. The pattern is the most common one in real clusters, and the one that every team with a database eventually ships.

The data cleanup pattern. A CronJob that runs a SQL query to delete old rows, vacuum the table, and update the indexes. The pattern is the right answer for a table that grows unbounded (a logs table, an events table, a sessions table), and the one that prevents the database from running out of disk.

The report-generation pattern. A CronJob that runs a query, formats the results as a CSV or PDF, and emails the report to a list of recipients. The pattern is the right answer for a daily metrics report, a weekly sales summary, or a monthly billing run.

The cache-warming pattern. A CronJob that runs a script to precompute a cache (a materialized view, a Redis hash, a CDN purge) on a schedule. The pattern is the right answer for a workload where the cache takes too long to compute on the first request, and the developer wants the cache to be warm before the first user hits it.

The four patterns are the floor. There are also patterns for log rotation, certificate renewal, secret rotation, snapshot cleanup, and many more. The four are the ones the developer should learn first, and the ones the team should have in the cluster’s template repository.
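As one possible template for the data cleanup pattern, the sketch below deletes rows older than 30 days once a night. The table name events, the column created_at, the Secret cleanup-db with a url key, and the retention window are all hypothetical; only the overall shape is the point.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-events             # hypothetical name
spec:
  schedule: "30 3 * * *"           # nightly at 03:30
  concurrencyPolicy: Forbid        # never run two deletes at once
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: cleanup
            image: postgres:16     # used only for the psql client
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: cleanup-db # hypothetical Secret holding the connection URL
                  key: url
            command:
            - /bin/sh
            - -c
            - |
              # Deleting by age is idempotent: a second run finds nothing left to delete
              psql "$DATABASE_URL" -v ON_ERROR_STOP=1 \
                -c "DELETE FROM events WHERE created_at < now() - interval '30 days'"
          restartPolicy: OnFailure
```

The delete is keyed on row age rather than on the run itself, so a duplicate or late run does no extra harm.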

How this fits the rest of the stack

A CronJob rarely lives in isolation. The job usually calls an API, queries a database, writes to a file, or sends a notification. The platform that handles the cluster should make the rest of the stack feel like part of the same conversation.

The services layer is the part of the platform that runs the long-lived API the CronJob calls. The database layer is the part that holds the data the job reads and writes. The static layer is the part that hosts the dashboard where the job’s status is reported. The environment variables are the part that holds the secrets the job reads at runtime.

A CronJob on a platform where the API, the database, the storage, and the secrets are all in the same place is a job the team is going to be able to debug. A CronJob on a platform where each piece is in a different console is a job the team is going to spend the first hour just opening the right tab.

For a team that wants to see the full cost of the project before it commits, the RunxBuild hosting calculator shows the line items together. The API, the database, the storage, the worker, the bandwidth — each one is a separate number, and the team’s mental model for the platform is the sum of those numbers.

FAQ

What is a Kubernetes CronJob?

A Kubernetes CronJob is a controller that creates a Job on a recurring schedule, the same way cron creates processes on a Linux box. The controller watches the schedule, and at each scheduled time it creates a new Job that runs a pod with the developer’s workload. The Job runs the pod to completion, the controller records the result, and the cycle repeats at the next scheduled time.

How do I run a CronJob every 5 minutes in Kubernetes?

Use the schedule */5 * * * *. The expression is the same as the standard cron expression, and Kubernetes evaluates it in the kube-controller-manager’s local time zone (UTC on most clusters) unless timeZone is set. The developer can set the time zone with the timeZone field if the job needs to run in a specific local time.

What is the difference between a CronJob and a Job in Kubernetes?

A Job is a one-shot workload. The pod runs the container, the container exits, the Job is done. A CronJob is a controller that creates Jobs on a schedule. The CronJob is the right answer for a recurring workload (a backup, a cleanup, a report). The Job is the right answer for a one-off workload (a migration, a one-time script).

How do I prevent a CronJob from running concurrent jobs?

Set concurrencyPolicy: Forbid on the CronJob spec. The policy skips the new Job if the old one is still running, and the skipped run is recorded as a missed schedule. The policy is the right answer for a job that writes to a shared resource (a database, a file system, a third-party API) and that should not run concurrently.

How do I keep CronJob history in Kubernetes?

Set successfulJobsHistoryLimit and failedJobsHistoryLimit on the CronJob spec. The fields control how many completed and failed Jobs the controller keeps in the cluster. The defaults (3 and 1) are usually too low for production. The developer should raise the limits to a number that matches the team’s debugging workflow.

What happens if a CronJob is missed?

A missed run is a run that the controller was unable to start at its scheduled time. If startingDeadlineSeconds is set and the deadline has passed, the run is skipped and the controller moves on to the next scheduled time. If the run is still inside the deadline, or no deadline is set, the controller starts one Job for the most recent missed time. The controller does not replay every missed run, whatever the concurrencyPolicy, so the job itself should be able to cover the gap. Skips typically surface as events on the CronJob (visible with kubectl describe cronjob <name>) rather than as a counter in its status.
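The Kubernetes documentation illustrates the deadline with a schedule that fires every minute. A minimal sketch of that configuration, with a hypothetical name and a placeholder command, might look like this:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: every-minute               # hypothetical name
spec:
  schedule: "* * * * *"            # a run is due every minute
  # Only schedules missed in the last 200 seconds are considered (about three
  # at one-minute spacing), so after a long outage the controller can still
  # start the most recent run instead of looking at the whole backlog.
  startingDeadlineSeconds: 200
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: tick
            image: busybox:1.36
            command: ["sh", "-c", "date"]   # placeholder workload
          restartPolicy: OnFailure
```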

Can a CronJob run in a specific time zone?

Yes. Set the timeZone field on the CronJob spec to a tz database name (e.g. America/Los_Angeles, Europe/London, Asia/Tokyo). The controller will evaluate the schedule in the specified time zone. The field has been stable since Kubernetes 1.27 (beta and enabled by default from 1.25), so it is available on managed services that run a recent Kubernetes version.

How do I delete a CronJob in Kubernetes?

Use kubectl delete cronjob <name>. The command removes the CronJob, which stops it from creating new Jobs. By default it also removes the Jobs and pods the CronJob created, because they are owned by the CronJob and are garbage-collected with it. To keep them, pass --cascade=orphan and clean them up manually later. To pause a schedule without deleting anything, set suspend: true on the CronJob spec instead.