Version 7 (modified by Carl Baribault, 2 days ago)

HPC Workshop Spring 2026

Module 8 of 8 - Job Checkpointing

Why use an HPC cluster?

  • tasks take too long
    • When a task becomes computationally heavy, the work is typically offloaded from the local laptop or desktop to a remote system.
    • Your computation may execute more efficiently if the code supports multithreading or multiprocessing.

  • one server is not enough
    • When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.

What is job checkpointing?

Job checkpointing means ensuring, or programming, that the job's application software can save partial results and resume processing after the job is terminated at the end of its requested walltime (walltime termination).

A checkpointed job must be able to perform the following.

  • The application must record its progress (described in the bullets further down) at one or both of the following times.
    • At regular time intervals on its own (preferred for sufficiently long-running jobs, where system crashes are more likely), or
    • After catching a termination signal (SIGTERM) from the operating system, where the job script arranges for the signal to arrive early enough for recording to complete before walltime termination. For example, in your job script…
      • either via sbatch directives
           # --- Append to output and error files ---
           #SBATCH --open-mode=append
           # --- Make the job eligible for requeue ---
           #SBATCH --requeue
           # --- Send SIGTERM to the job steps about 2 minutes before walltime ---
           # --- (use B:TERM@120 to signal the batch shell instead) ---
           #SBATCH --signal=TERM@120
        
      • or the timeout command followed by a requeue
        # timeout exits with status 124 when the limit is reached; requeue only then
        timeout 23h ./my_simulation; [ $? -eq 124 ] && scontrol requeue "$SLURM_JOB_ID"
        
  • The application must record both the work already performed and the current (or most recent) state of execution.
  • When the job is requeued, the application must read the recorded state and resume from that point of execution with the previous work preserved.
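
The requirements above can be sketched as a minimal bash job script. This is an illustration under stated assumptions, not a definitive implementation: the names CKPT_FILE, TOTAL_STEPS, save_checkpoint, and run_steps are hypothetical, the running sum stands in for real work, and the trap only fires if SIGTERM reaches the batch shell (with Slurm, for example, by requesting --signal=B:TERM@120).

```shell
#!/bin/bash
# Hypothetical names for illustration: CKPT_FILE, TOTAL_STEPS, run_steps.
CKPT_FILE="${CKPT_FILE:-./demo_checkpoint.state}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"
stop_requested=0

# On SIGTERM, only set a flag; the loop finishes the current step first.
trap 'stop_requested=1' TERM

save_checkpoint() {
    # Write to a temporary file, then rename it, so an interrupted write
    # cannot leave a truncated checkpoint behind.
    printf '%s %s\n' "$1" "$2" > "${CKPT_FILE}.tmp"
    mv "${CKPT_FILE}.tmp" "$CKPT_FILE"
}

run_steps() {
    local step=0 sum=0
    # Resume: read the recorded state if a checkpoint exists.
    if [ -f "$CKPT_FILE" ]; then
        read -r step sum < "$CKPT_FILE"
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        sum=$((sum + step))              # stand-in for real work
        save_checkpoint "$step" "$sum"   # record work done and current state
        if [ "$stop_requested" -eq 1 ]; then
            return 1                     # interrupted; state is on disk
        fi
    done
    echo "done: steps=$step sum=$sum"
    rm -f "$CKPT_FILE"                   # finished; no checkpoint needed
    return 0
}

if ! run_steps; then
    # Inside a Slurm job, ask for a requeue so the next run resumes.
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
fi
```

If the real work runs as a separate program rather than inside the shell loop, bash defers the trap until that foreground program returns, so the program would need to catch SIGTERM itself (or be started in the background and followed by wait).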

Why is job checkpointing important and beneficial?

  • Checkpointed jobs can request a shorter run time, which lets them leave the pending state of the job queue and start sooner. (See "backfill scheduling" in the Scheduling Configuration Guide.)
  • Checkpointed jobs compensate for strict walltime limits enforced by Cypress, LONI, and most other production clusters. (See SLURM (resource manager).)
  • Parallel MPI jobs can fail as soon as a single node in use crashes. Checkpointed jobs (especially long-running jobs recording at regular intervals) can resume from the last recorded state instead of starting over.
  • Jobs running in certain cloud-based job queues with high availability can be subject to strictly enforced job pre-emption (SIGTERM signals). Checkpointed jobs can record their state and resume after being pre-empted.

What are the impacts of job checkpointing?

  • Job scripts require provision for requeuing after termination due to timing out or pre-emption.
  • Application software without built-in checkpointing requires additional programming effort.
  • Storing the complete program state at regular intervals during a run requires additional time and I/O resources.
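
The time cost of regular checkpoints can be estimated with simple arithmetic. The sketch below assumes a fixed write time per checkpoint and a fixed interval between checkpoints; both figures and the function name overhead_pct are hypothetical and should be replaced with measurements from your own application.

```shell
#!/bin/bash
# All figures below are hypothetical and chosen only for illustration.
ckpt_write_s=30        # seconds needed to write one checkpoint
ckpt_interval_s=1800   # seconds of computation between checkpoints

overhead_pct() {
    # Percentage of each compute-plus-write cycle spent writing,
    # computed in integer tenths of a percent (rounded down).
    local write_s=$1 interval_s=$2
    local tenths=$(( 1000 * write_s / (write_s + interval_s) ))
    printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

echo "checkpoint overhead: $(overhead_pct "$ckpt_write_s" "$ckpt_interval_s")%"
```

With these example figures the script reports an overhead of 1.6%. Shortening the interval reduces the work lost after a crash but raises this percentage.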

Software with built-in job checkpointing

See Software With Built-in Checkpointing

Job checkpointing examples

See Job Checkpointing Examples
