[[PageOutline]]

= HPC Workshop Spring 2026 =

= Module 8 of 8 - Job Checkpointing (Under construction) =

(content subject to change prior to the workshop)

== What is Job Checkpointing? ==

Job checkpointing means saving enough of a job's state that, after the job is terminated by reaching the end of its requested walltime or by a system outage, it can resume processing without repeating some or all of the work already performed.

== Why is Job Checkpointing important? ==

 * Checkpointed jobs can request a shorter run time and therefore leave the pending state of the job queue sooner. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
 * Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
 * A parallel MPI job can fail as soon as a single node in use crashes.
 * Cloud-based clusters with high availability can require jobs to run in pre-emptible job queues.

== Impacts of Job Checkpointing ==

 * Job scripts must provide for requeuing after termination due to timing out or pre-emption.
 * Applications without built-in checkpointing require additional programming effort.
 * Storing the complete program state at regular intervals during the run requires additional time and I/O resources.

== Why use an HPC Cluster? ==

 * '''tasks take too long'''
   * When a task becomes computationally heavy, the work is typically moved off the local laptop or desktop to a more capable system.
   * Your computation may execute more efficiently if the code supports multithreading or multiprocessing.
 * '''one server is not enough'''
   * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.

== Job Checkpointing examples ==

 * TBD
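As a minimal sketch of application-level checkpointing (an illustration under stated assumptions, not a definitive implementation), the Python example below saves a loop's state at regular intervals and resumes from the last saved state when restarted. The file format (JSON), the function names `load_checkpoint`, `save_checkpoint`, and `run`, and the `stop_after` parameter used to simulate a walltime limit are all hypothetical choices made for this example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically replace the old checkpoint.

    Writing to a temporary file first means a crash during the write
    cannot leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `stop_after` (hypothetical, for demonstration only) ends the run after
    that many steps to imitate a job hitting its walltime limit.
    Returns the final total, or None if the run was cut short.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated termination; work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run("checkpoint.json", 100, stop_after=35)` and then `run("checkpoint.json", 100)` imitates a job that is terminated and then restarted: the second call repeats only the steps performed after the last checkpoint. The checkpoint interval is the trade-off noted under the impacts above: more frequent checkpoints cost more time and I/O but lose less work on termination.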