[[PageOutline]]
= HPC Workshop Spring 2026 =
= Module 8 of 8 - Job Checkpointing =

== Why use an HPC Cluster? ==
 * '''Tasks take too long'''
   * When a task becomes computationally heavy, the work is typically offloaded from the local laptop or desktop to a remote system.
   * Your computation may execute more efficiently if the code supports multithreading or multiprocessing.
 * '''One server is not enough'''
   * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.

== What is job checkpointing? ==
=== Recovery from logical or external interruption ===
Job checkpointing is a job's ability to record its progress so that it can recover after being interrupted, either by its own logic or externally.

== Pros of job checkpointing ==
 * '''Checkpointed jobs can get started sooner''' out of the job queue pending state because they request a reduced run time. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
 * '''More checkpointed jobs can run simultaneously''' due to the strict wall-time and node limits enforced by Cypress, LONI, and most other production clusters.
   * See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].
   * See also the command
     * '''sacctmgr show qos format=Name,!MaxWall,!MaxNodesPerUser | grep -E "normal|long"'''
 * '''Checkpointing mitigates job failures due to node crashes''' - especially for long-running parallel MPI jobs.
 * '''Checkpointed jobs can handle frequent job pre-emption''' - especially for certain cloud-based job queues with high availability.

== Cons of job checkpointing ==
Checkpointing requires a level of coordination between the job script and the job's application software in order to perform the following. See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples].
 * '''Recording execution progress'''...
   * at regular time intervals '''and/or'''
   * after catching a terminate signal ('''SIGTERM''') from the operating system.
 * '''Execution progress includes'''...
   * the intermediate results '''and'''
   * the current '''state of execution''' or '''state'''.
 * '''Requeuing''' itself (or otherwise re-submitting) as needed after interruption.
 * '''Resuming''' from the recorded state after being restarted from the job queue.

== Software with built-in job checkpointing ==
See [wiki:Workshops/JobCheckpointing/SoftwareWithBuiltinCheckpointing Software With Built-in Checkpointing]

== Job checkpointing examples ==
See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples]
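
As a rough illustration of the record and resume steps listed above, here is a minimal Python sketch. It is not taken from the linked examples page. The workload (summing integers), the file name `checkpoint.json`, and the helper names `save_checkpoint`, `load_checkpoint`, and `run` are all hypothetical. The sketch assumes the application can be stopped between loop iterations and that the scheduler delivers '''SIGTERM''' before killing the job, which depends on the cluster's configuration. Requeuing is not shown and is left to the job script.

```python
import json
import os
import signal
import tempfile
import threading

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name
stop = threading.Event()  # set when SIGTERM has been received


def handle_sigterm(signum, frame):
    """Only raise a flag; the main loop saves state at a safe point."""
    stop.set()


# Register the handler once, in the main thread.
signal.signal(signal.SIGTERM, handle_sigterm)


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so an interruption during the write cannot
    leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the recorded state, or a fresh state on the first run."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "partial_sum": 0}


def run(total_steps, checkpoint_every=100, path=CHECKPOINT_FILE):
    """Toy workload: sum the integers 0 .. total_steps - 1.

    Returns True if the work finished, or False if it stopped early
    after SIGTERM (progress is saved in both cases).
    """
    state = load_checkpoint(path)  # resume from the recorded state
    while state["next_step"] < total_steps:
        if stop.is_set():
            save_checkpoint(state, path)  # record progress on interruption
            return False
        state["partial_sum"] += state["next_step"]  # intermediate result
        state["next_step"] += 1                     # state of execution
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)  # record progress at intervals
    save_checkpoint(state, path)
    return True
```

A job could call `run(...)` and exit with a non-zero status when it returns `False`, so that the job script can decide whether to requeue. Checkpointing by step count instead of wall-clock time keeps the sketch short; a real application would choose the interval to balance lost work against I/O cost.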