[[PageOutline]]
= HPC Workshop Spring 2026 =
= Module 8 of 8 - Job Checkpointing =

== Why use an HPC cluster? ==
* '''tasks take too long'''
 * When a task becomes computationally heavy, the work is typically moved from the local laptop or desktop to a remote system.
 * Your computation may execute more efficiently if the code supports multithreading or multiprocessing. 
 
* '''one server is not enough'''
 * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.

== What is job checkpointing? ==
=== Recovery from logical or external interruption ===
Job checkpointing is a job's ability to save its progress so that it can recover after being interrupted, either by its own logic or by an external event.

== Pros of job checkpointing ==

* '''Checkpointed jobs can start sooner''' from the pending state in the job queue because they request a shorter run time. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
* '''More checkpointed jobs can run simultaneously''' because of the strict run-time and node limits enforced by Cypress, LONI, and most other production clusters.
 * See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].
 * See also the command
   * '''sacctmgr show qos format=Name,!MaxWall,!MaxNodesPerUser | grep -E "normal|long"'''
* '''Checkpointing mitigates job failures due to node crashes''' - especially for long-running parallel MPI jobs.
* '''Checkpointed jobs can handle frequent job pre-emption''' - especially for certain cloud-based job queues with high availability.

== Cons of job checkpointing ==

Checkpointing requires a level of coordination between the job script and the job's application software to do the following. See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples].

* '''Recording execution progress'''...
 * at regular time intervals '''and/or'''
 * after catching a termination signal ('''SIGTERM''') from the operating system.
 * '''Execution progress includes'''...
  * the intermediate results '''and'''
  * the current '''state of execution''' or '''state'''.
* '''Requeuing''' the job (or otherwise resubmitting it) as needed after an interruption.
* '''Resuming''' from the recorded state after the job is restarted from the job queue.
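
As a minimal sketch of the steps above, assuming a Python application whose work can be split into numbered steps, the following records progress at regular intervals and after catching SIGTERM, then resumes from the recorded state on the next run. The file name `checkpoint.json`, the function names, and the summing loop are hypothetical stand-ins for a real application. Requeuing is not shown; it is left to the job script (for example, with `scontrol requeue`).

```python
import json
import os
import signal
import tempfile
import threading

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name

# Set by the signal handler; checked by the main loop after every step.
stop = threading.Event()


def install_sigterm_handler():
    """Ask the loop to stop cleanly when the job receives SIGTERM."""
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())


def save_state(path, state):
    """Write the state to a temporary file, then rename it over the old
    checkpoint, so a crash mid-write never leaves a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_state(path):
    """Return the recorded state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100):
    """Resume from the checkpoint and work until done or interrupted.

    Returns True if all steps finished, False if interrupted.
    """
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        # Record progress at regular intervals and on interruption.
        if stop.is_set() or state["step"] % checkpoint_every == 0:
            save_state(path, state)
        if stop.is_set():
            return False
    save_state(path, state)
    return True


if __name__ == "__main__":
    install_sigterm_handler()
    finished = run(CHECKPOINT_FILE, n_steps=1000)
    print("finished" if finished else "interrupted; state saved")
```

The state holds both the intermediate result (`total`) and the state of execution (`step`), so a restarted job continues where the previous run stopped instead of starting over. A job script could use the printed message, or an exit code, to decide whether to requeue.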

== Software with built-in job checkpointing ==

See [wiki:Workshops/JobCheckpointing/SoftwareWithBuiltinCheckpointing Software With Built-in Checkpointing]

== Job checkpointing examples ==
See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples]