Changes between Version 4 and Version 5 of Workshops/JobCheckpointing

Timestamp:: 01/21/2026 12:36:24 PM (7 weeks ago)
Author:: Carl Baribault
Comment:: Moved "why...HPC" first, added links for software-with-built-in, job-examples

Workshops/JobCheckpointing

-              v4
+              v5
 (content subject to change prior to the workshop)
-== What is Job Checkpointing? ==
-Job checkpointing is the process of ensuring or providing for the job's ability to resume processing - without repeating some or all of the processing already performed after termination due to either end of requested walltime or system outage.
-== Why is Job Checkpointing important? ==
-* Checkpointed jobs can get started sooner out of the job queue pending state with a reduced requested run time. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html|SLURM Scheduling Configuration Guide].
-* Most production clusters enforce strict walltime limits. (See see [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
-* A parallel MPI job can fail as soon as a single node in use crashes.
-* Cloud-based job queues with high availability can enforce the use of pre-emptible job queues.
-== Impacts of Job Checkpointing ==
-* Job scripts require provision for requeuing after termination due to timing out or pre-emption.
-* Applications without built-in checkpointing require additional programming effort.
-* Runtime storing of complete program state at regular intervals requires additional time and I/O resources.
 == Why use an HPC Cluster? ==
 * '''tasks take too long'''
 …
  * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.
+== Job Checkpointing examples ==
+* TBD
+== What is job checkpointing? ==
+Job checkpointing is the practice of giving a job the ability to resume after it is terminated, whether by reaching the end of its requested walltime or by a system outage, without repeating some or all of the processing it has already performed.
+== Why is job checkpointing important? ==
+* Checkpointed jobs can request a shorter run time and can therefore leave the pending state of the job queue and start sooner. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
+* Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
+* A parallel MPI job can fail as soon as a single node in use crashes.
+* Cloud-based clusters with high availability can enforce the use of pre-emptible job queues.
+== Impacts of job checkpointing ==
+* Job scripts require provision for requeuing after termination due to timing out or pre-emption.
+* Application software without built-in checkpointing requires additional programming effort.
+* Storing the complete program state at regular intervals during the run requires additional time and I/O resources.
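The impacts listed above can be illustrated with a minimal sketch of application-level checkpointing in Python. It is not taken from the workshop materials. The function names, the JSON checkpoint format, and the toy workload (summing integers) are all hypothetical. The sketch also assumes that the scheduler delivers SIGTERM to the process shortly before the job is killed, which depends on site configuration.

```python
import json
import os
import signal
import tempfile

stop_requested = False  # set to True once a stop signal has been received


def handle_sigterm(signum, frame):
    """Record that the job was asked to stop (hypothetical handler)."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Register the handler; call this once from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it over `path`.

    The rename is atomic, so a job killed mid-write leaves the previous
    checkpoint intact instead of a truncated file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(total_steps, path, checkpoint_every=100):
    """Sum the integers below `total_steps`, resuming from `path` if present.

    Returns the total when finished, or None if a stop was requested.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], total_steps):
        state["total"] += step  # stand-in for one unit of real work
        state["next_step"] = step + 1
        if stop_requested or state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
        if stop_requested:
            return None  # interrupted; rerunning the job resumes from here
    save_checkpoint(state, path)
    return state["total"]
```

In a real job, `install_handler()` would be called once before `run(...)`, and the job script would resubmit or requeue the job whenever `run` returns `None`. The `checkpoint_every` value trades the amount of repeated work against the extra time and I/O noted in the last bullet.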
+== Software with built-in job checkpointing ==
+See [wiki:Workshops/JobCheckpointing/SoftwareWithBuiltinCheckpointing Software With Built-in Checkpointing]
+== Job checkpointing examples ==
+See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples]