HPC Workshop Spring 2026
Module 8 of 8 - Job Checkpointing
Why use an HPC cluster?
- Tasks take too long
  - When a task becomes computationally heavy, the work is typically offloaded from the local laptop or desktop to a remote system.
  - Your computation may execute more efficiently if the code supports multithreading or multiprocessing.
- One server is not enough
  - When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.
What is job checkpointing?
Recovery from logical or external interruption
Job checkpointing is a job's ability to recover after being interrupted, either by its own logic or by an external event.
Pros of job checkpointing
- Checkpointed jobs can request a shorter run time, so they can leave the job queue's pending state and start sooner. (See "backfill scheduling" in the Scheduling Configuration Guide.)
- More checkpointed jobs can run simultaneously, because they fit within the strict limits (such as maximum wall time and maximum nodes per user) enforced by Cypress, LONI, and most other production clusters.
  - See --qos=normal in SLURM (resource manager).
  - See also the command
    - sacctmgr show qos format=Name,MaxWall,MaxNodesPerUser | grep -E "normal|long"
- Checkpointing mitigates job failures due to node crashes - especially for long-running parallel MPI jobs.
- Checkpointed jobs can handle frequent job pre-emption - especially for certain cloud-based job queues with high availability.
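As a rough illustration of the first two points, a long computation can be split into segments that each fit under a wall-time limit. The sketch below uses hypothetical numbers (72 hours of total work and a 24-hour limit). The function name segments_needed is also made up for this example. The real limits on a given cluster are whatever the sacctmgr command above reports.

```bash
# Ceiling division: number of job segments needed to cover the total run time.
# $1 = total run time in hours, $2 = wall-time limit per segment in hours
# (both values are hypothetical inputs, not limits of any real cluster).
segments_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

segments_needed 72 24
```

Each segment would then be submitted with a requested run time no longer than the limit, resuming from the checkpoint left by the previous segment.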
Cons of job checkpointing
Checkpointing requires coordination between the job script and the job's application software in order to perform the following. See Job Checkpointing Examples.
- Recording execution progress...
  - at regular time intervals and/or
  - after catching a termination signal (SIGTERM) from the operating system.
- Execution progress includes...
  - the intermediate results and
  - the current state of execution.
- Requeuing the job (or otherwise resubmitting it) as needed after interruption.
- Resuming from the recorded state after being restarted from the job queue.
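The steps above can be sketched as a single SLURM job script. This is a minimal sketch under stated assumptions, not a definitive implementation. The file name checkpoint.state, the variables STATE_FILE and TOTAL_STEPS, and the summing loop that stands in for real work are all hypothetical. Checkpointing after every step stands in for "regular time intervals". Whether --requeue and scontrol requeue are permitted depends on the cluster's configuration.

```bash
#!/bin/bash
#SBATCH --qos=normal
#SBATCH --time=00:10:00        # hypothetical short run time per segment
#SBATCH --requeue              # permit this job to be requeued
#SBATCH --signal=B:TERM@60     # send SIGTERM to the batch shell 60 s before the time limit
#SBATCH --open-mode=append     # append to the output file across restarts

STATE_FILE="${STATE_FILE:-checkpoint.state}"   # hypothetical checkpoint file name
TOTAL_STEPS="${TOTAL_STEPS:-20}"               # hypothetical amount of work
interrupted=0

# Record progress: last completed step (state) and running total (intermediate result).
save_state() {
    printf '%s %s\n' "$1" "$2" > "$STATE_FILE.tmp"
    mv "$STATE_FILE.tmp" "$STATE_FILE"   # rename, so a partial write never replaces a good checkpoint
}

# Resume from the recorded state if a checkpoint exists, otherwise start from zero.
load_state() {
    step=0
    total=0
    if [ -f "$STATE_FILE" ]; then
        read -r step total < "$STATE_FILE"
    fi
}

# Catch SIGTERM and let the main loop stop cleanly after the current step.
trap 'interrupted=1' TERM

run_job() {
    load_state
    while [ "$step" -lt "$TOTAL_STEPS" ] && [ "$interrupted" -eq 0 ]; do
        step=$((step + 1))
        total=$((total + step))          # stand-in for real work
        save_state "$step" "$total"      # checkpoint after every step
    done
    if [ "$step" -lt "$TOTAL_STEPS" ]; then
        echo "interrupted at step $step"
        # Requeue only when running under SLURM with scontrol available.
        if [ -n "${SLURM_JOB_ID:-}" ] && command -v scontrol >/dev/null 2>&1; then
            scontrol requeue "$SLURM_JOB_ID"
        fi
        return 1
    fi
    echo "finished: total=$total"
}

run_job
```

With a real application, note that bash runs a trap handler only after the current foreground command returns. A long-running program is therefore usually started in the background and followed by wait, so that the signal is handled promptly.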
Software with built-in job checkpointing
See Software With Built-in Checkpointing
Job checkpointing examples
