HPC Workshop Spring 2026
Module 8 of 8 - Job Checkpointing (Under construction)
(content subject to change prior to the workshop)
Why use an HPC Cluster?
- Tasks take too long: when a task becomes computationally heavy, the work is typically moved off the local laptop or desktop. Code that supports multithreading or multiprocessing may also run more efficiently on a cluster.
- One server is not enough: when a single computer cannot handle the required computation or analysis, the work is spread across a larger group of servers.
What is job checkpointing?
Job checkpointing is the practice of saving a job's state so that, after termination - whether because the requested walltime ran out or because of a system outage - the job can resume processing without repeating some or all of the work already done.
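The idea can be sketched as a loop that saves its progress to a file after each step and, on restart, resumes from the last saved step. This is a minimal illustration only; the file name `counter.ckpt` and the step count are arbitrary, and the loop counter stands in for real program state:

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch. The loop counter stands in for real
# program state; the file name counter.ckpt is an arbitrary choice.
CKPT=counter.ckpt

# On (re)start, resume one step past the last saved step, if any.
start=1
if [ -f "$CKPT" ]; then
    start=$(( $(cat "$CKPT") + 1 ))
fi

for (( i = start; i <= 10; i++ )); do
    echo "working on step $i"      # placeholder for the real computation
    echo "$i" > "$CKPT"            # checkpoint: save state after each step
done
```

If the script is killed partway through and rerun, it picks up where the checkpoint file left off rather than starting over from step 1.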
Why is job checkpointing important?
- Because a checkpointed job can request a shorter run time, it can leave the pending state and start sooner. (See "backfill scheduling" in the Scheduling Configuration Guide.)
- Most production clusters enforce strict walltime limits. (See SLURM (resource manager).)
- A parallel MPI job can fail as soon as a single node in use crashes.
- Cloud-based services may require the use of pre-emptible job queues, in which running jobs can be interrupted at any time.
Impacts of job checkpointing
- Job scripts need provisions for requeuing after termination due to a timeout or pre-emption.
- Application software without built-in checkpointing requires additional programming effort.
- Saving the complete program state at regular intervals costs additional run time and I/O resources.
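The requeuing provision can be sketched as follows, assuming the cluster runs SLURM: ask the scheduler to send a warning signal shortly before the walltime expires, trap it, checkpoint, and put the job back in the queue. The application name `./my_app`, its `--restart` flag, and the 60-second warning margin are placeholders, not real software:

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --requeue
#SBATCH --signal=B:USR1@60   # send USR1 to the batch shell 60 s before timeout

# On the warning signal, record that a restart is needed and requeue the job.
trap 'touch need_restart; scontrol requeue "$SLURM_JOB_ID"; exit 0' USR1

# Resume from a checkpoint if one exists (application-specific; the
# --restart option shown here is hypothetical).
if [ -f need_restart ]; then
    ./my_app --restart &
else
    ./my_app &
fi
wait   # run the app in the background and wait, so the trap can fire promptly
```

Running the application in the background and calling `wait` matters here: bash delays trap handlers while a foreground command is running, but `wait` is interruptible, so the requeue happens as soon as the signal arrives.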
Software with built-in job checkpointing
See Software With Built-in Checkpointing
Job checkpointing examples
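As a first, self-contained example, a script can trap an early-termination signal, write its state, and exit cleanly. In a real job the signal would come from the scheduler; here the script signals itself mid-run purely for demonstration, and the file name `demo.ckpt` is arbitrary:

```shell
#!/bin/bash
# Sketch of reacting to an early-termination warning: on SIGUSR1, save the
# last completed step and exit cleanly instead of being killed mid-flight.
STATE=0

save_and_exit() {
    echo "$STATE" > demo.ckpt        # write the current step to the checkpoint
    echo "checkpointed at step $STATE"
    exit 0
}
trap save_and_exit USR1

# Simulated work loop; in a real job this would run until near walltime.
for STATE in 1 2 3 4 5; do
    if [ "$STATE" -eq 3 ]; then
        kill -USR1 $$                # demo only: signal ourselves mid-run
    fi
done
```

A restarted job would then read `demo.ckpt` and continue from the saved step, as in the resume sketch earlier in this module.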
