| 3 | | = Module 8 of 8 - Job Checkpointing = |
| 4 | | Coming soon |
| | 3 | = Module 8 of 8 - Job Checkpointing (Under construction) = |
| | 4 | (content subject to change prior to the workshop) |
| | 5 | |
| | 6 | == What is Job Checkpointing? == |
| | 7 | Job checkpointing is the practice of saving enough of a job's state that it can resume after termination, whether caused by reaching the requested walltime or by a system outage, without repeating some or all of the processing already performed. |
| | 8 | |
| | 9 | == Why is Job Checkpointing important? == |
| | 10 | |
| | 11 | * Checkpointed jobs can request a shorter run time and may therefore leave the pending state of the job queue sooner. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html|SLURM Scheduling Configuration Guide].) |
| | 12 | * A parallel MPI job can fail as soon as a single node in use crashes. |
| | 13 | * Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].) |
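One way an application can react to a walltime limit is to catch the termination signal and stop at a safe point. The sketch below assumes the scheduler delivers SIGTERM to the process some time before killing it; whether and when that happens depends on the cluster configuration. The names `StopFlag` and `process_until_stopped` are hypothetical and chosen only for illustration.

```python
import signal


class StopFlag:
    """Records that a termination signal arrived so the main loop can exit cleanly."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.requested = False
        for signum in signals:
            signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the checkpoint itself is written by the main loop.
        self.requested = True


def process_until_stopped(items, flag):
    """Process items in order; return the index of the first unprocessed item."""
    for index, item in enumerate(items):
        if flag.requested:
            return index  # the caller would save this index as its checkpoint
        _ = item * item  # placeholder for the real work (assumption)
    return len(items)
```

The handler does no I/O of its own, so the state is saved only between work items, where it is known to be consistent.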
| | 14 | |
| | 15 | == Impacts of Job Checkpointing == |
| | 16 | |
| | 17 | * Job scripts require provision for requeuing after termination due to timing out or pre-emption. |
| | 18 | * Applications without built-in checkpointing require additional programming effort. |
| | 19 | * Runtime storing of complete program state at regular intervals requires additional time and I/O resources. |
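The programming effort and I/O cost listed above can be illustrated with a minimal sketch of application-level checkpointing. It assumes the whole program state fits in a small JSON file; the function names, the state layout, and the `stop_after` parameter (which only simulates a terminated job) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # rename within one directory replaces the old file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from a checkpoint if present.

    stop_after simulates termination after that many steps in this run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated termination; unsaved steps are lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

A smaller `checkpoint_every` loses less work on termination but spends more time on I/O, which is the trade-off noted in the last bullet above.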
| | 20 | == Why use an HPC Cluster? == |
| | 21 | * '''tasks take too long''' |
| | 22 | * When a task becomes computationally heavy, the work is typically moved from the local laptop or desktop to another system. |
| | 23 | * Your computation may execute more efficiently if the code supports multithreading or multiprocessing. |
| | 24 | |
| | 25 | * '''one server is not enough''' |
| | 26 | * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers. |
| | 27 | |
| | 28 | == Job Checkpointing examples == |
| | 29 | * TBD |