| 6 | | == What is Job Checkpointing? == |
| 7 | | Job checkpointing is the process of ensuring or providing for the job's ability to resume processing - without repeating some or all of the processing already performed after termination due to either end of requested walltime or system outage. |
| 8 | | |
| 9 | | == Why is Job Checkpointing important? == |
| 10 | | |
| 11 | | * Checkpointed jobs can get started sooner out of the job queue pending state with a reduced requested run time. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html|SLURM Scheduling Configuration Guide]. |
| 12 | | * Most production clusters enforce strict walltime limits. (See see [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].) |
| 13 | | * A parallel MPI job can fail as soon as a single node in use crashes. |
| 14 | | * Cloud-based job queues with high availability can enforce the use of pre-emptible job queues. |
| 15 | | |
| 16 | | == Impacts of Job Checkpointing == |
| 17 | | |
| 18 | | * Job scripts require provision for requeuing after termination due to timing out or pre-emption. |
| 19 | | * Applications without built-in checkpointing require additional programming effort. |
| 20 | | * Runtime storing of complete program state at regular intervals requires additional time and I/O resources. |
| 29 | | == Job Checkpointing examples == |
| 30 | | * TBD |
| | 14 | == What is job checkpointing? == |
| | 15 | Job checkpointing is the practice of saving enough of a job's state that, after termination due to either the end of its requested walltime or a system outage, the job can resume processing without repeating some or all of the work already performed. |
| | 16 | |
| | 17 | == Why is job checkpointing important? == |
| | 18 | |
| | 19 | * Checkpointed jobs can request a shorter run time, so they can leave the pending state of the job queue and start sooner. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].) |
| | 20 | * Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].) |
| | 21 | * A parallel MPI job can fail as soon as a single node in use crashes. |
| | 22 | * Cloud-based job queues with high availability may require jobs to run in pre-emptible job queues. |
| | 23 | |
| | 24 | == Impacts of job checkpointing == |
| | 25 | |
| | 26 | * Job scripts must provide for requeuing after termination due to a timeout or pre-emption. |
| | 27 | * Application software without built-in checkpointing requires additional programming effort. |
| | 28 | * Storing complete program state at regular intervals during a run requires additional time and I/O resources. |
| | 29 | |
| | 30 | == Software with built-in job checkpointing == |
| | 31 | |
| | 32 | See [wiki:Workshops/JobCheckpointing/SoftwareWithBuiltinCheckpointing Software With Built-in Checkpointing] |
| | 33 | |
| | 34 | == Job checkpointing examples == |
| | 35 | See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples] |
| | 36 | |
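
As a rough illustration of the "additional programming effort" and the periodic state saving noted under the impacts of job checkpointing, the following is a minimal Python sketch of application-level checkpointing. It is not taken from any cluster's documentation. The file name `checkpoint.json`, the functions `load_checkpoint`, `save_checkpoint`, and `run`, and the `stop_after` parameter (which only simulates the job being cut off at its walltime limit) are all hypothetical names chosen for this example. The state is written to a temporary file and then renamed, so a job killed in the middle of a write should still find the previous checkpoint intact.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name for this sketch


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so readers see old or new, never half.
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from the checkpoint if present.

    stop_after (an assumption of this sketch) simulates termination: the
    function returns None after that many steps, discarding unsaved progress.
    """
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated end of walltime; in-memory state is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run(CHECKPOINT_PATH, 1000)` repeatedly, for example once per requeued job, continues from the last saved step rather than from step 0. The checkpoint interval is the trade-off mentioned above: a smaller `checkpoint_every` loses less work on termination but spends more time on I/O.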