Changes between Version 4 and Version 5 of Workshops/JobCheckpointing


Timestamp:
01/21/26 12:36:24
Author:
Carl Baribault
Comment:

Moved "why…HPC" first, added links for software-with-built-in, job-examples

  • Workshops/JobCheckpointing

    v4 v5  
    44(content subject to change prior to the workshop)
    55
    6 == What is Job Checkpointing? ==
    7 Job checkpointing is the process of providing for a job's ability to resume processing after termination, due to either the end of the requested walltime or a system outage, without repeating some or all of the processing already performed.
    8 
    9 == Why is Job Checkpointing important? ==
    10 
    11 * Checkpointed jobs can request a shorter run time and therefore start sooner from the job queue's pending state. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
    12 * Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
    13 * A parallel MPI job can fail as soon as a single node in use crashes.
    14 * Cloud-based job queues with high availability can enforce the use of pre-emptible job queues.
    15 
    16 == Impacts of Job Checkpointing ==
    17 
    18 * Job scripts require provision for requeuing after termination due to timing out or pre-emption.
    19 * Applications without built-in checkpointing require additional programming effort.
    20 * Runtime storing of complete program state at regular intervals requires additional time and I/O resources.
    216== Why use an HPC Cluster? ==
    227* '''tasks take too long'''
     
    2712 * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.
    2813
    29 == Job Checkpointing examples ==
    30 * TBD
     14== What is job checkpointing? ==
     15Job checkpointing is the process of providing for a job's ability to resume processing after termination, due to either the end of the requested walltime or a system outage, without repeating some or all of the processing already performed.
     16
     17== Why is job checkpointing important? ==
     18
     19* Checkpointed jobs can request a shorter run time and therefore start sooner from the job queue's pending state. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
     20* Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
     21* A parallel MPI job can fail as soon as a single node in use crashes.
     22* Cloud-based job queues with high availability can enforce the use of pre-emptible job queues.
     23
     24== Impacts of job checkpointing ==
     25
     26* Job scripts require provision for requeuing after termination due to timing out or pre-emption.
     27* Application software without built-in checkpointing requires additional programming effort.
     28* Runtime storing of complete program state at regular intervals requires additional time and I/O resources.
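
As a rough illustration of the programming effort and I/O cost listed above, the following Python sketch shows one common pattern for application-level checkpointing: load the last saved state if it exists, do the work in steps, and periodically write the state to a temporary file that is then renamed over the checkpoint file. The state layout (`step`, `total`), the checkpoint interval, and the `stop_after` parameter (which only simulates the job being killed) are assumptions made for this example and are not taken from the workshop materials.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the checkpoint.

    os.replace is atomic on POSIX file systems, so a job killed mid-write
    leaves the previous checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # this is the extra I/O cost of checkpointing
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing at regular intervals.

    stop_after is a hypothetical knob that simulates termination (for
    example, hitting the walltime limit) after that many steps in this run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            # Simulated kill: work done since the last checkpoint is lost.
            return state
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # record the finished state
    return state
```

Calling `run` a second time with the same checkpoint path resumes from the last saved step rather than from zero, so only the steps completed after the most recent checkpoint are repeated. A shorter interval loses less work on termination but spends more time writing to disk.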
     29
     30== Software with built-in job checkpointing ==
     31
     32See [wiki:Workshops/JobCheckpointing/SoftwareWithBuiltinCheckpointing Software With Built-in Checkpointing]
     33
     34== Job checkpointing examples ==
     35See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples]
     36