Changes between Version 5 and Version 6 of Workshops/JobCheckpointing


Timestamp: 01/22/26 11:08:11 (2 days ago)
Author: Carl Baribault
Comment: Added detail to checkpointing process

== What is job checkpointing? ==
Job checkpointing is the process of programming a job's application software to save partial results and to resume processing after termination at the end of the requested walltime (walltime termination).

A checkpointed job '''must''' be able to perform the following.

* The application must record its progress (see the following) at one or both of the following times.
 * '''At regular time intervals''' on its own (This option is preferred for very long running jobs, where system crashes are more likely.) - '''or'''
 * '''After catching a terminate signal''' ('''SIGTERM''') from the operating system, where the signal is arranged '''in the job script''' early enough that recording can complete before walltime termination. For example, in your job script...
  * either via sbatch directives
{{{
# --- Append to output and error files ---
#SBATCH --open-mode=append
# --- Enable automatic requeue ---
#SBATCH --requeue
# --- Send SIGTERM 2 minutes before walltime ---
#SBATCH --signal=TERM@120
}}}
  * or via the bash '''timeout''' command followed by '''scontrol requeue'''
{{{
timeout 23h ./my_simulation || scontrol requeue $SLURM_JOB_ID
}}}
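  * or via a bash '''trap''' handler in the job script itself. The following is a minimal sketch, not the workshop's own example: save_checkpoint is a hypothetical stand-in for the application's real checkpoint routine, and the final kill merely simulates walltime termination (a real handler would typically also run scontrol requeue).
{{{
#!/bin/bash
STATE_FILE="checkpoint.dat"
CURRENT_STEP=0

# Hypothetical checkpoint routine - replace with the application's own.
save_checkpoint() {
    echo "step=$CURRENT_STEP" > "$STATE_FILE"
}

# On SIGTERM (e.g. delivered by --signal=TERM@120), record state and stop.
trap 'save_checkpoint; exit 0' TERM

# Stand-in work loop; each iteration represents one unit of real work.
for CURRENT_STEP in 1 2 3; do
    sleep 0.1
done

# Simulate walltime termination by signaling ourselves.
kill -TERM $$
sleep 5   # never completes; the trap fires and exits first
}}}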

* The application must record both the work already performed and its current (or recent) '''state of execution''', or '''state'''.

* When the job is '''requeued''', the application must read the recorded '''state''' and resume from that point of execution with the previous work preserved.
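Putting these requirements together, the resume step can be sketched in bash as follows. This is a minimal illustration, not the workshop's own code; the file name checkpoint.dat, the step= format, and the five-step loop are all hypothetical choices standing in for a real application's state handling.

{{{
#!/bin/bash
STATE_FILE="checkpoint.dat"
LAST_STEP=5

# Read the recorded state if a previous run left one behind;
# otherwise start from the beginning.
if [ -f "$STATE_FILE" ]; then
    START_STEP=$(sed -n 's/^step=//p' "$STATE_FILE")
else
    START_STEP=0
fi

# Resume the work loop from the recorded point, recording
# progress after each completed unit of work.
for (( step = START_STEP + 1; step <= LAST_STEP; step++ )); do
    sleep 0.1                          # stand-in for one unit of real work
    echo "step=$step" > "$STATE_FILE"  # record progress
done
}}}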

== Why is job checkpointing important - and beneficial? ==

* Checkpointed jobs can leave the pending state and start sooner, because a reduced requested run time makes them eligible for backfill scheduling. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
* Checkpointed jobs compensate for the strict walltime limits enforced by Cypress, LONI, and most other production clusters. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
* Checkpointing protects parallel MPI jobs - which can fail as soon as a single node in use crashes - especially long running jobs that record at regular intervals.
* Checkpointed jobs can tolerate the strictly enforced pre-emption (SIGTERM signals) of certain cloud-based job queues with high availability.

== What are the impacts of job checkpointing? ==

* Job scripts require provision for requeuing after termination due to timing out or pre-emption.
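For example, a requeued run can be detected in the job script via the SLURM_RESTART_COUNT environment variable, which SLURM sets on jobs that have been restarted. The sketch below is illustrative only; the echo messages are placeholders for the real start-vs-resume branch logic.

{{{
#!/bin/bash
# SLURM_RESTART_COUNT is unset on a job's first run and holds the
# number of restarts on a requeued run; treat unset as zero.
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    echo "requeued run ${SLURM_RESTART_COUNT}: resuming from checkpoint"
else
    echo "first run: starting from scratch"
fi
}}}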