Changes between Version 8 and Version 9 of Workshops/JobCheckpointing


Timestamp:
03/13/2026 09:16:32 PM (2 days ago)
Author:
Carl Baribault
Comment:

Simplified introduction

  • Workshops/JobCheckpointing

    v8 v9  
    1212
    1313== What is job checkpointing? ==
    14 Job checkpointing is the process of ensuring (or programming) that the job's application software can save partial results and resume processing after termination at the end of the requested walltime (walltime termination).
     14=== Recovery from logical or external interruption ===
     15Job checkpointing is a job's ability to recover after being interrupted, whether by its own logic or externally.
    1516
    16 A checkpointed job '''must''' be able to perform the following.
     17== Pro's of job checkpointing ==
    1718
    18 * The application must record its progress (see below) at one or both of the following times.
    19  * '''At regular time intervals''' on its own (This option is preferred for very long - or sufficiently long - running jobs where system crashes are more likely.) - '''or'''
    20  * '''After catching a terminate signal''' ('''SIGTERM''') from the Operating System, where the signal is programmed '''in the job script''' to allow recording to complete before walltime termination. For example, in your job script...
    21   * either via sbatch directives
    22 {{{
    23    # --- Append to output and error files ---
    24    #SBATCH --open-mode=append
    25    # --- Enable automatic requeue ---
    26    #SBATCH --requeue
    27    # --- Send SIGTERM 2 minutes before walltime ---
    28    #SBATCH --signal=TERM@120
    29 }}}
    30   * or bash '''timeout''' command followed by '''requeue'''
    31 {{{
    32 timeout 23h ./my_simulation || scontrol requeue $SLURM_JOB_ID
    33 }}}
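
The SIGTERM-trap approach above can be sketched as a small bash handler. This is a minimal illustration, not part of this wiki's examples: the '''save_checkpoint''' function, the '''state.chk''' file name, and the '''step''' counter are placeholders, and the self-sent signal merely simulates the warning Slurm sends with '''--signal=TERM@120'''. A real job would tell its application to flush state inside the handler.
{{{
#!/bin/bash
# Sketch: trap the SIGTERM that Slurm sends (per --signal=TERM@120)
# and record progress before walltime termination.
ckpt="state.chk"      # placeholder checkpoint file name
interrupted=0

save_checkpoint() {
    echo "step=$step" > "$ckpt"   # record execution progress
    interrupted=1                 # tell the main loop to stop cleanly
}
trap save_checkpoint TERM

# Resume: restore progress from an earlier checkpoint, if any.
step=0
[ -f "$ckpt" ] && . "$ckpt"

# Simulate Slurm's warning signal arriving mid-run.
( sleep 2; kill -TERM $$ ) &

while [ "$interrupted" -eq 0 ]; do
    step=$((step + 1))   # stand-in for the application's real work
    sleep 1
done
wait
}}}
After the trapped signal, '''state.chk''' holds the last recorded step, so a requeued run resumes counting where the previous run stopped instead of starting over.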
     19* '''Checkpointed jobs can get started sooner''' out of the job queue pending state with a reduced requested run time. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
     20* '''More checkpointed jobs can run simultaneously''' under the strict limits enforced by Cypress, LONI, and most other production clusters. (See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
     21* '''Checkpointing mitigates job failures due to node crashes''' - especially for long running parallel MPI jobs.
     22* '''Checkpointed jobs can handle frequent job pre-emption''' - especially for certain cloud-based job queues with high availability.
    3423
    35 * The application must record both the work already performed as well as the current or recent '''state of execution''' or '''state'''.
     24== Con's of job checkpointing ==
    3625
    37 * When the job is '''requeued''', the application must read the recorded '''state''' and resume from that point of execution with the previous work preserved.
     26Checkpointing requires a level of coordination between the job script and the job's application software in order to perform the following. See [wiki:Workshops/JobCheckpointing/Examples Job Checkpointing Examples].
    3827
    39 == Why is job checkpointing important - and beneficial? ==
    40 
    41 * Checkpointed jobs can get started sooner out of the job queue pending state with a reduced requested run time. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
    42 * Checkpointed jobs compensate for strict walltime limits enforced by Cypress, LONI, and most other production clusters. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
    43 * Checkpointed jobs running parallel MPI (especially long running jobs recording at regular intervals) can fail as soon as a single node in use crashes.
    44 * Checkpointed jobs running in certain cloud-based job queues with high availability can experience strictly enforced job pre-emption (SIGTERM signals).
    45 * On Cypress, as an example, there are many more nodes available for multi-node jobs with 24-hour time limit. See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].
    46 
    47 == What are the impacts of job checkpointing? ==
    48 
    49 * Job scripts require provision for requeuing after termination due to timing out or pre-emption.
    50 * Application software without built-in checkpointing requires additional programming effort.
    51 * Runtime storing of complete program state at regular intervals requires additional time and I/O resources.
     28* '''Recording execution progress'''...
     29 * at regular time intervals '''and/or'''
     30 * after catching a terminate signal ('''SIGTERM''') from the Operating System.
     31 * '''Execution progress includes'''...
     32  * the intermediate results '''and'''
     33  * the current '''state of execution''' or '''state'''.
     34* '''Requeue''' itself (or otherwise re-submit) as needed after interruption.
     35* '''Resume''' from the recorded state after being restarted from the job queue.
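
Taken together, the record/requeue/resume requirements above might appear in a job script along these lines. This is a hypothetical job-script fragment (not runnable outside Slurm): '''my_app''', its '''--resume'''/'''--checkpoint''' options, '''state.chk''', and '''done.flag''' are illustrative placeholders for whatever the application actually provides.
{{{
#!/bin/bash
#SBATCH --open-mode=append      # keep output across requeues
#SBATCH --requeue               # allow Slurm to requeue the job
#SBATCH --signal=TERM@120       # warn the job 2 minutes before walltime

# Resume from a checkpoint if a previous run left one behind
# (my_app and its options are placeholders).
if [ -f state.chk ]; then
    ./my_app --resume state.chk
else
    ./my_app --checkpoint state.chk
fi

# If the run was interrupted rather than finished, put the job
# back in the queue to continue from the recorded state.
if [ -f state.chk ] && [ ! -f done.flag ]; then
    scontrol requeue "$SLURM_JOB_ID"
fi
}}}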
    5236
    5337== Software with built-in job checkpointing ==