Changes between Version 2 and Version 3 of Workshops/JobCheckpointing


Timestamp:
01/20/26 16:14:47 (19 hours ago)
Author:
Carl Baribault
Comment:

Added concepts section

  • Workshops/JobCheckpointing

    v2 v3  
    11[[PageOutline]]
    22= HPC Workshop Spring 2026 =
    3 = Module 8 of 8 - Job Checkpointing =
    4 Coming soon
     3= Module 8 of 8 - Job Checkpointing (Under construction) =
     4(content subject to change prior to the workshop)
     5
     6== What is Job Checkpointing? ==
     7Job checkpointing is the practice of saving a job's state so that, after termination due to either the end of the requested walltime or a system outage, the job can resume processing without repeating some or all of the work already performed.
     8
     9== Why is Job Checkpointing important? ==
     10
     11* Checkpointed jobs can request a shorter run time and can therefore leave the pending state of the job queue sooner. (See "backfill scheduling" in the [https://slurm.schedmd.com/sched_config.html SLURM Scheduling Configuration Guide].)
     12* A parallel MPI job can fail as soon as a single node in use crashes.
     13* Most production clusters enforce strict walltime limits. (See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
     14
     15== Impacts of Job Checkpointing ==
     16
     17* Job scripts must provide for requeuing after termination due to a timeout or preemption.
     18* Applications without built-in checkpointing require additional programming effort.
     19* Storing the complete program state at regular intervals during the run requires additional time and I/O resources.
     20== Why use an HPC Cluster? ==
     21* '''tasks take too long'''
     22 * When a task becomes computationally heavy, the work is typically moved from the local laptop or desktop to a more capable system.
     23 * Your computation may execute more efficiently if the code supports multithreading or multiprocessing.
     24 
     25* '''one server is not enough'''
     26 * When a single computer can’t handle the required computation or analysis, the work is carried out on larger groups of servers.
     27
     28== Job Checkpointing examples ==
     29* TBD
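
As one possible illustration of the programming effort noted under Impacts for applications without built-in checkpointing, the sketch below saves a small program state at regular intervals and resumes from the last saved state on restart. It is a minimal sketch, not the workshop's official example. The file name `checkpoint.json`, the function names, and the toy workload (summing integers) are all hypothetical. It also assumes that the scheduler delivers SIGTERM to the process before the job is killed, which depends on the site configuration and on how the job is launched.

```python
import json
import os
import signal
import tempfile
import threading


def save_checkpoint(state, path):
    """Write the state atomically: write a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before renaming
    os.replace(tmp_path, path)  # an interrupted write never corrupts the old file


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(total_steps, path, checkpoint_every=1000, should_stop=lambda: False):
    """Toy workload: sum the integers 0..total_steps-1, resuming from `path` if present.

    `should_stop` is polled once per step; when it returns True the state is
    saved and the function returns early. Returns (finished, total).
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], total_steps):
        if should_stop():
            save_checkpoint(state, path)
            return False, state["total"]
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)  # regular-interval checkpoint
    save_checkpoint(state, path)
    return True, state["total"]


if __name__ == "__main__":
    stop_event = threading.Event()
    # The handler only records the request; the main loop saves state and exits.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
    finished, total = run(100_000, "checkpoint.json",
                          checkpoint_every=10_000, should_stop=stop_event.is_set)
    print("finished" if finished else "stopped early", total)
```

Running the script a second time after an early stop continues from the saved `next_step` rather than from step 0. The checkpoint interval is a trade-off: shorter intervals lose less work on termination but cost more time and I/O.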