Changes between Version 9 and Version 10 of Workshops/JobCheckpointing


Timestamp: 03/13/2026 09:41:26 PM (2 days ago)
Author: Carl Baribault
Comment: Added sacctmgr reference

  • Workshops/JobCheckpointing

    v9 v10  
    18 18
    19 19 * '''Checkpointed jobs can get started sooner''' out of the job queue pending state with a reduced requested run time. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html|SLURM Scheduling Configuration Guide].)
    20 * '''More checkpointed jobs can run simultaneously''' due to strict limits enforced by Cypress, LONI, and most other production clusters. (See see '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
     20* '''More checkpointed jobs can run simultaneously''' due to strict limits enforced by Cypress, LONI, and most other production clusters.
     21 * See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].
     22 * See also the command
     23   * '''sacctmgr show qos format=Name,!MaxWall,!MaxNodesPerUser | grep -E "normal|long"'''
    21 24 * '''Checkpointing mitigates job failures due to node crashes''' - especially for long running parallel MPI jobs.
    22 25 * '''Checkpointed jobs can handle frequent job pre-emption''' - especially for certain cloud-based job queues with high availability.
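
A note for anyone copying the sacctmgr command out of the wiki source above: the `!` characters before `MaxWall` and `MaxNodesPerUser` are Trac escapes that stop CamelCase words from becoming wiki links, so they are not part of the command itself. The following is a minimal sketch of the same query as it would be typed in a shell. The helper name `filter_qos` is hypothetical and exists only to make the filter step reusable. The QOS names `normal` and `long` are taken from the wiki text and may differ on other clusters.

```shell
# Hypothetical helper: keep only the lines whose text matches the given
# extended regex (defaults to the QOS names mentioned in the wiki text).
filter_qos() {
    grep -E "${1:-normal|long}"
}

# Run the query only where sacctmgr is installed (i.e. on a SLURM cluster).
# Note: no '!' before MaxWall or MaxNodesPerUser here.
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr show qos format=Name,MaxWall,MaxNodesPerUser | filter_qos
fi
```

The output columns are the QOS name, its maximum wall time, and its per-user node limit. These are the limits that a shorter, checkpointed job has to fit within.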