Changes between Version 9 and Version 10 of Workshops/JobCheckpointing
- Timestamp: 03/13/2026 09:41:26 PM (2 days ago)
Workshops/JobCheckpointing
v9  v10
18  18
19  19     * '''Checkpointed jobs can get started sooner''', leaving the job queue's pending state earlier because they request a shorter run time. (See "backfill scheduling" in [https://slurm.schedmd.com/sched_config.html|SLURM Scheduling Configuration Guide].)
20      -  * '''More checkpointed jobs can run simultaneously''' due to strict limits enforced by Cypress, LONI, and most other production clusters. (See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].)
    20  +  * '''More checkpointed jobs can run simultaneously''' due to strict limits enforced by Cypress, LONI, and most other production clusters.
    21  +    * See '''--qos=normal''' in [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)].
    22  +    * See also the command
    23  +      * '''sacctmgr show qos format=Name,!MaxWall,!MaxNodesPerUser | grep -E "normal|long"'''
21  24     * '''Checkpointing mitigates job failures due to node crashes''' - especially for long-running parallel MPI jobs.
22  25     * '''Checkpointed jobs can handle frequent job pre-emption''' - especially for certain cloud-based job queues with high availability.
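
The benefits listed above all depend on a job being able to stop and later resume where it left off. The following is a minimal sketch of that pattern, not the workshop's own script. The function name `run_chunk`, the checkpoint file name, the step counts, and the `#SBATCH` values are hypothetical and chosen only for illustration. Each run does a bounded amount of work, saves its progress after every step, and picks up from the saved step on the next submission.

```shell
#!/bin/bash
# Hypothetical SLURM directives (illustrative values, not site recommendations):
#SBATCH --qos=normal
#SBATCH --time=01:00:00

# run_chunk CKPT_FILE TOTAL_STEPS STEPS_PER_RUN
# Resumes from the step stored in CKPT_FILE (or 0 if the file does not exist),
# performs at most STEPS_PER_RUN steps, and saves progress after each step.
run_chunk() {
    ckpt_file=$1
    total_steps=$2
    steps_per_run=$3

    # Resume from the last checkpoint if one exists.
    step=0
    if [ -f "$ckpt_file" ]; then
        step=$(cat "$ckpt_file")
    fi

    # Stop after steps_per_run steps so the run fits a short time request.
    last=$(( step + steps_per_run ))
    while [ "$step" -lt "$total_steps" ] && [ "$step" -lt "$last" ]; do
        step=$(( step + 1 ))          # stand-in for one unit of real work
        echo "$step" > "$ckpt_file"   # write the checkpoint
    done

    if [ "$step" -ge "$total_steps" ]; then
        echo "finished at step $step"
    else
        echo "stopped at step $step; resubmit to continue"
    fi
}

# Example invocation: 10 steps in total, at most 4 per run.
run_chunk "${CKPT:-checkpoint.txt}" 10 4
```

Because progress is written after every step, a run that is cut short by a node crash or pre-emption loses at most the step in progress, and the next submission continues from the saved value.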
