[[PageOutline]]

= Job Checkpointing Examples (Under construction) =

(content subject to change prior to the workshop)

== Python Example ==

=== Checkpointed, self restarting job ===

Here is a fully self-restarting job script; the accompanying minimal working example of a checkpointed Python application follows further below. Note that we're using the latest available Python '''module anaconda3/2023.07''' in partition '''centos7'''.

{{{
#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --partition=centos7
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --mem=16G

# --- Logging ---
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append

# --- Enable automatic requeue ---
#SBATCH --requeue

# --- Send SIGTERM 2 minutes before walltime ---
#SBATCH --signal=TERM@120

set -euo pipefail

echo "Job started at $(date)"
echo "SLURM_JOB_ID = ${SLURM_JOB_ID}"
echo "SLURM_RESTART_COUNT = ${SLURM_RESTART_COUNT:-0}"

# ---------------------------------------------
# Application-specific configuration
# ---------------------------------------------
CHECKPOINT_DIR="$PWD/checkpoints"
CHECKPOINT_FILE="${CHECKPOINT_DIR}/state.chk"

mkdir -p "${CHECKPOINT_DIR}"

# ---------------------------------------------
# Launch the application
# ---------------------------------------------
# Your application must:
#   1) Load checkpoint if it exists
#   2) Catch SIGTERM
#   3) Write checkpoint
#   4) exit(99)

module load anaconda3/2023.07

# my_simulation.py must be executable (chmod +x my_simulation.py).
# "set -e" would abort this script on a non-zero exit from srun,
# so capture the exit code with "||" instead of reading $? afterwards.
EXIT_CODE=0
srun ./my_simulation.py \
    --checkpoint "${CHECKPOINT_FILE}" || EXIT_CODE=$?

echo "Application exited with code ${EXIT_CODE}"

# ---------------------------------------------
# Restart logic
# ---------------------------------------------
if [[ ${EXIT_CODE} -eq 0 ]]; then
    echo "INFO: Job completed successfully"
    exit 0
elif [[ ${EXIT_CODE} -eq 99 ]]; then
    echo "INFO: Checkpoint written, requeuing job"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
else
    echo "ERROR: Job failed with unexpected exit code"
    exit ${EXIT_CODE}
fi
}}}

=== Checkpointed application in Python ===

Here is the accompanying minimal working example of a checkpointed Python application, in file '''my_simulation.py'''. It reads the checkpoint path from the '''--checkpoint''' option passed by the job script.

{{{
#!/usr/bin/env python3
import argparse
import json
import os
import signal
import sys
import time

# Checkpoint path; the job script passes it via --checkpoint
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint", default="checkpoints/state.chk",
                    help="path of the checkpoint file")
CHECKPOINT_FILE = parser.parse_args().checkpoint


def save_checkpoint(i):
    # Create the checkpoint directory if the path contains one
    os.makedirs(os.path.dirname(CHECKPOINT_FILE) or ".", exist_ok=True)
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"step": i}, f)


def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r") as f:
            return json.load(f)["step"]
    return 0


def term_handler(signum, frame):
    print("SIGTERM received, saving checkpoint", flush=True)
    save_checkpoint(current_step)
    sys.exit(99)  # <- special "requeue me" code


# Load the state before installing the handler, so the handler
# never runs while current_step is still undefined.
current_step = load_checkpoint()
signal.signal(signal.SIGTERM, term_handler)
print(f"Resuming from step {current_step}", flush=True)

for i in range(current_step, 1_000_000):
    current_step = i
    time.sleep(1)  # simulate work
}}}

== R Example ==

=== Checkpointed, self restarting job ===

Here is a fully self-restarting job script; the accompanying minimal working example of a checkpointed R application follows further below.

{{{
#!/bin/bash
#SBATCH --job-name=r_checkpoint_demo
#SBATCH --partition=centos7
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append
#SBATCH --requeue
# Send SIGTERM 120s before walltime. The "B:" prefix delivers it to this
# batch shell (by default only the job steps are signaled), so that the
# trap below can react to it.
#SBATCH --signal=B:TERM@120

set -euo pipefail

mkdir -p logs checkpoints
rm -f TERM.flag   # remove a stale flag left over from an earlier run

# Trap SIGTERM from SLURM: create a file flag that R checks for
trap 'echo "SIGTERM received, creating TERM.flag"; touch TERM.flag' TERM

echo "Starting R checkpointing run at $(date)"

# load the R module
module load R/4.4.1

# Run the R script under srun in the background. Bash runs a trap only
# after a foreground command has finished, whereas "wait" is interruptible.
srun Rscript checkpoint.R &
srun_pid=$!

rc=0
wait "${srun_pid}" || rc=$?
# If the trap interrupted "wait", wait again for the real srun exit code
if [[ ${rc} -gt 128 && -f TERM.flag ]]; then
    rc=0
    wait "${srun_pid}" || rc=$?
fi
echo "R exited with code: $rc"

if [[ $rc -eq 0 ]]; then
    echo "INFO: Finished successfully."
    exit 0
elif [[ $rc -eq 99 ]]; then
    echo "INFO: Checkpoint written (exit 99). Requeuing job..."
    rm -f TERM.flag
    scontrol requeue "$SLURM_JOB_ID"
    exit 0
else
    echo "ERROR: Unexpected failure (code $rc)."
    exit $rc
fi
}}}

=== Checkpointed application in R ===

Here is the accompanying minimal working example of a checkpointed R application, in file '''checkpoint.R'''.

{{{
#!/usr/bin/env Rscript

# Simple checkpointing/resume pattern for long runs in R.
# - Saves state as checkpoints/state.rds
# - Auto-resumes if that file exists
# - Periodically checkpoints every N iterations
# - If a TERM flag (created by the shell trap in the job script) is detected,
#   saves and exits with status 99

checkpoint_file <- "checkpoints/state.rds"
term_flag <- "TERM.flag"  # created by the shell trap

dir.create("checkpoints", showWarnings = FALSE, recursive = TRUE)

# --- Parameters you can tune ---
max_steps <- 1e6L
checkpoint_every_n <- 200L  # save every N iterations
sleep_seconds <- 0.05       # simulate work
verbose <- TRUE

# --- Load or initialize state ---
state <- list(step = 0L, results = numeric())
if (file.exists(checkpoint_file)) {
  if (verbose) cat("Resuming from checkpoint:", checkpoint_file, "\n")
  state <- readRDS(checkpoint_file)
} else {
  if (verbose) cat("Starting fresh run\n")
}

# --- Utility: save checkpoint ---
save_checkpoint <- function(st) {
  saveRDS(st, checkpoint_file)
  if (verbose) {
    cat(sprintf("Checkpoint saved at step %d -> %s\n", st$step, checkpoint_file))
  }
}

# --- Main work loop ---
# seq_len() yields an empty sequence when all steps are already done
# (seq.int(a, b) would count backwards if a > b).
for (i in state$step + seq_len(max_steps - state$step)) {
  state$step <- i

  # Simulate "work" (replace with your compute kernel)
  # e.g., update some running statistic
  x <- sin(i * 0.001) + rnorm(1, sd = 0.01)
  state$results <- c(state$results, x)

  if (sleep_seconds > 0) Sys.sleep(sleep_seconds)

  # Periodic checkpoint
  if ((i %% checkpoint_every_n) == 0L) {
    save_checkpoint(state)
  }

  # Respect pre-timeout signal from SLURM via a file flag
  if (file.exists(term_flag)) {
    cat("TERM flag detected. Saving final checkpoint and exiting with code 99.\n")
    save_checkpoint(state)
    quit(status = 99, save = "no")
  }
}

# Finished normally
save_checkpoint(state)
cat("Completed all steps. Exiting with code 0.\n")
quit(status = 0, save = "no")
}}}
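
A note that applies to both examples above: they overwrite the checkpoint file in place, so a job that is killed in the middle of a write (for example when the 120-second grace period runs out) could leave a truncated checkpoint behind. The following Python sketch shows one common way to reduce that risk: write to a temporary file in the same directory, then rename it over the old checkpoint. The function names `save_checkpoint_atomic` and `load_checkpoint` and the JSON state layout are illustrative assumptions, not part of the examples above, and rename guarantees on network file systems may differ from those on local disks.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path, state):
    """Write state as JSON to path without exposing a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Temporary file in the same directory, so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # rename over the previous checkpoint
    except BaseException:
        # Remove the temporary file if anything went wrong, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return default


if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        demo_path = os.path.join(demo_dir, "checkpoints", "state.chk")
        save_checkpoint_atomic(demo_path, {"step": 42})
        print(load_checkpoint(demo_path, {"step": 0}))
```

With this approach a reader of the checkpoint file should see either the previous complete state or the new complete state. The same idea carries over to R by calling `saveRDS()` on a temporary file name and then `file.rename()` to the final name.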