Job Checkpointing Examples (Under construction)
(content subject to change prior to the workshop)
Python Example
Checkpointed, self-restarting job
The job script below restarts itself: when the application writes a checkpoint and exits with code 99, the script requeues the job. The accompanying minimal working example of a checkpointed Python application follows further below.
Note that we're using the latest available Python module anaconda3/2023.07 in partition centos7.
#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --partition=centos7
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --mem=16G
# --- Logging ---
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append
# --- Enable automatic requeue ---
#SBATCH --requeue
# --- Send SIGTERM 2 minutes before walltime ---
#SBATCH --signal=TERM@120
set -euo pipefail
echo "Job started at $(date)"
echo "SLURM_JOB_ID = ${SLURM_JOB_ID}"
echo "SLURM_RESTART_COUNT = ${SLURM_RESTART_COUNT:-0}"
# ---------------------------------------------
# Application-specific configuration
# ---------------------------------------------
CHECKPOINT_DIR="$PWD/checkpoints"
CHECKPOINT_FILE="${CHECKPOINT_DIR}/state.chk"
mkdir -p "${CHECKPOINT_DIR}"
# ---------------------------------------------
# Launch the application
# ---------------------------------------------
# Your application must:
# 1) Load checkpoint if it exists
# 2) Catch SIGTERM
# 3) Write checkpoint
# 4) exit(99)
module load anaconda3/2023.07
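# Capture the exit code with "||"; otherwise "set -e" would abort the
# script on a non-zero exit and the restart logic below would never run
EXIT_CODE=0
srun ./my_simulation.py \
    --checkpoint "${CHECKPOINT_FILE}" || EXIT_CODE=$?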
echo "Application exited with code ${EXIT_CODE}"
# ---------------------------------------------
# Restart logic
# ---------------------------------------------
if [[ ${EXIT_CODE} -eq 0 ]]; then
    echo "INFO: Job completed successfully"
    exit 0
elif [[ ${EXIT_CODE} -eq 99 ]]; then
    echo "INFO: Checkpoint written, requeuing job"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
else
    echo "ERROR: Job failed with unexpected exit code"
    exit "${EXIT_CODE}"
fi
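The exit-code convention used by the restart logic above (0 = finished, 99 = checkpoint written, anything else = failure) can be tried out locally without Slurm. The following is a minimal sketch, not part of the job script; the names `classify_exit` and `run_and_classify` are hypothetical helpers introduced only for illustration, and the child process is a stand-in for a real application.

```python
import subprocess
import sys

REQUEUE_CODE = 99  # same "requeue me" code as in the job script


def classify_exit(code):
    """Map an application exit code to the action the job script takes."""
    if code == 0:
        return "done"
    if code == REQUEUE_CODE:
        return "requeue"
    return "fail"


def run_and_classify(argv):
    """Run a command and return (exit_code, action)."""
    code = subprocess.run(argv).returncode
    return code, classify_exit(code)


if __name__ == "__main__":
    # Stand-in child that exits with 99, as a checkpointed application would
    child = [sys.executable, "-c", "import sys; sys.exit(99)"]
    print(run_and_classify(child))
```

Swapping the stand-in child for the real application command gives a quick check that the application honours the convention before submitting the job.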
Checkpointed application in Python
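The file my_simulation.py below is the accompanying minimal working example of a checkpointed Python application. It must be executable (chmod +x my_simulation.py), because the job script launches it as ./my_simulation.py.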
#!/usr/bin/env python3
import argparse
import json
import os
import signal
import sys
import time
# Read the checkpoint path that the job script passes via --checkpoint
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint", default="checkpoints/state.chk")
CHECKPOINT_FILE = parser.parse_args().checkpoint
def save_checkpoint(i):
    os.makedirs(os.path.dirname(CHECKPOINT_FILE) or ".", exist_ok=True)
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"step": i}, f)
def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r") as f:
            return json.load(f)["step"]
    return 0
def term_handler(signum, frame):
    print("SIGTERM received, saving checkpoint")
    save_checkpoint(current_step)
    sys.exit(99)  # <- special "requeue me" code
# Define current_step before installing the handler, because the handler reads it
current_step = load_checkpoint()
signal.signal(signal.SIGTERM, term_handler)
print(f"Resuming from step {current_step}")
for i in range(current_step, 1_000_000):
    current_step = i
    time.sleep(1)  # simulate work
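The example above writes the checkpoint from inside the signal handler and overwrites the checkpoint file in place. A possible variation, sketched here under assumptions, lets the handler only set a flag, checkpoints periodically from the main loop, and replaces the file atomically so that an interrupted write cannot leave a truncated checkpoint. The names `request_stop`, `save_checkpoint_atomic`, `run`, and `main`, and the `checkpoint_every` parameter, are hypothetical. Unlike the example above, this sketch stores the number of completed steps, so a resumed run continues with the next step.

```python
import json
import os
import signal
import tempfile

stop_requested = False  # set by the signal handler, polled by the main loop


def request_stop(signum, frame):
    """Only record that SIGTERM arrived; the main loop does the file I/O."""
    global stop_requested
    stop_requested = True


def save_checkpoint_atomic(path, completed_steps):
    """Write to a temporary file in the target directory, then rename it."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": completed_steps}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers see the old or the new file, never a partial one


def run(path, start, total_steps, checkpoint_every=100, work=lambda i: None):
    """Return 0 when all steps are done, 99 when stopped early by SIGTERM."""
    for i in range(start, total_steps):
        work(i)
        completed = i + 1
        if stop_requested:
            save_checkpoint_atomic(path, completed)
            return 99
        if completed % checkpoint_every == 0:
            save_checkpoint_atomic(path, completed)
    return 0


def main(path="checkpoints/state.chk"):
    """Install the handler and run; pass the result to sys.exit() in a real script."""
    signal.signal(signal.SIGTERM, request_stop)
    return run(path, start=0, total_steps=1000)
```

With periodic checkpoints, a job that is killed without warning loses at most `checkpoint_every` steps of work, at the cost of more frequent writes.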
R Example
Checkpointed, self-restarting job
The job script below restarts itself: when the application writes a checkpoint and exits with code 99, the script requeues the job. The accompanying minimal working example of a checkpointed R application follows further below.
#!/bin/bash
#SBATCH --job-name=r_checkpoint_demo
#SBATCH --partition=centos7
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append
#SBATCH --requeue
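# Send SIGTERM to the batch shell 120s before walltime. The "B:" prefix is
# needed: without it SLURM signals only the job steps, not this script.
#SBATCH --signal=B:TERM@120
set -euo pipefail
mkdir -p logs checkpoints
rm -f TERM.flag # clear any stale flag from an earlier run
# Trap SIGTERM from SLURM: create a file flag that R checks for
trap 'echo "SIGTERM received, creating TERM.flag"; touch TERM.flag' TERM
echo "Starting R checkpointing run at $(date)"
# load the R module
module load R/4.4.1
# Run the R script under srun in the background: bash defers a trap until a
# foreground command returns, whereas "wait" is interrupted immediately
srun Rscript checkpoint.R &
srun_pid=$!
# srun exit code
rc=0
wait "$srun_pid" || rc=$?
# If srun is still running, the trap interrupted the wait; wait again so R can finish its checkpoint
if kill -0 "$srun_pid" 2>/dev/null; then
    rc=0
    wait "$srun_pid" || rc=$?
fi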
echo "R exited with code: $rc"
if [[ $rc -eq 0 ]]; then
    echo "INFO: Finished successfully."
    exit 0
elif [[ $rc -eq 99 ]]; then
    echo "INFO: Checkpoint written (exit 99). Requeuing job..."
    rm -f TERM.flag
    scontrol requeue "$SLURM_JOB_ID"
    exit 0
else
    echo "ERROR: Unexpected failure (code $rc)."
    exit "$rc"
fi
Checkpointed application in R
The file checkpoint.R below is the accompanying minimal working example of a checkpointed R application.
#!/usr/bin/env Rscript
# Simple checkpointing/resume pattern for long runs in R.
# - Saves state as checkpoints/state.rds
# - Auto-resumes if that file exists
# - Periodically checkpoints every N iterations
# - If a TERM flag (created by the shell trap in the job script) is detected, saves and exits with code 99
checkpoint_file <- "checkpoints/state.rds"
term_flag <- "TERM.flag" # created by the shell trap
dir.create("checkpoints", showWarnings = FALSE, recursive = TRUE)
# --- Parameters you can tune ---
max_steps <- 1e6L
checkpoint_every_n <- 200L # save every N iterations
sleep_seconds <- 0.05 # simulate work
verbose <- TRUE
# --- Load or initialize state ---
state <- list(step = 0L, results = numeric())
if (file.exists(checkpoint_file)) {
  if (verbose) cat("Resuming from checkpoint:", checkpoint_file, "\n")
  state <- readRDS(checkpoint_file)
} else {
  if (verbose) cat("Starting fresh run\n")
}
# --- Utility: save checkpoint ---
save_checkpoint <- function(st) {
  saveRDS(st, checkpoint_file)
  if (verbose) {
    cat(sprintf("Checkpoint saved at step %d -> %s\n", st$step, checkpoint_file))
  }
}
# --- Main work loop ---
# seq_len() gives an empty sequence when all steps are already done;
# seq.int(state$step + 1L, max_steps) would count downwards in that case
for (i in state$step + seq_len(max_steps - state$step)) {
  state$step <- i
  # Simulate "work" (replace with your compute kernel)
  # e.g., update some running statistic
  x <- sin(i * 0.001) + rnorm(1, sd = 0.01)
  state$results <- c(state$results, x)
  if (sleep_seconds > 0) Sys.sleep(sleep_seconds)
  # Periodic checkpoint
  if ((i %% checkpoint_every_n) == 0L) {
    save_checkpoint(state)
  }
  # Respect pre-timeout signal from SLURM via a file flag
  if (file.exists(term_flag)) {
    cat("TERM flag detected. Saving final checkpoint and exiting with code 99.\n")
    save_checkpoint(state)
    quit(status = 99, save = "no")
  }
}
# Finished normally
save_checkpoint(state)
cat("Completed all steps. Exiting with code 0.\n")
quit(status = 0, save = "no")
