
Version 3 (modified by Carl Baribault, 2 days ago)


Job Checkpointing Examples

Python Example

Checkpointed, self restarting job

Below is a self-restarting batch job, followed by a minimal working example of the checkpointed Python application it launches.

This example loads anaconda3/2023.07, the latest Python module available in partition centos7.

#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --partition=centos7
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --mem=16G

# --- Logging ---
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append

# --- Enable automatic requeue ---
#SBATCH --requeue

# --- Send SIGTERM 2 minutes before walltime ---
#SBATCH --signal=TERM@120

set -euo pipefail

echo "Job started at $(date)"
echo "SLURM_JOB_ID = ${SLURM_JOB_ID}"
echo "SLURM_RESTART_COUNT = ${SLURM_RESTART_COUNT:-0}"

# ---------------------------------------------
# Application-specific configuration
# ---------------------------------------------
CHECKPOINT_DIR="$PWD/checkpoints"
CHECKPOINT_FILE="${CHECKPOINT_DIR}/state.chk"

mkdir -p "${CHECKPOINT_DIR}"

# ---------------------------------------------
# Launch the application
# ---------------------------------------------
# Your application must:
#  1) Load checkpoint if it exists
#  2) Catch SIGTERM
#  3) Write checkpoint
#  4) exit(99)

module load anaconda3/2023.07

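# Record the exit code without tripping "set -e": a bare srun that returns
# non-zero would abort the script before the restart logic below runs.
# (my_simulation.py must be executable: chmod +x my_simulation.py)
EXIT_CODE=0
srun ./my_simulation.py \
    --checkpoint "${CHECKPOINT_FILE}" || EXIT_CODE=$?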

echo "Application exited with code ${EXIT_CODE}"

# ---------------------------------------------
# Restart logic
# ---------------------------------------------
if [[ ${EXIT_CODE} -eq 0 ]]; then
    echo "INFO: Job completed successfully"
    exit 0

elif [[ ${EXIT_CODE} -eq 99 ]]; then
    echo "INFO: Checkpoint written, requeuing job"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0

else
    echo "ERROR: Job failed with unexpected exit code"
    exit ${EXIT_CODE}
fi
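
The exit-code convention used by the restart logic above can be tried locally, without Slurm. The following is a minimal Python sketch and not part of the job script: classify and run_once are hypothetical helper names, and the child process is only a stand-in for my_simulation.py that exits with code 99.

```python
import subprocess
import sys

REQUEUE_CODE = 99  # same "requeue me" convention as the batch script


def classify(exit_code):
    """Map an application exit code to the action the batch script takes."""
    if exit_code == 0:
        return "done"
    if exit_code == REQUEUE_CODE:
        return "requeue"
    return "fail"


def run_once(argv):
    """Run the application once and return (exit_code, action)."""
    completed = subprocess.run(argv)
    return completed.returncode, classify(completed.returncode)


# Hypothetical stand-in for my_simulation.py: a child that exits with 99.
child = [sys.executable, "-c", "import sys; sys.exit(99)"]
code, action = run_once(child)
print(f"exit code {code} -> {action}")
```

Any other non-zero code is treated as a real failure, which keeps a crashing application from being requeued indefinitely.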

Checkpointed application in Python

Here is the accompanying minimal working example of a checkpointed Python application, saved as my_simulation.py.

#!/usr/bin/env python3
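import argparse
import json
import os
import signal
import sys
import time

# The batch script passes --checkpoint; the default matches its CHECKPOINT_FILE.
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint", default="checkpoints/state.chk")
CHECKPOINT_FILE = parser.parse_args().checkpoint

def save_checkpoint(i):
    # Create the checkpoint directory if the path contains one.
    os.makedirs(os.path.dirname(CHECKPOINT_FILE) or ".", exist_ok=True)
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"step": i}, f)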

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r") as f:
            return json.load(f)["step"]
    return 0

def term_handler(signum, frame):
    print("SIGTERM received: saving checkpoint", flush=True)
    save_checkpoint(current_step)
    sys.exit(99)  # <- special "requeue me" code

# Load the state before installing the handler, so that current_step is
# always defined when the handler runs.
current_step = load_checkpoint()
print(f"Resuming from step {current_step}", flush=True)

signal.signal(signal.SIGTERM, term_handler)

for i in range(current_step, 1_000_000):
    current_step = i
    time.sleep(1)  # simulate work
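
The example above writes the checkpoint in place, so a job that is killed in the middle of a write could leave a truncated state file behind. One common way to reduce that risk is to write to a temporary file in the same directory and then rename it over the old checkpoint. The following is a minimal sketch of that idea, not part of the original example; save_checkpoint_atomic and the two-argument load_checkpoint are hypothetical names.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path, state):
    """Write state as JSON so that path never holds a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        # On any failure, remove the temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return default


# Short demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "checkpoints", "state.chk")
    save_checkpoint_atomic(demo_path, {"step": 42})
    print(load_checkpoint(demo_path, {"step": 0}))
```

With this approach, the previous checkpoint stays readable until the new one is completely written.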

R Example

Checkpointed, self restarting job

Below is a self-restarting batch job, followed by a minimal working example of the checkpointed R application it launches.

#!/bin/bash
#SBATCH --job-name=r_checkpoint_demo
#SBATCH --partition=centos7
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append
#SBATCH --requeue
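#SBATCH --signal=B:TERM@120   # send SIGTERM to the batch shell 120s before walltime

set -euo pipefail

mkdir -p logs checkpoints

# Remove any stale flag left over from an earlier run
rm -f TERM.flag

# Trap SIGTERM from SLURM: create a file flag that R checks for
trap 'echo "SIGTERM received, creating TERM.flag"; touch TERM.flag' TERM

echo "Starting R checkpointing run at $(date)"

# load the R module
module load R/4.4.1

# Run the R script under srun in the background. Bash defers traps while a
# foreground command runs, so a foreground srun would delay the flag.
srun Rscript checkpoint.R &
srun_pid=$!

# wait returns early (status > 128) when the trap fires; if srun is still
# running, wait again to collect its real exit code.
rc=0
wait "$srun_pid" || rc=$?
if [[ $rc -gt 128 ]] && kill -0 "$srun_pid" 2>/dev/null; then
  rc=0
  wait "$srun_pid" || rc=$?
fi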
echo "R exited with code: $rc"

if [[ $rc -eq 0 ]]; then
  echo "INFO: Finished successfully."
  exit 0
elif [[ $rc -eq 99 ]]; then
  echo "INFO: Checkpoint written (exit 99). Requeuing job..."
  rm -f TERM.flag
  scontrol requeue "$SLURM_JOB_ID"
  exit 0
else
  echo "ERROR: Unexpected failure (code $rc)."
  exit $rc
fi

Checkpointed application in R

Here is the accompanying minimal working example of a checkpointed R application, saved as checkpoint.R.

#!/usr/bin/env Rscript

# Simple checkpointing/resume pattern for long runs in R.
# - Saves state as checkpoints/state.rds
# - Auto-resumes if that file exists
# - Periodically checkpoints every N iterations
# - If a TERM flag (created by SLURM trap) is detected, saves and exits(99)

checkpoint_file <- "checkpoints/state.rds"
term_flag       <- "TERM.flag"        # created by the shell trap
dir.create("checkpoints", showWarnings = FALSE, recursive = TRUE)

# --- Parameters you can tune ---
max_steps            <- 1e6L
checkpoint_every_n   <- 200L     # save every N iterations
sleep_seconds        <- 0.05     # simulate work
verbose              <- TRUE

# --- Load or initialize state ---
state <- list(step = 0L, results = numeric())
if (file.exists(checkpoint_file)) {
  if (verbose) cat("Resuming from checkpoint:", checkpoint_file, "\n")
  state <- readRDS(checkpoint_file)
} else {
  if (verbose) cat("Starting fresh run\n")
}

# --- Utility: save checkpoint ---
save_checkpoint <- function(st) {
  saveRDS(st, checkpoint_file)
  if (verbose) {
    cat(sprintf("Checkpoint saved at step %d -> %s\n", st$step, checkpoint_file))
  }
}

# --- Main work loop ---
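# seq_len() yields an empty sequence when the checkpoint is already at
# max_steps; seq.int(a, b) would count downward instead.
for (i in state$step + seq_len(max_steps - state$step)) {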
  state$step <- i

  # Simulate "work" (replace with your compute kernel)
  # e.g., update some running statistic
  x <- sin(i * 0.001) + rnorm(1, sd = 0.01)
  state$results <- c(state$results, x)
  if (sleep_seconds > 0) Sys.sleep(sleep_seconds)

  # Periodic checkpoint
  if ((i %% checkpoint_every_n) == 0L) {
    save_checkpoint(state)
  }

  # Respect pre-timeout signal from SLURM via a file flag
  if (file.exists(term_flag)) {
    cat("TERM flag detected. Saving final checkpoint and exiting with code 99.\n")
    save_checkpoint(state)
    quit(status = 99, save = "no")
  }
}

# Finished normally
save_checkpoint(state)
cat("Completed all steps. Exiting with code 0.\n")
quit(status = 0, save = "no")
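
The R script above does not handle the signal itself; it polls for a flag once per iteration and checkpoints at a point where its state is consistent. The same cooperative pattern can be sketched in Python with an in-process flag instead of a file. This is a minimal illustration under stated assumptions, not part of the original examples: StopFlag and run are hypothetical names, and the save callback stands in for a real checkpoint writer.

```python
import signal


class StopFlag:
    """Records that a termination signal arrived; the main loop polls it."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Do as little as possible here: only record the request.
        self.requested = True


def run(start_step, max_steps, stop, save):
    """Advance from start_step to max_steps; return 99 if asked to stop early."""
    step = start_step
    while step < max_steps:
        step += 1
        # ... one unit of work for this step would go here ...
        if stop.requested:
            save(step)   # checkpoint between steps, where state is consistent
            return 99    # same "requeue me" code as the batch scripts
    save(step)
    return 0


if __name__ == "__main__":
    stop = StopFlag()
    signal.signal(signal.SIGTERM, stop.handler)
    code = run(0, 1000, stop, save=lambda s: print(f"checkpoint at step {s}"))
    print(f"would exit with code {code}")
```

Compared with saving inside the signal handler, polling means the checkpoint is never written while a step is half finished, at the cost of waiting for the current step to complete.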