Version 2 (modified by Carl Baribault, 7 weeks ago)
Filled in Python & R examples

Job Checkpointing Examples (Under construction)
1. Python Example
  1. Checkpointed, self-restarting job
  2. Checkpointed application in Python
2. R Example
  1. Checkpointed, self-restarting job
  2. Checkpointed application in R

(content subject to change prior to the workshop)

Python Example

Checkpointed, self-restarting job

Here is a fully self-restarting job script. The accompanying minimal working example of a checkpointed application in Python follows further below.

The script loads anaconda3/2023.07, the latest Python module available in partition centos7.

#!/bin/bash
#SBATCH --job-name=checkpoint_example
#SBATCH --partition=centos7
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --mem=16G

# --- Logging ---
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append

# --- Enable automatic requeue ---
#SBATCH --requeue

# --- Send SIGTERM 2 minutes before walltime ---
#SBATCH --signal=TERM@120

set -euo pipefail

echo "Job started at $(date)"
echo "SLURM_JOB_ID = ${SLURM_JOB_ID}"
echo "SLURM_RESTART_COUNT = ${SLURM_RESTART_COUNT:-0}"

# ---------------------------------------------
# Application-specific configuration
# ---------------------------------------------
CHECKPOINT_DIR="$PWD/checkpoints"
CHECKPOINT_FILE="${CHECKPOINT_DIR}/state.chk"

mkdir -p "${CHECKPOINT_DIR}"

# ---------------------------------------------
# Launch the application
# ---------------------------------------------
# Your application must:
#  1) Load checkpoint if it exists
#  2) Catch SIGTERM
#  3) Write checkpoint
#  4) exit(99)

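module load anaconda3/2023.07

# With "set -e" active, a bare non-zero exit from srun would abort this script
# before the restart logic below runs, so capture the exit code with "||".
# my_simulation.py must be executable (chmod +x my_simulation.py).
EXIT_CODE=0
srun ./my_simulation.py \
    --checkpoint "${CHECKPOINT_FILE}" || EXIT_CODE=$?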

echo "Application exited with code ${EXIT_CODE}"

# ---------------------------------------------
# Restart logic
# ---------------------------------------------
if [[ ${EXIT_CODE} -eq 0 ]]; then
    echo "INFO: Job completed successfully"
    exit 0

elif [[ ${EXIT_CODE} -eq 99 ]]; then
    echo "INFO: Checkpoint written, requeuing job"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0

else
    echo "ERROR: Job failed with unexpected exit code"
    exit ${EXIT_CODE}
fi
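
The contract listed in the job script comments (catch SIGTERM, write a checkpoint, exit with code 99) can be checked on a workstation before submitting anything to the scheduler. The following is a minimal sketch, not part of the workshop material: CHILD_CODE is a hypothetical stand-in for the real application, and exit_code_after_sigterm is an illustrative helper name. It starts the child, waits until the handler is installed, sends SIGTERM, and reports the exit code the job script would see.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a checkpointing application: it installs a
# SIGTERM handler, reports readiness, then idles until it is signaled.
CHILD_CODE = """
import signal, sys, time

def on_term(signum, frame):
    # A real application would write its checkpoint here.
    sys.exit(99)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def exit_code_after_sigterm(code: str = CHILD_CODE, timeout: float = 10.0) -> int:
    """Start the child, send SIGTERM once it is ready, return its exit code."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        # Block until the child has printed "ready", i.e. the handler is installed.
        proc.stdout.readline()
        proc.send_signal(signal.SIGTERM)
        return proc.wait(timeout=timeout)
    finally:
        # Clean up if the child ignored the signal or the wait timed out.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print("exit code:", exit_code_after_sigterm())
```

To test a real application the same way, replace the `[sys.executable, "-c", code]` command with the application's own command line and have the application print a readiness line after it installs its handler.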

Checkpointed application in Python

Here is an accompanying, minimal working example of a checkpointed application for Python in file my_simulation.py.

#!/usr/bin/env python3
import argparse
import json
import os
import signal
import sys
import time

# Accept the checkpoint path passed by the job script (--checkpoint)
parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint", default="checkpoints/state.chk")
CHECKPOINT_FILE = parser.parse_args().checkpoint

def save_checkpoint(i):
    os.makedirs(os.path.dirname(CHECKPOINT_FILE) or ".", exist_ok=True)
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"step": i}, f)

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "r") as f:
            return json.load(f)["step"]
    return 0

def term_handler(signum, frame):
    print("SIGTERM received, saving checkpoint", flush=True)
    save_checkpoint(current_step)
    sys.exit(99)  # <- special "requeue me" code

# Load state before installing the handler so current_step is always defined
current_step = load_checkpoint()
print(f"Resuming from step {current_step}", flush=True)

signal.signal(signal.SIGTERM, term_handler)

for i in range(current_step, 1_000_000):
    current_step = i
    time.sleep(1)  # simulate work
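
The example above writes its checkpoint inside the signal handler. A common variation, similar in spirit to the file flag polled by the R example, is to let the handler only record the request and to checkpoint at a safe point in the loop. The sketch below also writes the checkpoint to a temporary file and renames it, so that an interrupted write does not leave a truncated state file. The names save_checkpoint_atomic, TermFlag, run, and main are illustrative assumptions, not part of the workshop code.

```python
import json
import os
import signal
import tempfile
import time


def save_checkpoint_atomic(path: str, state: dict) -> None:
    """Write state to a temp file in the target directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # The rename replaces the old checkpoint in a single step on POSIX
        # systems when both paths are on the same file system.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


class TermFlag:
    """Records that SIGTERM arrived; the main loop polls `requested`."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        self.requested = True


def run(path: str, total_steps: int, flag: TermFlag, work=lambda step: None) -> int:
    """Return 0 when all steps finish, 99 after an interrupted run is saved."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        work(step)
        state["step"] = step + 1  # this step is complete
        if flag.requested:
            save_checkpoint_atomic(path, state)
            return 99
    save_checkpoint_atomic(path, state)
    return 0


def main() -> int:
    flag = TermFlag()
    signal.signal(signal.SIGTERM, flag.handle)  # must run in the main thread
    return run("checkpoints/state.chk", 1_000_000, flag, lambda step: time.sleep(1))
```

A script built on this sketch would end with `sys.exit(main())` so that the job script sees 0 or 99. The trade-off is that the checkpoint is written only after the current step finishes, so a single step must take well under the 120 seconds granted by --signal.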

R Example

Checkpointed, self-restarting job

Here is a fully self-restarting job script. The accompanying minimal working example of a checkpointed application in R follows further below.

#!/bin/bash
#SBATCH --job-name=r_checkpoint_demo
#SBATCH --partition=centos7
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=output.%j.out
#SBATCH --error=error.%j.err
#SBATCH --open-mode=append
#SBATCH --requeue
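#SBATCH --signal=B:TERM@120   # send SIGTERM to the batch shell 120s before walltime

set -euo pipefail

mkdir -p logs checkpoints
rm -f TERM.flag   # clear any stale flag left by an earlier run

# Trap SIGTERM from SLURM: create a file flag that R checks for.
# "B:" above delivers the signal to this shell instead of the job step,
# where the default SIGTERM action would stop R before it could checkpoint.
trap 'echo "SIGTERM received, creating TERM.flag"; touch TERM.flag' TERM

echo "Starting R checkpointing run at $(date)"

# load the R module
module load R/4.4.1

# Run the R script under srun in the background: bash runs a trap only after
# a foreground command finishes, whereas "wait" is interrupted immediately.
srun Rscript checkpoint.R &
srun_pid=$!

# srun exit code ("|| rc=$?" keeps "set -e" from aborting on a non-zero code)
rc=0
wait "$srun_pid" || rc=$?
if [[ -f TERM.flag ]]; then
  # wait was interrupted by the trap; wait again for the real exit code
  rc=0
  wait "$srun_pid" || rc=$?
fi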
echo "R exited with code: $rc"

if [[ $rc -eq 0 ]]; then
  echo "INFO: Finished successfully."
  exit 0
elif [[ $rc -eq 99 ]]; then
  echo "INFO: Checkpoint written (exit 99). Requeuing job..."
  rm -f TERM.flag
  scontrol requeue "$SLURM_JOB_ID"
  exit 0
else
  echo "ERROR: Unexpected failure (code $rc)."
  exit $rc
fi

Checkpointed application in R

Here is an accompanying, minimal working example of a checkpointed application for R in file checkpoint.R.

#!/usr/bin/env Rscript

# Simple checkpointing/resume pattern for long runs in R.
# - Saves state as checkpoints/state.rds
# - Auto-resumes if that file exists
# - Periodically checkpoints every N iterations
# - If a TERM flag (created by the shell trap in the job script) is detected, saves and exits with code 99

checkpoint_file <- "checkpoints/state.rds"
term_flag       <- "TERM.flag"        # created by the shell trap
dir.create("checkpoints", showWarnings = FALSE, recursive = TRUE)

# --- Parameters you can tune ---
max_steps            <- 1e6L
checkpoint_every_n   <- 200L     # save every N iterations
sleep_seconds        <- 0.05     # simulate work
verbose              <- TRUE

# --- Load or initialize state ---
state <- list(step = 0L, results = numeric())
if (file.exists(checkpoint_file)) {
  if (verbose) cat("Resuming from checkpoint:", checkpoint_file, "\n")
  state <- readRDS(checkpoint_file)
} else {
  if (verbose) cat("Starting fresh run\n")
}

# --- Utility: save checkpoint ---
save_checkpoint <- function(st) {
  saveRDS(st, checkpoint_file)
  if (verbose) {
    cat(sprintf("Checkpoint saved at step %d -> %s\n", st$step, checkpoint_file))
  }
}

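# --- Main work loop ---
# seq_len() yields an empty sequence when the checkpoint is already at
# max_steps; seq.int(a, b) would count downward if a > b.
for (i in state$step + seq_len(max_steps - state$step)) {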
  state$step <- i

  # Simulate "work" (replace with your compute kernel)
  # e.g., update some running statistic
  x <- sin(i * 0.001) + rnorm(1, sd = 0.01)
  state$results <- c(state$results, x)
  if (sleep_seconds > 0) Sys.sleep(sleep_seconds)

  # Periodic checkpoint
  if ((i %% checkpoint_every_n) == 0L) {
    save_checkpoint(state)
  }

  # Respect pre-timeout signal from SLURM via a file flag
  if (file.exists(term_flag)) {
    cat("TERM flag detected. Saving final checkpoint and exiting with code 99.\n")
    save_checkpoint(state)
    quit(status = 99, save = "no")
  }
}

# Finished normally
save_checkpoint(state)
cat("Completed all steps. Exiting with code 0.\n")
quit(status = 0, save = "no")
