BASH Checkpointing Example

Checkpoint Runner

See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.

BASH Checkpointing Application

checkpoint_signal_iter.sh

#!/usr/bin/env bash
# checkpoint_signal_iter.sh
set -euo pipefail

CKPT="${CKPT_PATH:-state_iter.txt}"
EVERY="${CHECKPOINT_EVERY:-20}"
MAX_ITER="${MAX_ITER:-500}"
i=0

save() {
  # atomic write: tmp + mv
  local val="$1"
  local dir; dir="$(dirname "$(readlink -f "$CKPT")")"
  mkdir -p "$dir"
  local tmp; tmp="$(mktemp -p "$dir" .ckpt.XXXXXX)"
  printf '%s\n' "$val" > "$tmp"
  # Optional: uncomment to flush pending writes to disk before the rename.
  # Plain sync flushes everything; sync on coreutils 8.4 has no -d option.
  # sync
  mv -f "$tmp" "$CKPT"
}

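# --- load checkpoint if present ---
if [[ -f "$CKPT" ]]; then
  # Keep digits only, so stray whitespace or junk cannot break the arithmetic below
  saved="$(tr -cd '0-9' < "$CKPT")"
  if [[ -n "$saved" ]]; then
    i=$(( 10#$saved ))   # force base 10 so leading zeros are not read as octal
  fi
fi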

# --- handle TERM: save current i and exit(99) to trigger requeue ---
term_handler() {
  echo "Bash SIGTERM: saving i=${i} and exiting 99"
  save "$i"
  exit 99
}
trap 'term_handler' TERM

echo "Resuming from i=${i} (every ${EVERY}, MAX_ITER=${MAX_ITER})"
while true; do
  i=$(( i + 1 ))
  sleep 1

  if (( i % EVERY == 0 )); then
    save "$i"
    echo "[periodic/iter] saved i=${i}"
  fi

  if (( i > MAX_ITER )); then
    echo "Reached i=${i} > ${MAX_ITER}; exiting 0"
    save "$i"
    exit 0
  fi
done
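
The trap-and-save pattern used by the script can be exercised locally, without Slurm. The following is a minimal sketch under stated assumptions, not the workshop's script: the names demo_dir, ckpt, worker, and on_term are hypothetical, and the 0.2 s sleep is chosen only to keep the check short. It assumes GNU mktemp (for -p) and a sleep that accepts fractional seconds.

```shell
#!/usr/bin/env bash
# Local smoke test of the trap-and-save pattern (illustrative names only).
set -euo pipefail

demo_dir="$(mktemp -d)"
ckpt="$demo_dir/state.txt"

worker() {
  i=0
  on_term() {
    # Atomic write: temp file in the same directory, then rename over the target
    local tmp
    tmp="$(mktemp -p "$demo_dir" .ckpt.XXXXXX)"
    printf '%s\n' "$i" > "$tmp"
    mv -f "$tmp" "$ckpt"
    exit 99
  }
  trap on_term TERM
  while true; do
    i=$(( i + 1 ))
    sleep 0.2
  done
}

worker &                 # run the loop in a background subshell
pid=$!
sleep 1                  # let it iterate a few times
kill -TERM "$pid"        # simulate the scheduler's TERM

rc=0
wait "$pid" || rc=$?     # collect the worker's exit code without tripping set -e
saved="$(cat "$ckpt")"
echo "worker exit code: $rc"
echo "checkpoint holds: $saved"
rm -rf "$demo_dir"
```

The worker should exit with code 99 and leave a small positive integer in the checkpoint file. The exact value depends on timing.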

Running BASH checkpointing example on Cypress

To run the BASH checkpointing example with the default settings (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.

  1. Create or edit the files checkpoint_runner.sh and checkpoint_signal_iter.sh in your current directory, and make sure checkpoint_signal_iter.sh is executable (chmod +x checkpoint_signal_iter.sh), since APP_CMD invokes it as ./checkpoint_signal_iter.sh. For file editing with nano, etc., see File Editing Example.
  2. Submit the job via the following command.
[tulaneID@cypress1 ~]$ APP_CMD="./checkpoint_signal_iter.sh" CKPT_PATH=state_iter_bash.txt sbatch checkpoint_runner.sh
  3. Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
  4. Here are normal results for the error and output files, log_<jobID>.err and log_<jobID>.out. Note that the job was cancelled and requeued four times before running to completion.
[tulaneID@cypress1 ~]$ cat log_3300698.err
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:20:11 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:22:27 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:24:57 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:27:27 ***

[tulaneID@cypress1 ~]$ cat log_3300698.out
Info[20260313-22:18:11]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=0
Info[20260313-22:18:11]: Settings:
Info[20260313-22:18:11]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:11]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:18:11]: LAUNCH_MODE=direct
Info[20260313-22:18:11]: SRUN_ARGS=-n 1
Info[20260313-22:18:11]: TIME_LIMIT=00:03:00
Info[20260313-22:18:11]: MARGIN_SEC=60
Info[20260313-22:18:11]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:18:11]: CHECKPOINT_EVERY=20
Info[20260313-22:18:11]: MAX_ITER=500
Info[20260313-22:18:11]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:10 EndTime=2026-03-13T22:21:10
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
Bash SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:11]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:11]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
Info[20260313-22:20:11]: Requeued via scontrol.
Info[20260313-22:20:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=1
Info[20260313-22:20:27]: Settings:
Info[20260313-22:20:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:20:27]: LAUNCH_MODE=direct
Info[20260313-22:20:27]: SRUN_ARGS=-n 1
Info[20260313-22:20:27]: TIME_LIMIT=00:03:00
Info[20260313-22:20:27]: MARGIN_SEC=60
Info[20260313-22:20:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:20:27]: CHECKPOINT_EVERY=20
Info[20260313-22:20:27]: MAX_ITER=500
Info[20260313-22:20:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:27 EndTime=2026-03-13T22:23:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (every 20, MAX_ITER=500)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
Bash SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:27]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
Info[20260313-22:22:27]: Requeued via scontrol.
Info[20260313-22:22:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=2
Info[20260313-22:22:57]: Settings:
Info[20260313-22:22:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:22:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:22:57]: LAUNCH_MODE=direct
Info[20260313-22:22:57]: SRUN_ARGS=-n 1
Info[20260313-22:22:57]: TIME_LIMIT=00:03:00
Info[20260313-22:22:57]: MARGIN_SEC=60
Info[20260313-22:22:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:22:57]: CHECKPOINT_EVERY=20
Info[20260313-22:22:57]: MAX_ITER=500
Info[20260313-22:22:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:22:57 EndTime=2026-03-13T22:25:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (every 20, MAX_ITER=500)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
Bash SIGTERM: saving i=360 and exiting 99
Info[20260313-22:24:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:24:57]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
Info[20260313-22:24:57]: Requeued via scontrol.
Info[20260313-22:25:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=3
Info[20260313-22:25:27]: Settings:
Info[20260313-22:25:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:25:27]: LAUNCH_MODE=direct
Info[20260313-22:25:27]: SRUN_ARGS=-n 1
Info[20260313-22:25:27]: TIME_LIMIT=00:03:00
Info[20260313-22:25:27]: MARGIN_SEC=60
Info[20260313-22:25:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:25:27]: CHECKPOINT_EVERY=20
Info[20260313-22:25:27]: MAX_ITER=500
Info[20260313-22:25:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:27 EndTime=2026-03-13T22:28:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (every 20, MAX_ITER=500)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
Bash SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:27]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
Info[20260313-22:27:27]: Requeued via scontrol.
Info[20260313-22:27:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=4
Info[20260313-22:27:57]: Settings:
Info[20260313-22:27:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:27:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:27:57]: LAUNCH_MODE=direct
Info[20260313-22:27:57]: SRUN_ARGS=-n 1
Info[20260313-22:27:57]: TIME_LIMIT=00:03:00
Info[20260313-22:27:57]: MARGIN_SEC=60
Info[20260313-22:27:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:27:57]: CHECKPOINT_EVERY=20
Info[20260313-22:27:57]: MAX_ITER=500
Info[20260313-22:27:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:27:57 EndTime=2026-03-13T22:30:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (every 20, MAX_ITER=500)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:18]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:18]: Completed.
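
The number of restarts in the log above can be checked with a little arithmetic. This sketch assumes, based on the logged timestamps rather than on the runner's source, that TERM arrives roughly TIME_LIMIT minus MARGIN_SEC seconds after each start, and that each iteration takes about one second because of the sleep 1 in the loop. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Back-of-envelope estimate of runs and restarts for the logged job.
# Assumption: TERM is delivered at TIME_LIMIT - MARGIN_SEC; ~1 s per iteration.
set -euo pipefail

time_limit_sec=180   # TIME_LIMIT=00:03:00
margin_sec=60        # MARGIN_SEC=60
max_iter=500         # MAX_ITER=500
sec_per_iter=1       # from "sleep 1" in the loop

# Iterations completed before TERM arrives in each run
iters_per_run=$(( (time_limit_sec - margin_sec) / sec_per_iter ))

# The loop exits once i > MAX_ITER, i.e. after MAX_ITER + 1 iterations
total_iters=$(( max_iter + 1 ))

# Ceiling division: number of runs needed to cover all iterations
runs=$(( (total_iters + iters_per_run - 1) / iters_per_run ))
restarts=$(( runs - 1 ))

echo "iterations per run: $iters_per_run"
echo "runs needed: $runs (restarts: $restarts)"
```

Under these assumptions the estimate is 120 iterations per run and 4 restarts, which agrees with the checkpoints at i=120, 240, 360, and 480 and with RESTARTS=4 in the final job snapshot, and stays below MAX_RESTARTS=10.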
Last modified on 03/13/2026 11:58:35 PM