BASH Checkpointing Example
Checkpoint Runner
See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.
BASH Checkpointing Application
checkpoint_signal_iter.sh
#!/usr/bin/env bash
# checkpoint_signal_iter.sh
set -euo pipefail
CKPT="${CKPT_PATH:-state_iter.txt}"
EVERY="${CHECKPOINT_EVERY:-20}"
MAX_ITER="${MAX_ITER:-500}"
i=0
save() {
    # atomic write: tmp + mv
    local val="$1"
    local dir; dir="$(dirname "$(readlink -f "$CKPT")")"
    mkdir -p "$dir"
    local tmp; tmp="$(mktemp -p "$dir" .ckpt.XXXXXX)"
    printf '%s\n' "$val" > "$tmp"
    # Optional (broad) flush for older coreutils: uncomment if you want it
    # sync # (no -d on coreutils 8.4)
    mv -f "$tmp" "$CKPT"
}
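# --- load checkpoint if present ---
if [[ -f "$CKPT" ]]; then
    # Strip line endings, then accept the value only if it is a plain
    # non-negative integer; otherwise keep i=0.
    saved="$(tr -d '\r\n' < "$CKPT" 2>/dev/null || true)"
    if [[ "$saved" =~ ^[0-9]+$ ]]; then
        i=$(( 10#$saved ))  # force base 10 so leading zeros are not read as octal
    fi
fi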
# --- handle TERM: save current i and exit(99) to trigger requeue ---
term_handler() {
    echo "Bash SIGTERM: saving i=${i} and exiting 99"
    save "$i"
    exit 99
}
trap 'term_handler' TERM
echo "Resuming from i=${i} (every ${EVERY}, MAX_ITER=${MAX_ITER})"
while true; do
    i=$(( i + 1 ))
    sleep 1
    # periodic checkpoint every EVERY iterations
    if (( i % EVERY == 0 )); then
        save "$i"
        echo "[periodic/iter] saved i=${i}"
    fi
    # final checkpoint and normal exit once past MAX_ITER
    if (( i > MAX_ITER )); then
        echo "Reached i=${i} > ${MAX_ITER}; exiting 0"
        save "$i"
        exit 0
    fi
done
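The signal-triggered checkpoint can be tried locally before submitting a job. The following is a minimal sketch, not part of the example itself: the names worker, on_term, workdir and ckpt are hypothetical, and the loop sleeps 0.2 seconds instead of 1 second to keep the test short. It starts a background worker with the same shape as the script above, sends it SIGTERM, and then checks the exit code and the saved counter.

```shell
#!/usr/bin/env bash
# Hypothetical local smoke test; all names and paths here are illustrative.
workdir="$(mktemp -d)"
ckpt="$workdir/state.txt"

worker() {
    i=0
    # On TERM: write the counter to a temp file, rename it into place
    # (atomic on the same filesystem), and exit with code 99.
    on_term() {
        local tmp
        tmp="$(mktemp "$workdir/.ckpt.XXXXXX")"
        printf '%s\n' "$i" > "$tmp"
        mv -f "$tmp" "$ckpt"
        exit 99
    }
    trap on_term TERM
    while true; do
        i=$(( i + 1 ))
        sleep 0.2
    done
}

worker &            # run the worker in a background subshell
pid=$!
sleep 1             # let it complete a few iterations
kill -TERM "$pid"   # same signal the job receives before its time limit

rc=0
wait "$pid" || rc=$?
saved="$(cat "$ckpt")"
echo "worker exit code: ${rc}; checkpointed i=${saved}"
rm -rf "$workdir"
```

Bash runs the trap only after the current foreground command (here, sleep) returns, so the save can lag the signal by up to one iteration.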
Running BASH checkpointing example on Cypress
To run the BASH checkpointing job example with the defaults (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.
- Edit the files checkpoint_runner.sh and checkpoint_signal_iter.sh in your current directory. For file editing with nano, etc., see File Editing Example.
- Submit the job via the following command.
[tulaneID@cypress1 ~]$ APP_CMD="./checkpoint_signal_iter.sh" CKPT_PATH=state_iter_bash.txt sbatch checkpoint_runner.sh
- Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
- The following are typical contents of the error and output files, log_<jobID>.err and log_<jobID>.out. In this run the job was cancelled and requeued four times, then completed on its fifth run.
[tulaneID@cypress1 ~]$ cat log_3300698.err
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:20:11 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:22:27 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:24:57 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:27:27 ***
[tulaneID@cypress1 ~]$ cat log_3300698.out
Info[20260313-22:18:11]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=0
Info[20260313-22:18:11]: Settings:
Info[20260313-22:18:11]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:11]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:18:11]: LAUNCH_MODE=direct
Info[20260313-22:18:11]: SRUN_ARGS=-n 1
Info[20260313-22:18:11]: TIME_LIMIT=00:03:00
Info[20260313-22:18:11]: MARGIN_SEC=60
Info[20260313-22:18:11]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:18:11]: CHECKPOINT_EVERY=20
Info[20260313-22:18:11]: MAX_ITER=500
Info[20260313-22:18:11]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:18:10 EndTime=2026-03-13T22:21:10
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
Bash SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:11]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:11]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
Info[20260313-22:20:11]: Requeued via scontrol.
Info[20260313-22:20:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=1
Info[20260313-22:20:27]: Settings:
Info[20260313-22:20:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:20:27]: LAUNCH_MODE=direct
Info[20260313-22:20:27]: SRUN_ARGS=-n 1
Info[20260313-22:20:27]: TIME_LIMIT=00:03:00
Info[20260313-22:20:27]: MARGIN_SEC=60
Info[20260313-22:20:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:20:27]: CHECKPOINT_EVERY=20
Info[20260313-22:20:27]: MAX_ITER=500
Info[20260313-22:20:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:20:27 EndTime=2026-03-13T22:23:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (every 20, MAX_ITER=500)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
Bash SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:27]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
Info[20260313-22:22:27]: Requeued via scontrol.
Info[20260313-22:22:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=2
Info[20260313-22:22:57]: Settings:
Info[20260313-22:22:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:22:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:22:57]: LAUNCH_MODE=direct
Info[20260313-22:22:57]: SRUN_ARGS=-n 1
Info[20260313-22:22:57]: TIME_LIMIT=00:03:00
Info[20260313-22:22:57]: MARGIN_SEC=60
Info[20260313-22:22:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:22:57]: CHECKPOINT_EVERY=20
Info[20260313-22:22:57]: MAX_ITER=500
Info[20260313-22:22:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:22:57 EndTime=2026-03-13T22:25:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (every 20, MAX_ITER=500)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
Bash SIGTERM: saving i=360 and exiting 99
Info[20260313-22:24:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:24:57]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
Info[20260313-22:24:57]: Requeued via scontrol.
Info[20260313-22:25:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=3
Info[20260313-22:25:27]: Settings:
Info[20260313-22:25:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:25:27]: LAUNCH_MODE=direct
Info[20260313-22:25:27]: SRUN_ARGS=-n 1
Info[20260313-22:25:27]: TIME_LIMIT=00:03:00
Info[20260313-22:25:27]: MARGIN_SEC=60
Info[20260313-22:25:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:25:27]: CHECKPOINT_EVERY=20
Info[20260313-22:25:27]: MAX_ITER=500
Info[20260313-22:25:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:25:27 EndTime=2026-03-13T22:28:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (every 20, MAX_ITER=500)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
Bash SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:27]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
Info[20260313-22:27:27]: Requeued via scontrol.
Info[20260313-22:27:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=4
Info[20260313-22:27:57]: Settings:
Info[20260313-22:27:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:27:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:27:57]: LAUNCH_MODE=direct
Info[20260313-22:27:57]: SRUN_ARGS=-n 1
Info[20260313-22:27:57]: TIME_LIMIT=00:03:00
Info[20260313-22:27:57]: MARGIN_SEC=60
Info[20260313-22:27:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:27:57]: CHECKPOINT_EVERY=20
Info[20260313-22:27:57]: MAX_ITER=500
Info[20260313-22:27:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:27:57 EndTime=2026-03-13T22:30:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (every 20, MAX_ITER=500)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:18]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:18]: Completed.
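The number of requeues in the log can be estimated in advance. The sketch below assumes, based on the timestamps above rather than on the runner's source, that the runner sends SIGTERM at TIME_LIMIT minus MARGIN_SEC (180 - 60 = 120 seconds after start) and that each iteration takes about 1 second. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate; values are taken from the Settings lines in the log.
time_limit_sec=180   # TIME_LIMIT=00:03:00
margin_sec=60        # MARGIN_SEC=60
sec_per_iter=1       # the application sleeps 1 s per iteration
max_iter=500         # MAX_ITER=500

# Assumption: TERM arrives after (time limit - margin) seconds of run time.
budget_sec=$(( time_limit_sec - margin_sec ))
iters_per_run=$(( budget_sec / sec_per_iter ))

# The loop exits when i > MAX_ITER, i.e. after MAX_ITER + 1 iterations.
total_iters=$(( max_iter + 1 ))

# Integer ceiling division: runs needed to cover all iterations.
runs=$(( (total_iters + iters_per_run - 1) / iters_per_run ))
requeues=$(( runs - 1 ))

echo "iterations per run: ${iters_per_run}; runs: ${runs}; requeues: ${requeues}"
```

With these values the estimate is 120 iterations per run, 5 runs and 4 requeues, which matches RESTARTS=0 through RESTARTS=4 in the output and stays under MAX_RESTARTS=10.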
Last modified on 03/13/2026 11:58:35 PM
