[[PageOutline]]
= BASH Checkpointing Example =

== Checkpoint Runner ==

See [wiki:Workshops/JobCheckpointing/Examples#CheckpointRunner Checkpoint Runner] for the contents of the job script file '''checkpoint_runner.sh'''.

== BASH Checkpointing Application ==

=== checkpoint_signal_iter.sh ===

{{{
#!/usr/bin/env bash
# checkpoint_signal_iter.sh
set -euo pipefail

CKPT="${CKPT_PATH:-state_iter.txt}"
EVERY="${CHECKPOINT_EVERY:-20}"
MAX_ITER="${MAX_ITER:-500}"
i=0

save() {
  # atomic write: tmp + mv
  local val="$1"
  local dir; dir="$(dirname "$(readlink -f "$CKPT")")"
  mkdir -p "$dir"
  local tmp; tmp="$(mktemp -p "$dir" .ckpt.XXXXXX)"
  printf '%s\n' "$val" > "$tmp"
  # Optional (broad) flush for older coreutils: uncomment if you want it
  # sync   # (no -d on coreutils 8.4)
  mv -f "$tmp" "$CKPT"
}

# --- load checkpoint if present and it holds a plain non-negative integer ---
if [[ -f "$CKPT" ]]; then
  saved="$(tr -d '[:space:]' 2>/dev/null < "$CKPT" || true)"
  if [[ "$saved" =~ ^[0-9]+$ ]]; then
    i=$(( 10#$saved ))   # 10# forces base 10 even with leading zeros
  fi
fi

# --- handle TERM: save current i and exit(99) to trigger requeue ---
term_handler() {
  echo "Bash SIGTERM: saving i=${i} and exiting 99"
  save "$i"
  exit 99
}
trap 'term_handler' TERM

echo "Resuming from i=${i} (every ${EVERY}, MAX_ITER=${MAX_ITER})"
while true; do
  i=$(( i + 1 ))
  sleep 1   # stand-in for one iteration of real work

  if (( i % EVERY == 0 )); then
    save "$i"
    echo "[periodic/iter] saved i=${i}"
  fi

  if (( i > MAX_ITER )); then
    echo "Reached i=${i} > ${MAX_ITER}; exiting 0"
    save "$i"
    exit 0
  fi
done
}}}
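
The script above relies on two ideas: an atomic checkpoint write (temporary file plus `mv`) and a validated load. Both can be tried on their own, without Slurm. The following is a minimal sketch rather than part of the workshop files: the helper names `save_ckpt` and `load_ckpt` and the scratch directory are hypothetical, and a `mktemp` that accepts a template path is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: save_ckpt/load_ckpt are hypothetical helper names.
set -euo pipefail

# Write value $2 to checkpoint file $1 atomically: fill a temporary file
# in the same directory, then rename it over the target.
save_ckpt() {
  local path="$1" val="$2" dir tmp
  dir="$(dirname "$path")"
  mkdir -p "$dir"
  tmp="$(mktemp "$dir/.ckpt.XXXXXX")"
  printf '%s\n' "$val" > "$tmp"
  mv -f "$tmp" "$path"
}

# Print the saved counter, or 0 if the file is missing or does not hold
# a plain non-negative integer.
load_ckpt() {
  local path="$1" raw=""
  if [[ -f "$path" ]]; then
    raw="$(tr -d '[:space:]' < "$path")"
  fi
  if [[ "$raw" =~ ^[0-9]+$ ]]; then
    printf '%s\n' "$(( 10#$raw ))"
  else
    printf '0\n'
  fi
}

demo_dir="$(mktemp -d)"
ckpt="$demo_dir/state.txt"

echo "missing: $(load_ckpt "$ckpt")"
save_ckpt "$ckpt" 120
echo "saved: $(load_ckpt "$ckpt")"
printf 'garbage\n' > "$ckpt"
echo "corrupt: $(load_ckpt "$ckpt")"
rm -rf "$demo_dir"
```

Because the temporary file is created in the same directory as the checkpoint, the final `mv` is a rename within one file system, so a reader sees either the old value or the new one, never a partially written file.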

== Running BASH checkpointing example on Cypress ==

To run the BASH checkpointing job example with its default settings (a checkpoint every 20 application iterations, stopping once the iteration counter exceeds 500), perform the following steps.

1. Using a text editor, create the files '''checkpoint_runner.sh''' and '''checkpoint_signal_iter.sh''' in your current directory with the contents referenced above. Because the job invokes the application as `./checkpoint_signal_iter.sh`, that file must be executable (`chmod +x checkpoint_signal_iter.sh`).
For file editing with nano, etc., see [[https://wiki.hpc.tulane.edu/trac/wiki/cypress/FileEditingSoftware/Example|File Editing Example]].

2. Submit the job via the following command.

{{{
[tulaneID@cypress1 ~]$ APP_CMD="./checkpoint_signal_iter.sh" CKPT_PATH=state_iter_bash.txt sbatch checkpoint_runner.sh
}}}

3. Monitor the job's output via the following command, substituting the job ID for <jobID>.

{{{
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
}}}

4. Compare your results with the following typical contents of the error and output files, '''log_<jobID>.err''' and '''log_<jobID>.out'''. In this run the job was terminated and requeued four times before it completed.

{{{
[tulaneID@cypress1 ~]$ cat log_3300698.err
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:20:11 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:22:27 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:24:57 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:27:27 ***
}}}

{{{
[tulaneID@cypress1 ~]$ cat log_3300698.groupID.out
Info[20260313-22:18:11]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=0
Info[20260313-22:18:11]: Settings:
Info[20260313-22:18:11]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:11]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:18:11]: LAUNCH_MODE=direct
Info[20260313-22:18:11]: SRUN_ARGS=-n 1
Info[20260313-22:18:11]: TIME_LIMIT=00:03:00
Info[20260313-22:18:11]: MARGIN_SEC=60
Info[20260313-22:18:11]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:18:11]: CHECKPOINT_EVERY=20
Info[20260313-22:18:11]: MAX_ITER=500
Info[20260313-22:18:11]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:18:10 EndTime=2026-03-13T22:21:10
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
Bash SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:11]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:11]: Timeout TERM observed; checkpoint advanced (0→120). Requeueing...
Info[20260313-22:20:11]: Requeued via scontrol.
Info[20260313-22:20:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=1
Info[20260313-22:20:27]: Settings:
Info[20260313-22:20:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:20:27]: LAUNCH_MODE=direct
Info[20260313-22:20:27]: SRUN_ARGS=-n 1
Info[20260313-22:20:27]: TIME_LIMIT=00:03:00
Info[20260313-22:20:27]: MARGIN_SEC=60
Info[20260313-22:20:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:20:27]: CHECKPOINT_EVERY=20
Info[20260313-22:20:27]: MAX_ITER=500
Info[20260313-22:20:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:20:27 EndTime=2026-03-13T22:23:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (every 20, MAX_ITER=500)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
Bash SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:27]: Timeout TERM observed; checkpoint advanced (120→240). Requeueing...
Info[20260313-22:22:27]: Requeued via scontrol.
Info[20260313-22:22:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=2
Info[20260313-22:22:57]: Settings:
Info[20260313-22:22:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:22:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:22:57]: LAUNCH_MODE=direct
Info[20260313-22:22:57]: SRUN_ARGS=-n 1
Info[20260313-22:22:57]: TIME_LIMIT=00:03:00
Info[20260313-22:22:57]: MARGIN_SEC=60
Info[20260313-22:22:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:22:57]: CHECKPOINT_EVERY=20
Info[20260313-22:22:57]: MAX_ITER=500
Info[20260313-22:22:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:22:57 EndTime=2026-03-13T22:25:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (every 20, MAX_ITER=500)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
Bash SIGTERM: saving i=360 and exiting 99
Info[20260313-22:24:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:24:57]: Timeout TERM observed; checkpoint advanced (240→360). Requeueing...
Info[20260313-22:24:57]: Requeued via scontrol.
Info[20260313-22:25:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=3
Info[20260313-22:25:27]: Settings:
Info[20260313-22:25:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:25:27]: LAUNCH_MODE=direct
Info[20260313-22:25:27]: SRUN_ARGS=-n 1
Info[20260313-22:25:27]: TIME_LIMIT=00:03:00
Info[20260313-22:25:27]: MARGIN_SEC=60
Info[20260313-22:25:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:25:27]: CHECKPOINT_EVERY=20
Info[20260313-22:25:27]: MAX_ITER=500
Info[20260313-22:25:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:25:27 EndTime=2026-03-13T22:28:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (every 20, MAX_ITER=500)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
Bash SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:27]: Timeout TERM observed; checkpoint advanced (360→480). Requeueing...
Info[20260313-22:27:27]: Requeued via scontrol.
Info[20260313-22:27:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=4
Info[20260313-22:27:57]: Settings:
Info[20260313-22:27:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:27:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:27:57]: LAUNCH_MODE=direct
Info[20260313-22:27:57]: SRUN_ARGS=-n 1
Info[20260313-22:27:57]: TIME_LIMIT=00:03:00
Info[20260313-22:27:57]: MARGIN_SEC=60
Info[20260313-22:27:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:27:57]: CHECKPOINT_EVERY=20
Info[20260313-22:27:57]: MAX_ITER=500
Info[20260313-22:27:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:27:57 EndTime=2026-03-13T22:30:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (every 20, MAX_ITER=500)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:18]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:18]: Completed.
}}}
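
In the output log the application announces that it is exiting with code 99, yet the runner reports exit code 124 "from timeout wrapper". That is consistent with the application being started under GNU `timeout`, which returns 124 whenever the time limit was hit, regardless of the child's own exit status, unless `--preserve-status` is given. The sketch below reproduces this behavior in isolation. It is not the runner's code: the stand-in child command and the variable names are hypothetical, and GNU coreutils `timeout` is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: a tiny stand-in application, not the workshop scripts.
# No 'set -e' here, because the non-zero exit codes are what we inspect.
set -uo pipefail

# Child: traps TERM, reports, and exits 99, like term_handler in the script.
child='trap "echo child: got TERM, exiting 99; exit 99" TERM
while true; do sleep 0.1; done'

# Default behavior: timeout reports 124 because the limit was reached.
timeout --signal=TERM 1 bash -c "$child"
rc_default=$?

# With --preserve-status, timeout passes the child's own status through.
timeout --preserve-status --signal=TERM 1 bash -c "$child"
rc_preserved=$?

echo "default=${rc_default} preserved=${rc_preserved}"
```

Running this prints the child's message twice, followed by `default=124 preserved=99`. The timestamps in the log also show each interrupted run lasting 120 seconds (for example 22:18:11 to 22:20:11), which matches TIME_LIMIT minus MARGIN_SEC. That reading is an inference from the log rather than a statement about how '''checkpoint_runner.sh''' is written.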