Changes between Initial Version and Version 1 of Workshops/JobCheckpointing/Examples/BASH


Timestamp:
03/13/2026 11:31:04 PM (2 days ago)
Author:
Carl Baribault
Comment:

Initial version of the BASH checkpointing example

[[PageOutline]]
= BASH Checkpointing Example =

== Checkpoint Runner ==

See [wiki:Workshops/JobCheckpointing/Examples#CheckpointRunner Checkpoint Runner] for the contents of the job script file '''checkpoint_runner.sh'''.

== BASH Checkpointing Application ==

=== checkpoint_signal_iter.sh ===

{{{
#!/usr/bin/env bash
# checkpoint_signal_iter.sh
set -euo pipefail

CKPT="${CKPT_PATH:-state_iter.txt}"   # checkpoint file
EVERY="${CHECKPOINT_EVERY:-20}"       # save a checkpoint every EVERY iterations
MAX_ITER="${MAX_ITER:-500}"           # stop once i exceeds MAX_ITER
i=0

save() {
  # atomic write: tmp + mv
  local val="$1"
  local dir; dir="$(dirname "$(readlink -f "$CKPT")")"
  mkdir -p "$dir"
  local tmp; tmp="$(mktemp -p "$dir" .ckpt.XXXXXX)"
  printf '%s\n' "$val" > "$tmp"
  # Optional (broad) flush for older coreutils: uncomment if you want it
  # sync   # (no -d on coreutils 8.4)
  mv -f "$tmp" "$CKPT"
}

# --- load checkpoint if present ---
if [[ -f "$CKPT" ]]; then
  # keep only the digits so a stray CR or other junk cannot break the arithmetic below
  saved="$(tr -cd '0-9' 2>/dev/null < "$CKPT" || true)"
  if [[ -n "$saved" ]]; then
    i=$(( 10#$saved ))   # force base 10 in case of leading zeros
  fi
fi

# --- handle TERM: save current i and exit(99) to trigger requeue ---
term_handler() {
  echo "Bash SIGTERM: saving i=${i} and exiting 99"
  save "$i"
  exit 99
}
trap 'term_handler' TERM

echo "Resuming from i=${i} (every ${EVERY}, MAX_ITER=${MAX_ITER})"
while true; do
  i=$(( i + 1 ))
  sleep 1

  if (( i % EVERY == 0 )); then
    save "$i"
    echo "[periodic/iter] saved i=${i}"
  fi

  if (( i > MAX_ITER )); then
    echo "Reached i=${i} > ${MAX_ITER}; exiting 0"
    save "$i"
    exit 0
  fi
done
}}}
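
The two mechanisms the script relies on, an atomic checkpoint write and a TERM trap that saves state before exiting with code 99, can be tried in isolation without submitting a job. The following is only a minimal sketch under stated assumptions: the scratch directory, the file name `state_demo.txt`, and the counter value 7 are hypothetical and used purely for illustration, and the subshell signals itself instead of being signalled by a job runner.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch location used only for this demonstration.
demo_dir="$(mktemp -d)"
ckpt="$demo_dir/state_demo.txt"

save() {
  # Write to a temporary file in the same directory, then rename it into place.
  local val="$1" tmp
  tmp="$(mktemp -p "$demo_dir" .ckpt.XXXXXX)"
  printf '%s\n' "$val" > "$tmp"
  mv -f "$tmp" "$ckpt"
}

rc=0
(
  i=7                               # pretend seven iterations have completed
  trap 'save "$i"; exit 99' TERM    # same save-then-exit pattern as term_handler
  kill -TERM "$BASHPID"             # deliver SIGTERM to this subshell
  sleep 5                           # not reached: the trap exits first
) || rc=$?

# Read the checkpoint back, keeping digits only, and count leftover temp files.
restored="$(tr -cd '0-9' < "$ckpt")"
leftovers="$(find "$demo_dir" -name '.ckpt.*' | wc -l | tr -d ' ')"
echo "exit_code=$rc restored=$restored leftovers=$leftovers"
rm -rf "$demo_dir"
```

Because the temporary file is created in the same directory as the checkpoint, the final `mv` is a rename within one file system, so a reader should see either the old checkpoint or the new one and never a partially written file.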

== Running the BASH checkpointing example on Cypress ==

The following steps run the BASH checkpointing job example with its default settings: a checkpoint every 20 application iterations and 500 iterations in total.

1. Edit the files '''checkpoint_runner.sh''' and '''checkpoint_signal_iter.sh''' in your current directory.
 For file editing with nano, etc., see [[https://wiki.hpc.tulane.edu/trac/wiki/cypress/FileEditingSoftware/Example|File Editing Example]].

2. Submit the job with the following command.

{{{
[tulaneID@cypress1 ~]$ APP_CMD="./checkpoint_signal_iter.sh" CKPT_PATH=state_iter_bash.txt sbatch checkpoint_runner.sh
}}}

3. Monitor the job's output with the following command, substituting the job ID for <jobID>.

{{{
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
}}}

4. The following are typical contents of the error and output files, '''log_<jobID>.err''' and '''log_<jobID>.out'''. In this run the job was cancelled and requeued four times (RESTARTS=1 through RESTARTS=4) before it completed.

{{{
[tulaneID@cypress1 ~]$ cat log_3300698.err
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:20:11 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:22:27 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:24:57 ***
Terminated
slurmstepd: *** JOB 3300698 CANCELLED AT 2026-03-13T22:27:27 ***
}}}

{{{
[tulaneID@cypress1 ~]$ cat log_3300698.groupID.out
Info[20260313-22:18:11]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=0
Info[20260313-22:18:11]: Settings:
Info[20260313-22:18:11]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:11]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:18:11]: LAUNCH_MODE=direct
Info[20260313-22:18:11]: SRUN_ARGS=-n 1
Info[20260313-22:18:11]: TIME_LIMIT=00:03:00
Info[20260313-22:18:11]: MARGIN_SEC=60
Info[20260313-22:18:11]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:18:11]: CHECKPOINT_EVERY=20
Info[20260313-22:18:11]: MAX_ITER=500
Info[20260313-22:18:11]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:10 EndTime=2026-03-13T22:21:10
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
Bash SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:11]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:11]: Timeout TERM observed; checkpoint advanced (0→120). Requeueing...
Info[20260313-22:20:11]: Requeued via scontrol.
Info[20260313-22:20:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=1
Info[20260313-22:20:27]: Settings:
Info[20260313-22:20:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:20:27]: LAUNCH_MODE=direct
Info[20260313-22:20:27]: SRUN_ARGS=-n 1
Info[20260313-22:20:27]: TIME_LIMIT=00:03:00
Info[20260313-22:20:27]: MARGIN_SEC=60
Info[20260313-22:20:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:20:27]: CHECKPOINT_EVERY=20
Info[20260313-22:20:27]: MAX_ITER=500
Info[20260313-22:20:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:27 EndTime=2026-03-13T22:23:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (every 20, MAX_ITER=500)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
Bash SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:27]: Timeout TERM observed; checkpoint advanced (120→240). Requeueing...
Info[20260313-22:22:27]: Requeued via scontrol.
Info[20260313-22:22:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=2
Info[20260313-22:22:57]: Settings:
Info[20260313-22:22:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:22:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:22:57]: LAUNCH_MODE=direct
Info[20260313-22:22:57]: SRUN_ARGS=-n 1
Info[20260313-22:22:57]: TIME_LIMIT=00:03:00
Info[20260313-22:22:57]: MARGIN_SEC=60
Info[20260313-22:22:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:22:57]: CHECKPOINT_EVERY=20
Info[20260313-22:22:57]: MAX_ITER=500
Info[20260313-22:22:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:22:57 EndTime=2026-03-13T22:25:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (every 20, MAX_ITER=500)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
Bash SIGTERM: saving i=360 and exiting 99
Info[20260313-22:24:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:24:57]: Timeout TERM observed; checkpoint advanced (240→360). Requeueing...
Info[20260313-22:24:57]: Requeued via scontrol.
Info[20260313-22:25:27]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=3
Info[20260313-22:25:27]: Settings:
Info[20260313-22:25:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:27]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:25:27]: LAUNCH_MODE=direct
Info[20260313-22:25:27]: SRUN_ARGS=-n 1
Info[20260313-22:25:27]: TIME_LIMIT=00:03:00
Info[20260313-22:25:27]: MARGIN_SEC=60
Info[20260313-22:25:27]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:25:27]: CHECKPOINT_EVERY=20
Info[20260313-22:25:27]: MAX_ITER=500
Info[20260313-22:25:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:27 EndTime=2026-03-13T22:28:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (every 20, MAX_ITER=500)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
Bash SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:27]: Timeout TERM observed; checkpoint advanced (360→480). Requeueing...
Info[20260313-22:27:27]: Requeued via scontrol.
Info[20260313-22:27:57]: Start on cypress01-066; JOB_ID=3300698; RESTARTS=4
Info[20260313-22:27:57]: Settings:
Info[20260313-22:27:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:27:57]: APP_CMD=./checkpoint_signal_iter.sh
Info[20260313-22:27:57]: LAUNCH_MODE=direct
Info[20260313-22:27:57]: SRUN_ARGS=-n 1
Info[20260313-22:27:57]: TIME_LIMIT=00:03:00
Info[20260313-22:27:57]: MARGIN_SEC=60
Info[20260313-22:27:57]: CKPT_PATH=state_iter_bash.txt
Info[20260313-22:27:57]: CHECKPOINT_EVERY=20
Info[20260313-22:27:57]: MAX_ITER=500
Info[20260313-22:27:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300698 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:27:57 EndTime=2026-03-13T22:30:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (every 20, MAX_ITER=500)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:18]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:18]: Completed.
}}}
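
The number of restarts seen in this log can be cross-checked with a little arithmetic. The sketch below rests on assumptions that are inferred from the log rather than confirmed by '''checkpoint_runner.sh''' (which is not shown on this page): that the runner sends TERM after roughly TIME_LIMIT minus MARGIN_SEC seconds, and that each iteration takes about one second because of the `sleep 1` in the loop. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values copied from the Settings lines in the log above.
time_limit_sec=180   # TIME_LIMIT=00:03:00
margin_sec=60        # MARGIN_SEC=60
max_iter=500         # MAX_ITER=500
sec_per_iter=1       # assumption: the loop is dominated by "sleep 1"

# Assumed run-time budget per job run before TERM arrives.
budget_sec=$(( time_limit_sec - margin_sec ))
iters_per_run=$(( budget_sec / sec_per_iter ))

# The script exits only once i > MAX_ITER, so it needs MAX_ITER + 1 iterations.
total_iters=$(( max_iter + 1 ))

# Ceiling division: number of runs needed, and restarts between them.
runs=$(( (total_iters + iters_per_run - 1) / iters_per_run ))
restarts=$(( runs - 1 ))

echo "iters_per_run=$iters_per_run runs=$runs restarts=$restarts"
```

Under these assumptions the estimate is 120 iterations per run, five runs, and four restarts, which agrees with the checkpoint advancing by 120 each time and with the final RESTARTS=4 in the log.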