[[PageOutline]]
= R Checkpointing Example =

== Checkpoint Runner ==

See [wiki:Workshops/JobCheckpointing/Examples#CheckpointRunner Checkpoint Runner] for the contents of the job script file '''checkpoint_runner.sh'''.

== R Checkpointing Application ==

=== checkpoint_signal_iter.R ===

{{{
#!/usr/bin/env Rscript
# checkpoint_signal_iter.R (requires 'sigterm' package)

# Check for the package first; only the namespace lookup is silenced,
# so the error message below is still printed.
if (!suppressWarnings(suppressMessages(requireNamespace("sigterm", quietly = TRUE)))) {
  message("ERROR: 'sigterm' package not installed; install via remotes/devtools.")
  quit(save="no", status=2L)
}

# Read an environment variable, falling back to d when it is unset or empty.
get_env <- function(k, d) {
  v <- Sys.getenv(k, unset=NA)
  if (is.na(v) || v=="") d else v
}

CKPT     <- get_env("CKPT_PATH", "state_iter.txt")
EVERY    <- as.integer(get_env("CHECKPOINT_EVERY", "20"))
MAX_ITER <- as.integer(get_env("MAX_ITER", "500"))

# Write to a temporary file in the target directory, then rename it over
# the checkpoint, so a reader never sees a partially written file.
atomic_save <- function(path, val) {
  dir <- dirname(normalizePath(path, mustWork = FALSE))
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  tmp <- tempfile(pattern = ".ckpt.", tmpdir = dir)
  con <- file(tmp, open="wt")
  writeLines(as.character(val), con)
  flush(con)
  close(con)
  file.rename(tmp, path)
}

# Return the saved iteration count, or 0 if there is no usable checkpoint.
load_ckpt <- function(path) {
  if (!file.exists(path)) return(0L)
  txt <- tryCatch(readLines(path, warn = FALSE), error = function(e) "0")
  val <- suppressWarnings(as.integer(gsub("[^0-9]", "", paste(txt, collapse = ""))))
  if (is.na(val)) 0L else val  # empty or unparsable file: start from 0
}

library(sigterm)  # installs a SIGTERM handler; poll has_sigterm_flag()

i <- load_ckpt(CKPT)
cat(sprintf("R: Resuming from i=%d (every %d, MAX_ITER=%d)\n", i, EVERY, MAX_ITER)); flush(stdout())

repeat {
  i <- i + 1L
  Sys.sleep(1)

  # Periodic checkpoint every EVERY iterations.
  if (i %% EVERY == 0L) {
    atomic_save(CKPT, i)
    cat(sprintf("[periodic/iter] saved i=%d\n", i)); flush(stdout())
  }

  # On SIGTERM, save the current iteration and exit with status 99.
  if (sigterm::has_sigterm_flag()) {
    cat(sprintf("R: SIGTERM detected — saving i=%d and exiting 99\n", i)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=99L)
  }

  # Normal completion once the iteration limit is exceeded.
  if (i > MAX_ITER) {
    cat(sprintf("Reached i=%d > %d; exiting 0\n", i, MAX_ITER)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=0L)
  }
}
}}}

== Running R checkpointing example on Cypress ==

To run the R checkpointing example with its defaults (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.

 1. Create (or edit) the files '''checkpoint_runner.sh''' and '''checkpoint_signal_iter.R''' in your current directory. For file editing with nano, etc., see [wiki:cypress/FileEditingSoftware/Example File Editing Example].
 2. Follow the steps in the section [wiki:cypress/RunningRStudioWithSingularity#Alternativefortidyverse Alternative for tidyverse] to build a container image file '''tidyverse_latest.sif''' (or '''tidyverse_4.5.2.sif''', as used in the following) for use with Singularity.
 3. Install the required R package '''sigterm''' via the following, substituting your own writable R library directory for the empty string in `lib=''`. See [wiki:cypress/InstallingRPackages Installing R Packages on Cypress] for more information on using a writable R library directory.
{{{
[tulaneID@cypress1 ~]$idev --partition=centos7
[tulaneID@cypress01-XXX ~]$module load singularity/3.9.0
[tulaneID@cypress01-XXX ~]$singularity shell tidyverse_4.5.2.sif
Singularity> Rscript --version # confirm Rscript is available
Rscript (R) version 4.5.2 (2025-10-31)
Singularity>Rscript -e "devtools::install_github('atheriel/sigterm', lib='')"
Singularity>exit # exit the container
[tulaneID@cypress01-XXX ~]$exit # exit the interactive session
}}}
 4. Submit the job via the following command.
{{{
[tulaneID@cypress1 ~]$APP_CMD="singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R" MODULE_LIST="singularity/3.9.0" CKPT_PATH=state_iter_r.txt sbatch checkpoint_runner.sh
}}}
 5. Monitor the job's output via the following command, inserting the job ID after '''log_''' in the file name (for example, '''log_3300700.*''' for the job shown in step 6).
{{{
[tulaneID@cypress1 ~]$ tail -f log_.*
}}}
 6. The following shows typical contents of the error and output files, '''log_.err''' and '''log_.out''' (with the job ID inserted after '''log_'''), for a job that was cancelled and requeued four times before completing.
    (Not all cancellations were captured in the error file.)
{{{
[tulaneID@cypress1 ~]$cat log_3300700.err
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:20:20 ***
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:28:00 ***
}}}
{{{
[tulaneID@cypress1 ~]$cat log_3300700.out
Info[20260313-22:18:18]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=0
Info[20260313-22:18:18]: Settings:
Info[20260313-22:18:18]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:18:18]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:18:18]: LAUNCH_MODE=direct
Info[20260313-22:18:18]: SRUN_ARGS=-n 1
Info[20260313-22:18:18]: TIME_LIMIT=00:03:00
Info[20260313-22:18:18]: MARGIN_SEC=60
Info[20260313-22:18:18]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:18:18]: CHECKPOINT_EVERY=20
Info[20260313-22:18:18]: MAX_ITER=500
Info[20260313-22:18:18]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account= QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:18:18 EndTime=2026-03-13T22:21:18
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
R: SIGTERM detected — saving i=117 and exiting 99
Info[20260313-22:20:19]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:19]: Timeout TERM observed; checkpoint advanced (0->117). Requeueing...
Info[20260313-22:20:19]: Requeued via scontrol.
Info[20260313-22:20:58]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=1
Info[20260313-22:20:58]: Settings:
Info[20260313-22:20:58]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:20:58]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:20:58]: LAUNCH_MODE=direct
Info[20260313-22:20:58]: SRUN_ARGS=-n 1
Info[20260313-22:20:58]: TIME_LIMIT=00:03:00
Info[20260313-22:20:58]: MARGIN_SEC=60
Info[20260313-22:20:58]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:20:58]: CHECKPOINT_EVERY=20
Info[20260313-22:20:58]: MAX_ITER=500
Info[20260313-22:20:58]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account= QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=117 (every 20, MAX_ITER=500)
[periodic/iter] saved i=120
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
R: SIGTERM detected — saving i=236 and exiting 99
Info[20260313-22:22:59]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:59]: Timeout TERM observed; checkpoint advanced (117->236). Requeueing...
Info[20260313-22:22:59]: Requeued via scontrol.
Info[20260313-22:23:29]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=2
Info[20260313-22:23:29]: Settings:
Info[20260313-22:23:29]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:23:29]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:23:29]: LAUNCH_MODE=direct
Info[20260313-22:23:29]: SRUN_ARGS=-n 1
Info[20260313-22:23:29]: TIME_LIMIT=00:03:00
Info[20260313-22:23:29]: MARGIN_SEC=60
Info[20260313-22:23:29]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:23:29]: CHECKPOINT_EVERY=20
Info[20260313-22:23:29]: MAX_ITER=500
Info[20260313-22:23:29]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account= QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=236 (every 20, MAX_ITER=500)
[periodic/iter] saved i=240
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
R: SIGTERM detected — saving i=355 and exiting 99
Info[20260313-22:25:30]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:30]: Timeout TERM observed; checkpoint advanced (236->355). Requeueing...
Info[20260313-22:25:30]: Requeued via scontrol.
Info[20260313-22:25:59]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=3
Info[20260313-22:25:59]: Settings:
Info[20260313-22:25:59]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:25:59]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:25:59]: LAUNCH_MODE=direct
Info[20260313-22:25:59]: SRUN_ARGS=-n 1
Info[20260313-22:25:59]: TIME_LIMIT=00:03:00
Info[20260313-22:25:59]: MARGIN_SEC=60
Info[20260313-22:25:59]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:25:59]: CHECKPOINT_EVERY=20
Info[20260313-22:25:59]: MAX_ITER=500
Info[20260313-22:25:59]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account= QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=355 (every 20, MAX_ITER=500)
[periodic/iter] saved i=360
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
R: SIGTERM detected — saving i=474 and exiting 99
Info[20260313-22:28:00]: Program exit code (from timeout wrapper): 124
Info[20260313-22:28:00]: Timeout TERM observed; checkpoint advanced (355->474). Requeueing...
Info[20260313-22:28:00]: Requeued via scontrol.
Info[20260313-22:28:19]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=4
Info[20260313-22:28:19]: Settings:
Info[20260313-22:28:19]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:28:19]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:28:19]: LAUNCH_MODE=direct
Info[20260313-22:28:19]: SRUN_ARGS=-n 1
Info[20260313-22:28:19]: TIME_LIMIT=00:03:00
Info[20260313-22:28:19]: MARGIN_SEC=60
Info[20260313-22:28:19]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:28:19]: CHECKPOINT_EVERY=20
Info[20260313-22:28:19]: MAX_ITER=500
Info[20260313-22:28:19]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account= QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=474 (every 20, MAX_ITER=500)
[periodic/iter] saved i=480
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:47]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:47]: Completed.
}}}
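
The progress made between requeues can be tallied directly from the output file. The following is a minimal sketch, assuming the runner keeps logging `checkpoint advanced (A->B)` exactly as in the output above; the function name `summarize_requeues` is hypothetical and is not part of '''checkpoint_runner.sh'''.

```shell
#!/usr/bin/env bash
# summarize_requeues (hypothetical helper): print one line per requeue,
# showing how far the checkpoint advanced during that run.
# Assumes log lines of the form "... checkpoint advanced (A->B). ..."
summarize_requeues() {
  # $1: path to the job's .out file
  sed -n 's/.*checkpoint advanced (\([0-9][0-9]*\)->\([0-9][0-9]*\)).*/\1 \2/p' "$1" |
    awk '{ printf "run %d: %d -> %d (+%d iterations)\n", NR, $1, $2, $2 - $1 }'
}

# Usage: summarize_requeues log_3300700.out
```

For the output file shown above, this would report advances of 117, 119, 119, and 119 iterations for the four requeued runs. With one second of sleep per iteration, that is consistent with the 3-minute TIME_LIMIT less the 60-second MARGIN_SEC.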