R Checkpointing Example
Checkpoint Runner
See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.
R Checkpointing Application
checkpoint_signal_iter.R
#!/usr/bin/env Rscript
# checkpoint_signal_iter.R (requires 'sigterm' package)
# Stop early with a clear message if the 'sigterm' package is missing.
suppressWarnings(suppressMessages({
  if (!requireNamespace("sigterm", quietly = TRUE)) {
    message("ERROR: 'sigterm' package not installed; install via remotes/devtools.")
    quit(save="no", status=2L)
  }
}))
# Read an environment variable, falling back to default d when unset or empty.
get_env <- function(k, d) {
  v <- Sys.getenv(k, unset=NA)
  if (is.na(v) || v=="") d else v
}
CKPT <- get_env("CKPT_PATH", "state_iter.txt")
EVERY <- as.integer(get_env("CHECKPOINT_EVERY", "20"))
MAX_ITER <- as.integer(get_env("MAX_ITER", "500"))
# Write the value to a temporary file in the target directory, then rename it
# over the checkpoint so a reader never sees a partially written file.
atomic_save <- function(path, val) {
  dir <- dirname(normalizePath(path, mustWork = FALSE))
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  tmp <- tempfile(pattern = ".ckpt.", tmpdir = dir)
  con <- file(tmp, open="wt")
  writeLines(as.character(val), con)
  flush(con)
  close(con)
  file.rename(tmp, path)
}
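# Return the saved iteration count, or 0 when there is no usable checkpoint.
load_ckpt <- function(path) {
  if (!file.exists(path)) return(0L)
  txt <- tryCatch(readLines(path, warn = FALSE), error = function(e) "0")
  val <- suppressWarnings(as.integer(gsub("[^0-9]", "", paste(txt, collapse = ""))))
  if (is.na(val)) 0L else val  # an empty or non-numeric file would otherwise yield NA
}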
library(sigterm) # installs a SIGTERM handler; poll has_sigterm_flag()
i <- load_ckpt(CKPT)
cat(sprintf("R: Resuming from i=%d (every %d, MAX_ITER=%d)\n", i, EVERY, MAX_ITER)); flush(stdout())
repeat {
  i <- i + 1L
  Sys.sleep(1)  # stands in for one unit of real work
  # Periodic checkpoint every EVERY iterations.
  if (i %% EVERY == 0L) {
    atomic_save(CKPT, i)
    cat(sprintf("[periodic/iter] saved i=%d\n", i)); flush(stdout())
  }
  # On SIGTERM, save the current iteration and exit with status 99.
  if (sigterm::has_sigterm_flag()) {
    cat(sprintf("R: SIGTERM detected — saving i=%d and exiting 99\n", i)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=99L)
  }
  # Normal completion once the iteration count passes MAX_ITER.
  if (i > MAX_ITER) {
    cat(sprintf("Reached i=%d > %d; exiting 0\n", i, MAX_ITER)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=0L)
  }
}
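For comparison, the same three ideas (resume from a saved counter, write the checkpoint atomically, and poll a flag set by a SIGTERM handler) can be sketched in shell. This is only an illustrative analogue, not part of the example job: the default file name state_iter_sh.txt and the STEP_SLEEP variable are assumptions introduced here so the demonstration finishes quickly.

```shell
#!/usr/bin/env bash
# Illustrative shell analogue of checkpoint_signal_iter.R (not part of the example job).
CKPT="${CKPT_PATH:-state_iter_sh.txt}"   # hypothetical default file name
EVERY="${CHECKPOINT_EVERY:-20}"
MAX_ITER="${MAX_ITER:-500}"
STEP_SLEEP="${STEP_SLEEP:-1}"            # hypothetical knob; the R script always sleeps 1 s
GOT_TERM=0                               # set to 1 by the TERM trap, polled in the loop

# atomic_save PATH VALUE: write to a temp file in the same directory, then rename.
atomic_save() {
  ckpt_tmp="$(mktemp "$(dirname "$1")/.ckpt.XXXXXX")" || return 1
  printf '%s\n' "$2" > "$ckpt_tmp" && mv -f "$ckpt_tmp" "$1"
}

# load_ckpt PATH: print the saved iteration, or 0 if the file is missing or empty.
load_ckpt() {
  ckpt_val=""
  if [ -f "$1" ]; then
    # keep digits only, then drop leading zeros so arithmetic stays decimal
    ckpt_val="$(tr -cd '0-9' < "$1" | sed 's/^0*//')"
  fi
  echo "${ckpt_val:-0}"
}

# run_loop: returns 99 after a TERM signal, 0 on normal completion.
run_loop() {
  trap 'GOT_TERM=1' TERM
  i="$(load_ckpt "$CKPT")"
  echo "sh: Resuming from i=$i (every $EVERY, MAX_ITER=$MAX_ITER)"
  while true; do
    i=$((i + 1))
    sleep "$STEP_SLEEP"
    if [ $((i % EVERY)) -eq 0 ]; then
      atomic_save "$CKPT" "$i"
      echo "[periodic/iter] saved i=$i"
    fi
    if [ "$GOT_TERM" -eq 1 ]; then
      atomic_save "$CKPT" "$i"
      return 99
    fi
    if [ "$i" -gt "$MAX_ITER" ]; then
      atomic_save "$CKPT" "$i"
      return 0
    fi
  done
}

# Tiny demonstration with shortened settings in a scratch directory.
demo_dir="$(mktemp -d)"
CKPT="$demo_dir/state.txt"; EVERY=2; MAX_ITER=5; STEP_SLEEP=0
run_loop
status=$?
echo "exit status $status, checkpoint $(cat "$CKPT")"
```

With these shortened settings the loop finishes with exit status 0 and a final checkpoint of 6, one past MAX_ITER, which mirrors the i=501 that the R script reports for MAX_ITER=500.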
Running R checkpointing example on Cypress
To run the R checkpointing example job, which by default checkpoints every 20 application iterations and runs for a total of 500 iterations, perform the following steps.
- Create the files checkpoint_runner.sh and checkpoint_signal_iter.R in your current directory, using the contents referenced above. For file editing with nano, etc., see File Editing Example.
- Use the steps in the section Alternative for tidyverse for constructing a container image file tidyverse_latest.sif (or tidyverse_4.5.2.sif as in the following) for use with Singularity.
- Install the required R package sigterm via the following, substituting your own writable R library directory for <R_lib_path>. See Installing R Packages on Cypress for more information on using a writable R library directory.
[tulaneID@cypress1 ~]$ idev --partition=centos7
[tulaneID@cypress01-XXX ~]$ module load singularity/3.9.0
[tulaneID@cypress01-XXX ~]$ singularity shell tidyverse_4.5.2.sif
Singularity> Rscript --version # confirm Rscript is available
Rscript (R) version 4.5.2 (2025-10-31)
Singularity> Rscript -e "devtools::install_github('atheriel/sigterm', lib='<R_lib_path>')"
Singularity> exit # exit the container
[tulaneID@cypress01-XXX ~]$ exit # exit the interactive session
- Submit the job via the following command.
[tulaneID@cypress1 ~]$ APP_CMD="singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R" MODULE_LIST="singularity/3.9.0" CKPT_PATH=state_iter_r.txt sbatch checkpoint_runner.sh
- Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
- Here are typical results for the error and output files, log_<jobID>.err and log_<jobID>.out. In this run the job was stopped by the timeout and requeued itself four times (RESTARTS=1 through 4) before completing on the fifth start. (Only two of the four cancellations were captured in the error file.)
[tulaneID@cypress1 ~]$ cat log_3300700.err
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:20:20 ***
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:28:00 ***
[tulaneID@cypress1 ~]$ cat log_3300700.out
Info[20260313-22:18:18]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=0
Info[20260313-22:18:18]: Settings:
Info[20260313-22:18:18]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:18:18]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:18:18]: LAUNCH_MODE=direct
Info[20260313-22:18:18]: SRUN_ARGS=-n 1
Info[20260313-22:18:18]: TIME_LIMIT=00:03:00
Info[20260313-22:18:18]: MARGIN_SEC=60
Info[20260313-22:18:18]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:18:18]: CHECKPOINT_EVERY=20
Info[20260313-22:18:18]: MAX_ITER=500
Info[20260313-22:18:18]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:18 EndTime=2026-03-13T22:21:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
R: SIGTERM detected — saving i=117 and exiting 99
Info[20260313-22:20:19]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:19]: Timeout TERM observed; checkpoint advanced (0->117). Requeueing...
Info[20260313-22:20:19]: Requeued via scontrol.
Info[20260313-22:20:58]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=1
Info[20260313-22:20:58]: Settings:
Info[20260313-22:20:58]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:20:58]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:20:58]: LAUNCH_MODE=direct
Info[20260313-22:20:58]: SRUN_ARGS=-n 1
Info[20260313-22:20:58]: TIME_LIMIT=00:03:00
Info[20260313-22:20:58]: MARGIN_SEC=60
Info[20260313-22:20:58]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:20:58]: CHECKPOINT_EVERY=20
Info[20260313-22:20:58]: MAX_ITER=500
Info[20260313-22:20:58]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=117 (every 20, MAX_ITER=500)
[periodic/iter] saved i=120
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
R: SIGTERM detected — saving i=236 and exiting 99
Info[20260313-22:22:59]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:59]: Timeout TERM observed; checkpoint advanced (117->236). Requeueing...
Info[20260313-22:22:59]: Requeued via scontrol.
Info[20260313-22:23:29]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=2
Info[20260313-22:23:29]: Settings:
Info[20260313-22:23:29]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:23:29]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:23:29]: LAUNCH_MODE=direct
Info[20260313-22:23:29]: SRUN_ARGS=-n 1
Info[20260313-22:23:29]: TIME_LIMIT=00:03:00
Info[20260313-22:23:29]: MARGIN_SEC=60
Info[20260313-22:23:29]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:23:29]: CHECKPOINT_EVERY=20
Info[20260313-22:23:29]: MAX_ITER=500
Info[20260313-22:23:29]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=236 (every 20, MAX_ITER=500)
[periodic/iter] saved i=240
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
R: SIGTERM detected — saving i=355 and exiting 99
Info[20260313-22:25:30]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:30]: Timeout TERM observed; checkpoint advanced (236->355). Requeueing...
Info[20260313-22:25:30]: Requeued via scontrol.
Info[20260313-22:25:59]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=3
Info[20260313-22:25:59]: Settings:
Info[20260313-22:25:59]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:25:59]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:25:59]: LAUNCH_MODE=direct
Info[20260313-22:25:59]: SRUN_ARGS=-n 1
Info[20260313-22:25:59]: TIME_LIMIT=00:03:00
Info[20260313-22:25:59]: MARGIN_SEC=60
Info[20260313-22:25:59]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:25:59]: CHECKPOINT_EVERY=20
Info[20260313-22:25:59]: MAX_ITER=500
Info[20260313-22:25:59]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=355 (every 20, MAX_ITER=500)
[periodic/iter] saved i=360
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
R: SIGTERM detected — saving i=474 and exiting 99
Info[20260313-22:28:00]: Program exit code (from timeout wrapper): 124
Info[20260313-22:28:00]: Timeout TERM observed; checkpoint advanced (355->474). Requeueing...
Info[20260313-22:28:00]: Requeued via scontrol.
Info[20260313-22:28:19]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=4
Info[20260313-22:28:19]: Settings:
Info[20260313-22:28:19]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:28:19]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:28:19]: LAUNCH_MODE=direct
Info[20260313-22:28:19]: SRUN_ARGS=-n 1
Info[20260313-22:28:19]: TIME_LIMIT=00:03:00
Info[20260313-22:28:19]: MARGIN_SEC=60
Info[20260313-22:28:19]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:28:19]: CHECKPOINT_EVERY=20
Info[20260313-22:28:19]: MAX_ITER=500
Info[20260313-22:28:19]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=474 (every 20, MAX_ITER=500)
[periodic/iter] saved i=480
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:47]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:47]: Completed.
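To see at a glance how far each run progressed before it was requeued, the "checkpoint advanced (A->B)" lines in the output file can be tabulated. The following is a minimal sketch under the assumption that the runner's messages keep the wording shown above; summarize_ckpt_log is a hypothetical helper name, not something provided by checkpoint_runner.sh, and the two sample lines are copied from the output above.

```shell
#!/usr/bin/env bash
# summarize_ckpt_log is a hypothetical helper, not part of checkpoint_runner.sh.
# It prints one line per "checkpoint advanced (A->B)" message found in the given file.
summarize_ckpt_log() {
  # Extract the two iteration numbers from each matching line, then report the difference.
  sed -n 's/.*checkpoint advanced (\([0-9][0-9]*\)->\([0-9][0-9]*\)).*/\1 \2/p' "$1" |
    while read -r from to; do
      echo "resumed at $from, stopped at $to, advanced $((to - from)) iterations"
    done
}

# Demonstration on two lines taken from the example output file.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Info[20260313-22:20:19]: Timeout TERM observed; checkpoint advanced (0->117). Requeueing...
Info[20260313-22:22:59]: Timeout TERM observed; checkpoint advanced (117->236). Requeueing...
EOF
summarize_ckpt_log "$sample"
```

On a real run, the helper would be called as summarize_ckpt_log log_<jobID>.out. The final run is not reported, because it ends with "Completed." rather than a "checkpoint advanced" message. In the example above each requeued run advances by a little under 120 iterations, consistent with the roughly two minutes between each Start line and the following timeout message at one second per iteration.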
Last modified on 03/14/2026 01:19:36 AM
