
R Checkpointing Example

Checkpoint Runner

See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.

R Checkpointing Application

checkpoint_signal_iter.R

#!/usr/bin/env Rscript
# checkpoint_signal_iter.R  (requires 'sigterm' package)
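# Fail early with a clear message if 'sigterm' is missing. The check is not
# wrapped in suppressMessages(), which would swallow the message() call below.
if (!requireNamespace("sigterm", quietly = TRUE)) {
  message("ERROR: 'sigterm' package not installed; install via remotes/devtools.")
  quit(save="no", status=2L)
}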

get_env <- function(k, d) { v <- Sys.getenv(k, unset=NA); if (is.na(v) || v=="") d else v }
CKPT <- get_env("CKPT_PATH", "state_iter.txt")
EVERY <- as.integer(get_env("CHECKPOINT_EVERY", "20"))
MAX_ITER <- as.integer(get_env("MAX_ITER", "500"))

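# Write the value to a temporary file in the checkpoint's own directory, then
# rename it into place, so a reader never sees a partially written checkpoint.
atomic_save <- function(path, val) {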
  dir <- dirname(normalizePath(path, mustWork = FALSE))
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  tmp <- tempfile(pattern = ".ckpt.", tmpdir = dir)
  con <- file(tmp, open="wt"); writeLines(as.character(val), con); flush(con); close(con)
  file.rename(tmp, path)
}
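load_ckpt <- function(path) {
  if (!file.exists(path)) return(0L)
  txt <- tryCatch(readLines(path, warn = FALSE), error = function(e) "0")
  # An empty or corrupt file yields NA; fall back to 0 so the loop can start.
  val <- suppressWarnings(as.integer(gsub("[^0-9]", "", paste(txt, collapse = ""))))
  if (is.na(val)) 0L else val
}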

library(sigterm)  # installs a SIGTERM handler; poll has_sigterm_flag()

i <- load_ckpt(CKPT)
cat(sprintf("R: Resuming from i=%d (every %d, MAX_ITER=%d)\n", i, EVERY, MAX_ITER)); flush(stdout())

repeat {
  i <- i + 1L
  Sys.sleep(1)

  if (i %% EVERY == 0L) {
    atomic_save(CKPT, i)
    cat(sprintf("[periodic/iter] saved i=%d\n", i)); flush(stdout())
  }

  if (sigterm::has_sigterm_flag()) {
    cat(sprintf("R: SIGTERM detected — saving i=%d and exiting 99\n", i)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=99L)
  }

  if (i > MAX_ITER) {
    cat(sprintf("Reached i=%d > %d; exiting 0\n", i, MAX_ITER)); flush(stdout())
    atomic_save(CKPT, i)
    quit(save="no", status=0L)
  }
}
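
The script stores its state as a single integer in the file named by CKPT_PATH and always resumes from that value. A run that finished with the defaults leaves i=501 there, so resubmitting without clearing the file would exit after a single iteration. The following is a minimal sketch of inspecting and setting aside an old checkpoint before a fresh run. The helper name reset_ckpt and the .bak suffix are illustrative assumptions, not part of the workshop files, and the demo works in a scratch directory so that no real checkpoint is touched.

```shell
# Sketch: inspect and set aside an iteration checkpoint before a fresh run.
# reset_ckpt is a hypothetical helper, not part of checkpoint_runner.sh.
reset_ckpt() {
  ckpt="$1"
  if [ -f "$ckpt" ]; then
    echo "old checkpoint: i=$(cat "$ckpt")"
    mv -f "$ckpt" "$ckpt.bak"   # keep a copy instead of deleting outright
  else
    echo "no checkpoint at $ckpt; the next run starts from i=0"
  fi
}

# Demo in a scratch directory.
demo_dir="$(mktemp -d)"
CKPT_PATH="$demo_dir/state_iter_r.txt"
echo 501 > "$CKPT_PATH"         # what a completed default run leaves behind
reset_ckpt "$CKPT_PATH"
```

For a real job, the same function would be called on the checkpoint file passed to the job, for example state_iter_r.txt in the submission directory.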

Running R checkpointing example on Cypress

To run the R checkpointing example with the default settings (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.

  1. Create (or edit) the files checkpoint_runner.sh and checkpoint_signal_iter.R in your current directory, using the contents from the Checkpoint Runner page and the script above. For file editing with nano, etc., see File Editing Example.
  1. Follow the steps in the section Alternative for tidyverse to build a container image file tidyverse_latest.sif (or tidyverse_4.5.2.sif, as used below) for use with Singularity.
  1. Install the required R package sigterm as follows, substituting your own writable R library directory for <R_lib_path>. See Installing R Packages on Cypress for more information on using a writable R library directory.
[tulaneID@cypress1 ~]$ idev --partition=centos7
[tulaneID@cypress01-XXX ~]$ module load singularity/3.9.0
[tulaneID@cypress01-XXX ~]$ singularity shell tidyverse_4.5.2.sif
Singularity> Rscript --version # confirm Rscript is available
Rscript (R) version 4.5.2 (2025-10-31)
Singularity> Rscript -e "devtools::install_github('atheriel/sigterm', lib='<R_lib_path>')"
Singularity> exit # exit the container
[tulaneID@cypress01-XXX ~]$ exit # exit the interactive session
  1. Submit the job via the following command.
[tulaneID@cypress1 ~]$ APP_CMD="singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R" MODULE_LIST="singularity/3.9.0" CKPT_PATH=state_iter_r.txt sbatch checkpoint_runner.sh
  1. Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
  1. Typical contents of the error and output files, log_<jobID>.err and log_<jobID>.out, are shown below. In this run the job was cancelled at its time limit and requeued itself four times (RESTARTS=0 through RESTARTS=4) before completing. (Not all cancellations were captured in the error file.)
[tulaneID@cypress1 ~]$ cat log_3300700.err
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:20:20 ***
slurmstepd: *** JOB 3300700 CANCELLED AT 2026-03-13T22:28:00 ***

[tulaneID@cypress1 ~]$ cat log_3300700.out
Info[20260313-22:18:18]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=0
Info[20260313-22:18:18]: Settings:
Info[20260313-22:18:18]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:18:18]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:18:18]: LAUNCH_MODE=direct
Info[20260313-22:18:18]: SRUN_ARGS=-n 1
Info[20260313-22:18:18]: TIME_LIMIT=00:03:00
Info[20260313-22:18:18]: MARGIN_SEC=60
Info[20260313-22:18:18]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:18:18]: CHECKPOINT_EVERY=20
Info[20260313-22:18:18]: MAX_ITER=500
Info[20260313-22:18:18]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:18 EndTime=2026-03-13T22:21:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=0 (every 20, MAX_ITER=500)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
R: SIGTERM detected — saving i=117 and exiting 99
Info[20260313-22:20:19]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:19]: Timeout TERM observed; checkpoint advanced (0->117). Requeueing...
Info[20260313-22:20:19]: Requeued via scontrol.
Info[20260313-22:20:58]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=1
Info[20260313-22:20:58]: Settings:
Info[20260313-22:20:58]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:20:58]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:20:58]: LAUNCH_MODE=direct
Info[20260313-22:20:58]: SRUN_ARGS=-n 1
Info[20260313-22:20:58]: TIME_LIMIT=00:03:00
Info[20260313-22:20:58]: MARGIN_SEC=60
Info[20260313-22:20:58]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:20:58]: CHECKPOINT_EVERY=20
Info[20260313-22:20:58]: MAX_ITER=500
Info[20260313-22:20:58]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=117 (every 20, MAX_ITER=500)
[periodic/iter] saved i=120
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
R: SIGTERM detected — saving i=236 and exiting 99
Info[20260313-22:22:59]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:59]: Timeout TERM observed; checkpoint advanced (117->236). Requeueing...
Info[20260313-22:22:59]: Requeued via scontrol.
Info[20260313-22:23:29]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=2
Info[20260313-22:23:29]: Settings:
Info[20260313-22:23:29]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:23:29]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:23:29]: LAUNCH_MODE=direct
Info[20260313-22:23:29]: SRUN_ARGS=-n 1
Info[20260313-22:23:29]: TIME_LIMIT=00:03:00
Info[20260313-22:23:29]: MARGIN_SEC=60
Info[20260313-22:23:29]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:23:29]: CHECKPOINT_EVERY=20
Info[20260313-22:23:29]: MAX_ITER=500
Info[20260313-22:23:29]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=236 (every 20, MAX_ITER=500)
[periodic/iter] saved i=240
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
R: SIGTERM detected — saving i=355 and exiting 99
Info[20260313-22:25:30]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:30]: Timeout TERM observed; checkpoint advanced (236->355). Requeueing...
Info[20260313-22:25:30]: Requeued via scontrol.
Info[20260313-22:25:59]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=3
Info[20260313-22:25:59]: Settings:
Info[20260313-22:25:59]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:25:59]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:25:59]: LAUNCH_MODE=direct
Info[20260313-22:25:59]: SRUN_ARGS=-n 1
Info[20260313-22:25:59]: TIME_LIMIT=00:03:00
Info[20260313-22:25:59]: MARGIN_SEC=60
Info[20260313-22:25:59]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:25:59]: CHECKPOINT_EVERY=20
Info[20260313-22:25:59]: MAX_ITER=500
Info[20260313-22:25:59]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=355 (every 20, MAX_ITER=500)
[periodic/iter] saved i=360
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
R: SIGTERM detected — saving i=474 and exiting 99
Info[20260313-22:28:00]: Program exit code (from timeout wrapper): 124
Info[20260313-22:28:00]: Timeout TERM observed; checkpoint advanced (355->474). Requeueing...
Info[20260313-22:28:00]: Requeued via scontrol.
Info[20260313-22:28:19]: Start on cypress01-066; JOB_ID=3300700; RESTARTS=4
Info[20260313-22:28:19]: Settings:
Info[20260313-22:28:19]: MODULE_LIST=singularity/3.9.0
Info[20260313-22:28:19]: APP_CMD=singularity exec tidyverse_4.5.2.sif Rscript checkpoint_signal_iter.R
Info[20260313-22:28:19]: LAUNCH_MODE=direct
Info[20260313-22:28:19]: SRUN_ARGS=-n 1
Info[20260313-22:28:19]: TIME_LIMIT=00:03:00
Info[20260313-22:28:19]: MARGIN_SEC=60
Info[20260313-22:28:19]: CKPT_PATH=state_iter_r.txt
Info[20260313-22:28:19]: CHECKPOINT_EVERY=20
Info[20260313-22:28:19]: MAX_ITER=500
Info[20260313-22:28:19]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300700 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
R: Resuming from i=474 (every 20, MAX_ITER=500)
[periodic/iter] saved i=480
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:47]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:47]: Completed.
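
The requeue history in the output file can be summarized rather than read line by line. The sketch below is one possible way to do this with grep and sed, not part of the workshop scripts. It builds a small sample file from the runner messages shown above so that it can run anywhere. On Cypress, LOG would point at log_<jobID>.out instead.

```shell
# Sketch: summarize requeues from a runner output file.
# The sample lines are copied from the log above.
LOG="$(mktemp)"
cat > "$LOG" <<'EOF'
Info[20260313-22:20:19]: Timeout TERM observed; checkpoint advanced (0->117). Requeueing...
Info[20260313-22:22:59]: Timeout TERM observed; checkpoint advanced (117->236). Requeueing...
Info[20260313-22:25:30]: Timeout TERM observed; checkpoint advanced (236->355). Requeueing...
Info[20260313-22:28:00]: Timeout TERM observed; checkpoint advanced (355->474). Requeueing...
Info[20260313-22:28:47]: Completed.
EOF

# Number of times the job requeued itself.
requeues="$(grep -c 'checkpoint advanced' "$LOG")"
echo "requeues: $requeues"

# Iterations gained in each run segment, from the "(from->to)" pairs.
gains="$(sed -n 's/.*checkpoint advanced (\([0-9]*\)->\([0-9]*\)).*/\1 \2/p' "$LOG" |
  while read -r from to; do echo $((to - from)); done |
  paste -s -d ' ' -)"
echo "iterations per segment: $gains"

# Report whether the final segment finished.
if grep -q 'Completed\.' "$LOG"; then echo "job completed"; fi
```

Each segment gains a little under 120 iterations, which is consistent with TIME_LIMIT=00:03:00 less MARGIN_SEC=60 at about one iteration per second.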
Last modified on 03/14/2026 01:19:36 AM