Python Checkpointing Example
Checkpoint Runner
See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.
Python Checkpointing Application
checkpoint_signal_iter.py
#!/usr/bin/env python3
import os, time, sys, signal, tempfile

# Settings come from the environment so the job script can override them.
CKPT = os.environ.get("CKPT_PATH", "state_iter.json")
EVERY = int(os.environ.get("CHECKPOINT_EVERY", "20"))
MAX_ITER = int(os.environ.get("MAX_ITER", "500"))

# Latest iteration reached by main(); read by the SIGTERM handler.
_current = None

def atomic_save(path: str, data: bytes):
    # atomic write (tmp + fsync + replace)
    dirpath = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmppath, path)
    finally:
        # Remove the temporary file if the replace did not happen.
        try:
            if os.path.exists(tmppath):
                os.remove(tmppath)
        except Exception:
            pass

def save(i: int):
    atomic_save(CKPT, f"{i}".encode("utf-8"))

def load() -> int:
    # A missing or unreadable checkpoint means "start from 0".
    try:
        with open(CKPT, "rb") as f:
            return int(f.read().decode("utf-8").strip())
    except Exception:
        return 0

def on_sigterm(signum, frame):
    # Save the latest iteration and exit with a distinctive code.
    i = _current if _current is not None else load()
    print(f"SIGTERM: saving i={i} and exiting 99", flush=True)
    save(i)
    sys.exit(99)

signal.signal(signal.SIGTERM, on_sigterm)

def main():
    global _current
    i = load()
    print(f"Resuming from i={i} (iteration-based every {EVERY})", flush=True)
    while True:
        i += 1
        _current = i
        time.sleep(1)  # stand-in for one iteration of real work
        if i % EVERY == 0:
            save(i)
            print(f"[periodic/iter] saved i={i}", flush=True)
        if i > MAX_ITER:
            print(f"Reached i={i} > {MAX_ITER}; exiting 0", flush=True)
            save(i)
            sys.exit(0)

if __name__ == "__main__":
    main()
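The script above checkpoints a single integer. Real applications usually carry more state, so the following is a minimal sketch of the same tmp + fsync + replace pattern applied to a small JSON dictionary. The names atomic_save_json and load_json, the file name state_demo.json, and the "total" field are hypothetical and are not part of checkpoint_signal_iter.py.

```python
import json
import os
import tempfile


def atomic_save_json(path, state):
    """Write `state` as JSON via a temp file, fsync, and os.replace."""
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old file or the complete new one.
        os.replace(tmppath, path)
    finally:
        # The temp file only still exists if the replace did not happen.
        if os.path.exists(tmppath):
            os.remove(tmppath)


def load_json(path, default):
    """Return the saved state, or `default` if it is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "state_demo.json")  # hypothetical file name
        state = load_json(ckpt, {"i": 0, "total": 0.0})
        for _ in range(3):
            state["i"] += 1
            state["total"] += 0.5 * state["i"]
        atomic_save_json(ckpt, state)
        print(load_json(ckpt, None))  # {'i': 3, 'total': 3.0}
```

Because the temporary file is created in the same directory as the checkpoint, os.replace stays on one file system, which is what allows the rename to be atomic.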
Running Python checkpointing example on Cypress
To run the Python checkpointing example job with its defaults (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.
- Edit the files checkpoint_runner.sh and checkpoint_signal_iter.py in your current directory. For file editing with nano, etc., see File Editing Example.
- Submit the job via the following command.
[tulaneID@cypress1 ~]$ CKPT_PATH=state_iter_py.txt sbatch checkpoint_runner.sh
- Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
- Typical contents of the error and output files, log_<jobID>.err and log_<jobID>.out, are shown below. In this run the job was cancelled at its time margin and requeued four times (RESTARTS=1 through 4) before completing. (Not all cancellations were captured in the error file.)
[tulaneID@cypress1 ~]$ cat log_3300699.err
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:22:57 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:25:27 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:27:57 ***
[tulaneID@cypress1 ~]$ cat log_3300699.out
Info[20260313-22:18:16]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=0
Info[20260313-22:18:16]: Settings:
Info[20260313-22:18:16]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:16]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:18:16]: LAUNCH_MODE=direct
Info[20260313-22:18:16]: SRUN_ARGS=-n 1
Info[20260313-22:18:16]: TIME_LIMIT=00:03:00
Info[20260313-22:18:16]: MARGIN_SEC=60
Info[20260313-22:18:16]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:18:16]: CHECKPOINT_EVERY=20
Info[20260313-22:18:16]: MAX_ITER=500
Info[20260313-22:18:16]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:16 EndTime=2026-03-13T22:21:16
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (iteration-based every 20)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:17]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:17]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
Info[20260313-22:20:17]: Requeued via scontrol.
Info[20260313-22:20:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=1
Info[20260313-22:20:57]: Settings:
Info[20260313-22:20:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:20:57]: LAUNCH_MODE=direct
Info[20260313-22:20:57]: SRUN_ARGS=-n 1
Info[20260313-22:20:57]: TIME_LIMIT=00:03:00
Info[20260313-22:20:57]: MARGIN_SEC=60
Info[20260313-22:20:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:20:57]: CHECKPOINT_EVERY=20
Info[20260313-22:20:57]: MAX_ITER=500
Info[20260313-22:20:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (iteration-based every 20)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:57]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
Info[20260313-22:22:57]: Requeued via scontrol.
Info[20260313-22:23:27]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=2
Info[20260313-22:23:27]: Settings:
Info[20260313-22:23:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:23:27]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:23:27]: LAUNCH_MODE=direct
Info[20260313-22:23:27]: SRUN_ARGS=-n 1
Info[20260313-22:23:27]: TIME_LIMIT=00:03:00
Info[20260313-22:23:27]: MARGIN_SEC=60
Info[20260313-22:23:27]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:23:27]: CHECKPOINT_EVERY=20
Info[20260313-22:23:27]: MAX_ITER=500
Info[20260313-22:23:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (iteration-based every 20)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
SIGTERM: saving i=360 and exiting 99
Info[20260313-22:25:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:27]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
Info[20260313-22:25:27]: Requeued via scontrol.
Info[20260313-22:25:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=3
Info[20260313-22:25:57]: Settings:
Info[20260313-22:25:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:25:57]: LAUNCH_MODE=direct
Info[20260313-22:25:57]: SRUN_ARGS=-n 1
Info[20260313-22:25:57]: TIME_LIMIT=00:03:00
Info[20260313-22:25:57]: MARGIN_SEC=60
Info[20260313-22:25:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:25:57]: CHECKPOINT_EVERY=20
Info[20260313-22:25:57]: MAX_ITER=500
Info[20260313-22:25:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (iteration-based every 20)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:57]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
Info[20260313-22:27:57]: Requeued via scontrol.
Info[20260313-22:28:20]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=4
Info[20260313-22:28:20]: Settings:
Info[20260313-22:28:20]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:28:20]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:28:20]: LAUNCH_MODE=direct
Info[20260313-22:28:20]: SRUN_ARGS=-n 1
Info[20260313-22:28:20]: TIME_LIMIT=00:03:00
Info[20260313-22:28:20]: MARGIN_SEC=60
Info[20260313-22:28:20]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:28:20]: CHECKPOINT_EVERY=20
Info[20260313-22:28:20]: MAX_ITER=500
Info[20260313-22:28:20]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (iteration-based every 20)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:41]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:41]: Completed.
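The number of restarts in the log above can be checked with a little arithmetic. The sketch below assumes that the job script sends TERM roughly MARGIN_SEC seconds before TIME_LIMIT and that each iteration takes about one second (the script sleeps for 1 s per iteration). Both are inferences from the log timestamps, not documented behavior of checkpoint_runner.sh, and the function names are hypothetical.

```python
import math


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


def estimate_runs(time_limit, margin_sec, max_iter, sec_per_iter=1.0):
    """Return (iterations per run, total runs) under the stated assumptions."""
    # Assumption: the application is signalled margin_sec before the limit.
    budget = to_seconds(time_limit) - margin_sec
    iters_per_run = int(budget // sec_per_iter)
    if iters_per_run <= 0:
        raise ValueError("time limit leaves no room for any iteration")
    # The loop exits once i > MAX_ITER, so it runs MAX_ITER + 1 iterations.
    total_iters = max_iter + 1
    return iters_per_run, math.ceil(total_iters / iters_per_run)


if __name__ == "__main__":
    per_run, runs = estimate_runs("00:03:00", 60, 500)
    print(f"{per_run} iterations per run, {runs} runs, {runs - 1} restarts")
    # 120 iterations per run, 5 runs, 4 restarts
```

With the defaults this gives 120 iterations per run and 4 restarts, which matches the checkpoint advancing by 120 per run and the final RESTARTS=4 in the log. The same estimate can help choose MAX_RESTARTS for longer jobs.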
