Version 1 (modified by Carl Baribault, 2 days ago)

Initial page, reworked Python example

Python Checkpointing Example

Checkpoint Runner

See Checkpoint Runner for the contents of the job script file checkpoint_runner.sh.

Python Checkpointing Application

checkpoint_signal_iter.py

#!/usr/bin/env python3
import os, time, sys, signal, tempfile

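# Settings are read from the environment so the job script can override them.
# The checkpoint file holds only the iteration count as plain text,
# despite the default ".json" name.
CKPT = os.environ.get("CKPT_PATH", "state_iter.json")
EVERY = int(os.environ.get("CHECKPOINT_EVERY", "20"))
MAX_ITER = int(os.environ.get("MAX_ITER", "500"))
# Latest iteration seen by main(); read by the SIGTERM handler.
_current = None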

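def atomic_save(path: str, data: bytes):
    # Atomic write: write a temp file in the target directory, fsync it,
    # then rename it over the target so readers never see a partial file.
    dirpath = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmppath, path)
    finally:
        # After a successful replace the temp file is gone;
        # this only cleans up when the write or rename failed.
        try:
            if os.path.exists(tmppath):
                os.remove(tmppath)
        except Exception:
            pass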

def save(i: int):
    atomic_save(CKPT, f"{i}".encode("utf-8"))

def load() -> int:
    try:
        with open(CKPT, "rb") as f:
            return int(f.read().decode("utf-8").strip())
    except Exception:
        return 0

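def on_sigterm(signum, frame):
    # Save the latest iteration (or the last checkpoint if main() has not
    # started counting yet), then exit with code 99.
    i = _current if _current is not None else load()
    print(f"SIGTERM: saving i={i} and exiting 99", flush=True)
    save(i)
    sys.exit(99)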

signal.signal(signal.SIGTERM, on_sigterm)

def main():
    global _current
    i = load()
    print(f"Resuming from i={i} (iteration-based every {EVERY})", flush=True)
    while True:
        i += 1
        _current = i
        time.sleep(1)

        if i % EVERY == 0:
            save(i)
            print(f"[periodic/iter] saved i={i}", flush=True)

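        # The test is i > MAX_ITER, so the loop stops at i = MAX_ITER + 1
        # (i=501 with the default MAX_ITER of 500).
        if i > MAX_ITER:
            print(f"Reached i={i} > {MAX_ITER}; exiting 0", flush=True)
            save(i)
            sys.exit(0)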

if __name__ == "__main__":
    main()
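
The save-on-SIGTERM behavior can be checked locally, without Slurm, before submitting a job. The following is a minimal sketch of that idea, not part of the workshop files: the function name demo_sigterm_roundtrip and the temporary checkpoint path are hypothetical, and it assumes a Linux system with the code running in the main thread. It installs a handler shaped like on_sigterm above, sends SIGTERM to its own process at iteration 7, and reports the exit code and the saved iteration.

```python
import os
import signal
import tempfile
import time


def demo_sigterm_roundtrip():
    """Send SIGTERM to this process; return (exit_code, saved_iteration)."""
    state = {"i": 0}
    # Hypothetical checkpoint path in a throwaway directory.
    ckpt = os.path.join(tempfile.mkdtemp(), "state_demo.txt")

    def on_sigterm(signum, frame):
        # Same idea as the handler above: persist progress, then exit 99.
        with open(ckpt, "w") as f:
            f.write(str(state["i"]))
        raise SystemExit(99)

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        for i in range(1, 1000):
            state["i"] = i
            if i == 7:
                # Stand-in for the TERM signal the job script would send.
                os.kill(os.getpid(), signal.SIGTERM)
            time.sleep(0.01)
        code = 0
    except SystemExit as exc:
        code = exc.code
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGTERM, previous)

    with open(ckpt) as f:
        return code, int(f.read())


if __name__ == "__main__":
    print(demo_sigterm_roundtrip())
```

The exit code 99 and the saved iteration correspond to the "SIGTERM: saving i=... and exiting 99" line that the real application prints.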

Running Python checkpointing example on Cypress

To run the Python checkpointing job example with its default settings (a checkpoint every 20 application iterations, stopping once the iteration count exceeds 500), perform the following steps.

  1. Edit the files checkpoint_runner.sh and checkpoint_signal_iter.py in your current directory. For file editing with nano, etc., see File Editing Example.
  1. Submit the job via the following command.
[tulaneID@cypress1 ~]$ CKPT_PATH=state_iter_py.txt sbatch checkpoint_runner.sh
  1. Monitor the job's output via the following command, substituting the job ID for <jobID>.
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
  1. The following are normal results for the error and output files, log_<jobID>.err and log_<jobID>.out. In this run the job was cancelled and requeued four times before completing. (Not all cancellations were captured in the error file.)
[tulaneID@cypress1 ~]$ cat log_3300699.err
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:22:57 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:25:27 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:27:57 ***

[tulaneID@cypress1 ~]$ cat log_3300699.out
Info[20260313-22:18:16]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=0
Info[20260313-22:18:16]: Settings:
Info[20260313-22:18:16]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:16]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:18:16]: LAUNCH_MODE=direct
Info[20260313-22:18:16]: SRUN_ARGS=-n 1
Info[20260313-22:18:16]: TIME_LIMIT=00:03:00
Info[20260313-22:18:16]: MARGIN_SEC=60
Info[20260313-22:18:16]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:18:16]: CHECKPOINT_EVERY=20
Info[20260313-22:18:16]: MAX_ITER=500
Info[20260313-22:18:16]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:18:16 EndTime=2026-03-13T22:21:16
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (iteration-based every 20)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:17]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:17]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
Info[20260313-22:20:17]: Requeued via scontrol.
Info[20260313-22:20:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=1
Info[20260313-22:20:57]: Settings:
Info[20260313-22:20:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:20:57]: LAUNCH_MODE=direct
Info[20260313-22:20:57]: SRUN_ARGS=-n 1
Info[20260313-22:20:57]: TIME_LIMIT=00:03:00
Info[20260313-22:20:57]: MARGIN_SEC=60
Info[20260313-22:20:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:20:57]: CHECKPOINT_EVERY=20
Info[20260313-22:20:57]: MAX_ITER=500
Info[20260313-22:20:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (iteration-based every 20)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:57]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
Info[20260313-22:22:57]: Requeued via scontrol.
Info[20260313-22:23:27]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=2
Info[20260313-22:23:27]: Settings:
Info[20260313-22:23:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:23:27]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:23:27]: LAUNCH_MODE=direct
Info[20260313-22:23:27]: SRUN_ARGS=-n 1
Info[20260313-22:23:27]: TIME_LIMIT=00:03:00
Info[20260313-22:23:27]: MARGIN_SEC=60
Info[20260313-22:23:27]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:23:27]: CHECKPOINT_EVERY=20
Info[20260313-22:23:27]: MAX_ITER=500
Info[20260313-22:23:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (iteration-based every 20)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
SIGTERM: saving i=360 and exiting 99
Info[20260313-22:25:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:27]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
Info[20260313-22:25:27]: Requeued via scontrol.
Info[20260313-22:25:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=3
Info[20260313-22:25:57]: Settings:
Info[20260313-22:25:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:25:57]: LAUNCH_MODE=direct
Info[20260313-22:25:57]: SRUN_ARGS=-n 1
Info[20260313-22:25:57]: TIME_LIMIT=00:03:00
Info[20260313-22:25:57]: MARGIN_SEC=60
Info[20260313-22:25:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:25:57]: CHECKPOINT_EVERY=20
Info[20260313-22:25:57]: MAX_ITER=500
Info[20260313-22:25:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (iteration-based every 20)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:57]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
Info[20260313-22:27:57]: Requeued via scontrol.
Info[20260313-22:28:20]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=4
Info[20260313-22:28:20]: Settings:
Info[20260313-22:28:20]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:28:20]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:28:20]: LAUNCH_MODE=direct
Info[20260313-22:28:20]: SRUN_ARGS=-n 1
Info[20260313-22:28:20]: TIME_LIMIT=00:03:00
Info[20260313-22:28:20]: MARGIN_SEC=60
Info[20260313-22:28:20]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:28:20]: CHECKPOINT_EVERY=20
Info[20260313-22:28:20]: MAX_ITER=500
Info[20260313-22:28:20]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
   Priority=80808 Nice=0 Account=<groupID> QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
   StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
   Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (iteration-based every 20)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:41]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:41]: Completed.
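
The number of restarts in the output above can be estimated ahead of time. The sketch below is a worked calculation, not part of the workshop files. It assumes that the job script sends TERM to the application TIME_LIMIT minus MARGIN_SEC seconds after each start (consistent with the timestamps in the log, but not confirmed here since checkpoint_runner.sh is shown on a separate page), and that each iteration takes about 1 second, as set by time.sleep(1). The function name runs_needed is hypothetical.

```python
import math


def runs_needed(time_limit_s, margin_s, sec_per_iter, max_iter):
    """Estimate (iterations per run, total runs) for the example job."""
    # Assumed: the application is signalled after time_limit_s - margin_s seconds.
    per_run = (time_limit_s - margin_s) // sec_per_iter
    # The application exits when i > MAX_ITER, so it performs MAX_ITER + 1 iterations.
    total_iters = max_iter + 1
    return per_run, math.ceil(total_iters / per_run)


# Values from the Settings lines in the log: TIME_LIMIT=00:03:00, MARGIN_SEC=60, MAX_ITER=500.
per_run, runs = runs_needed(180, 60, 1, 500)
print(f"iterations per run: {per_run}")
print(f"runs: {runs}, restarts: {runs - 1}")
```

With these values the estimate is 120 iterations per run and 5 runs (4 restarts), which matches the checkpoint steps 0->120, 120->240, 240->360, 360->480 and the final RESTARTS=4 in the log. It stays well under MAX_RESTARTS=10.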