Changes between Initial Version and Version 1 of Workshops/JobCheckpointing/Examples/Python


Timestamp: 03/13/2026 11:56:07 PM
Author: Carl Baribault
Comment: Initial page, reworked Python example

  • Workshops/JobCheckpointing/Examples/Python

     1[[PageOutline]]
     2= Python Checkpointing Example =
     3
     4== Checkpoint Runner ==
     5
     6See [wiki:Workshops/JobCheckpointing/Examples#CheckpointRunner Checkpoint Runner] for the contents of the job script file '''checkpoint_runner.sh'''.
     7
     8== Python Checkpointing Application ==
     9
     10=== checkpoint_signal_iter.py ===
     11
     12{{{
#!/usr/bin/env python3
"""Iteration-based checkpointing demo that also saves on SIGTERM."""
import os
import signal
import sys
import tempfile
import time

# Settings come from the environment so the job script can override them.
CKPT = os.environ.get("CKPT_PATH", "state_iter.json")
EVERY = int(os.environ.get("CHECKPOINT_EVERY", "20"))
MAX_ITER = int(os.environ.get("MAX_ITER", "500"))
_current = None  # latest iteration, shared with the signal handler


def atomic_save(path: str, data: bytes) -> None:
    """Write data to path atomically (temp file + fsync + replace)."""
    dirpath = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmppath, path)
    finally:
        # Remove the temp file if the replace did not happen.
        try:
            if os.path.exists(tmppath):
                os.remove(tmppath)
        except OSError:
            pass


def save(i: int) -> None:
    """Store the iteration counter as plain text."""
    atomic_save(CKPT, f"{i}".encode("utf-8"))


def load() -> int:
    """Return the saved iteration, or 0 if no usable checkpoint exists."""
    try:
        with open(CKPT, "rb") as f:
            return int(f.read().decode("utf-8").strip())
    except (OSError, ValueError):
        return 0


def on_sigterm(signum, frame):
    """Save the most recent iteration and exit with code 99."""
    i = _current if _current is not None else load()
    print(f"SIGTERM: saving i={i} and exiting 99", flush=True)
    save(i)
    sys.exit(99)


signal.signal(signal.SIGTERM, on_sigterm)


def main():
    global _current
    i = load()
    print(f"Resuming from i={i} (iteration-based every {EVERY})", flush=True)
    while True:
        i += 1
        _current = i
        time.sleep(1)  # stand-in for one unit of real work

        # Periodic checkpoint every EVERY iterations.
        if i % EVERY == 0:
            save(i)
            print(f"[periodic/iter] saved i={i}", flush=True)

        # The test is strict, so the loop ends at i = MAX_ITER + 1.
        if i > MAX_ITER:
            print(f"Reached i={i} > {MAX_ITER}; exiting 0", flush=True)
            save(i)
            sys.exit(0)


if __name__ == "__main__":
    main()
     74}}}
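The SIGTERM path can be exercised without a scheduler. The following is a minimal sketch, assuming a POSIX system: it writes a cut-down stand-in worker (not '''checkpoint_signal_iter.py''' itself, and without the atomic write) to a temporary directory, sends it SIGTERM once it reports that its handler is installed, and returns the exit code and the saved counter. The names `WORKER` and `run_and_interrupt` are hypothetical and exist only for this illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in worker: same handler pattern as the example above,
# but with a plain (non-atomic) write and a much shorter sleep.
WORKER = textwrap.dedent("""
    import os, signal, sys, time
    ckpt = os.environ["CKPT_PATH"]
    current = 0

    def on_sigterm(signum, frame):
        with open(ckpt, "w") as f:
            f.write(str(current))
        sys.exit(99)

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        current += 1
        time.sleep(0.01)
""")


def run_and_interrupt():
    """Start the worker, send SIGTERM, and return (exit code, saved counter)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "worker.py")
        ckpt = os.path.join(workdir, "state.txt")
        with open(script, "w") as f:
            f.write(WORKER)
        env = dict(os.environ, CKPT_PATH=ckpt)
        proc = subprocess.Popen([sys.executable, script], env=env,
                                stdout=subprocess.PIPE, text=True)
        proc.stdout.readline()  # wait for "ready": the handler is installed
        proc.send_signal(signal.SIGTERM)
        code = proc.wait(timeout=10)
        proc.stdout.close()
        with open(ckpt) as f:
            saved = int(f.read())
        return code, saved


if __name__ == "__main__":
    code, saved = run_and_interrupt()
    print(f"exit code {code}, checkpoint i={saved}")
```

Waiting for the "ready" line before sending the signal avoids a race in which SIGTERM arrives before the handler is registered and the worker is simply killed.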
     75
     76== Running Python checkpointing example on Cypress ==
     77
     78To run the Python checkpointing example with its defaults (a checkpoint every 20 application iterations and MAX_ITER=500), perform the following steps.
     79
     801. Create or edit the files '''checkpoint_runner.sh''' and '''checkpoint_signal_iter.py''' in your current directory, using the contents given above.
     81 For help editing files with nano or another editor, see [[https://wiki.hpc.tulane.edu/trac/wiki/cypress/FileEditingSoftware/Example|File Editing Example]].
     82
     832. Submit the job with the following command, which sets the checkpoint file path through CKPT_PATH instead of using the script's default.
     84
     85{{{
     86[tulaneID@cypress1 ~]$ CKPT_PATH=state_iter_py.txt sbatch checkpoint_runner.sh
     87}}}
     88
     893. Monitor the job's output with the following command, substituting the job ID for <jobID>.
     90
     91{{{
     92[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
     93}}}
     94
     954. The following are typical results for the error and output files, '''log_<jobID>.err''' and '''log_<jobID>.out'''. In this run the job was terminated and requeued four times (RESTARTS=1 through 4) before completing, although only three of the cancellations were captured in the error file.
     96
     97{{{
     98[tulaneID@cypress1 ~]$ cat log_3300699.err
     99slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:22:57 ***
     100slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:25:27 ***
     101slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:27:57 ***
     102}}}
     103 
     104{{{
     105[tulaneID@cypress1 ~]$ cat log_3300699.out
     106Info[20260313-22:18:16]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=0
     107Info[20260313-22:18:16]: Settings:
     108Info[20260313-22:18:16]: MODULE_LIST=anaconda3/2023.07
     109Info[20260313-22:18:16]: APP_CMD=python3 checkpoint_signal_iter.py
     110Info[20260313-22:18:16]: LAUNCH_MODE=direct
     111Info[20260313-22:18:16]: SRUN_ARGS=-n 1
     112Info[20260313-22:18:16]: TIME_LIMIT=00:03:00
     113Info[20260313-22:18:16]: MARGIN_SEC=60
     114Info[20260313-22:18:16]: CKPT_PATH=state_iter_py.txt
     115Info[20260313-22:18:16]: CHECKPOINT_EVERY=20
     116Info[20260313-22:18:16]: MAX_ITER=500
     117Info[20260313-22:18:16]: MAX_RESTARTS=10
     118=== BEGIN JOB SNAPSHOT (scontrol) ===
     119JobId=3300699 Name=ckpt_requeue_demo
     120   Priority=80808 Nice=0 Account=<groupID> QOS=normal
     121   JobState=RUNNING Reason=None Dependency=(null)
     122   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
     123   RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
     124   StartTime=2026-03-13T22:18:16 EndTime=2026-03-13T22:21:16
     125   Partition=centos7 AllocNode:Sid=cypress2:33768
     126=== END JOB SNAPSHOT (scontrol) ===
     127Resuming from i=0 (iteration-based every 20)
     128[periodic/iter] saved i=20
     129[periodic/iter] saved i=40
     130[periodic/iter] saved i=60
     131[periodic/iter] saved i=80
     132[periodic/iter] saved i=100
     133SIGTERM: saving i=120 and exiting 99
     134Info[20260313-22:20:17]: Program exit code (from timeout wrapper): 124
     135Info[20260313-22:20:17]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
     136Info[20260313-22:20:17]: Requeued via scontrol.
     137Info[20260313-22:20:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=1
     138Info[20260313-22:20:57]: Settings:
     139Info[20260313-22:20:57]: MODULE_LIST=anaconda3/2023.07
     140Info[20260313-22:20:57]: APP_CMD=python3 checkpoint_signal_iter.py
     141Info[20260313-22:20:57]: LAUNCH_MODE=direct
     142Info[20260313-22:20:57]: SRUN_ARGS=-n 1
     143Info[20260313-22:20:57]: TIME_LIMIT=00:03:00
     144Info[20260313-22:20:57]: MARGIN_SEC=60
     145Info[20260313-22:20:57]: CKPT_PATH=state_iter_py.txt
     146Info[20260313-22:20:57]: CHECKPOINT_EVERY=20
     147Info[20260313-22:20:57]: MAX_ITER=500
     148Info[20260313-22:20:57]: MAX_RESTARTS=10
     149=== BEGIN JOB SNAPSHOT (scontrol) ===
     150JobId=3300699 Name=ckpt_requeue_demo
     151   Priority=80808 Nice=0 Account=<groupID> QOS=normal
     152   JobState=RUNNING Reason=None Dependency=(null)
     153   Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
     154   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
     155   StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
     156   Partition=centos7 AllocNode:Sid=cypress2:33768
     157=== END JOB SNAPSHOT (scontrol) ===
     158Resuming from i=120 (iteration-based every 20)
     159[periodic/iter] saved i=140
     160[periodic/iter] saved i=160
     161[periodic/iter] saved i=180
     162[periodic/iter] saved i=200
     163[periodic/iter] saved i=220
     164SIGTERM: saving i=240 and exiting 99
     165Info[20260313-22:22:57]: Program exit code (from timeout wrapper): 124
     166Info[20260313-22:22:57]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
     167Info[20260313-22:22:57]: Requeued via scontrol.
     168Info[20260313-22:23:27]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=2
     169Info[20260313-22:23:27]: Settings:
     170Info[20260313-22:23:27]: MODULE_LIST=anaconda3/2023.07
     171Info[20260313-22:23:27]: APP_CMD=python3 checkpoint_signal_iter.py
     172Info[20260313-22:23:27]: LAUNCH_MODE=direct
     173Info[20260313-22:23:27]: SRUN_ARGS=-n 1
     174Info[20260313-22:23:27]: TIME_LIMIT=00:03:00
     175Info[20260313-22:23:27]: MARGIN_SEC=60
     176Info[20260313-22:23:27]: CKPT_PATH=state_iter_py.txt
     177Info[20260313-22:23:27]: CHECKPOINT_EVERY=20
     178Info[20260313-22:23:27]: MAX_ITER=500
     179Info[20260313-22:23:27]: MAX_RESTARTS=10
     180=== BEGIN JOB SNAPSHOT (scontrol) ===
     181JobId=3300699 Name=ckpt_requeue_demo
     182   Priority=80808 Nice=0 Account=<groupID> QOS=normal
     183   JobState=RUNNING Reason=None Dependency=(null)
     184   Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
     185   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
     186   StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
     187   Partition=centos7 AllocNode:Sid=cypress2:33768
     188=== END JOB SNAPSHOT (scontrol) ===
     189Resuming from i=240 (iteration-based every 20)
     190[periodic/iter] saved i=260
     191[periodic/iter] saved i=280
     192[periodic/iter] saved i=300
     193[periodic/iter] saved i=320
     194[periodic/iter] saved i=340
     195SIGTERM: saving i=360 and exiting 99
     196Info[20260313-22:25:27]: Program exit code (from timeout wrapper): 124
     197Info[20260313-22:25:27]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
     198Info[20260313-22:25:27]: Requeued via scontrol.
     199Info[20260313-22:25:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=3
     200Info[20260313-22:25:57]: Settings:
     201Info[20260313-22:25:57]: MODULE_LIST=anaconda3/2023.07
     202Info[20260313-22:25:57]: APP_CMD=python3 checkpoint_signal_iter.py
     203Info[20260313-22:25:57]: LAUNCH_MODE=direct
     204Info[20260313-22:25:57]: SRUN_ARGS=-n 1
     205Info[20260313-22:25:57]: TIME_LIMIT=00:03:00
     206Info[20260313-22:25:57]: MARGIN_SEC=60
     207Info[20260313-22:25:57]: CKPT_PATH=state_iter_py.txt
     208Info[20260313-22:25:57]: CHECKPOINT_EVERY=20
     209Info[20260313-22:25:57]: MAX_ITER=500
     210Info[20260313-22:25:57]: MAX_RESTARTS=10
     211=== BEGIN JOB SNAPSHOT (scontrol) ===
     212JobId=3300699 Name=ckpt_requeue_demo
     213   Priority=80808 Nice=0 Account=<groupID> QOS=normal
     214   JobState=RUNNING Reason=None Dependency=(null)
     215   Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
     216   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
     217   StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
     218   Partition=centos7 AllocNode:Sid=cypress2:33768
     219=== END JOB SNAPSHOT (scontrol) ===
     220Resuming from i=360 (iteration-based every 20)
     221[periodic/iter] saved i=380
     222[periodic/iter] saved i=400
     223[periodic/iter] saved i=420
     224[periodic/iter] saved i=440
     225[periodic/iter] saved i=460
     226SIGTERM: saving i=480 and exiting 99
     227Info[20260313-22:27:57]: Program exit code (from timeout wrapper): 124
     228Info[20260313-22:27:57]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
     229Info[20260313-22:27:57]: Requeued via scontrol.
     230Info[20260313-22:28:20]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=4
     231Info[20260313-22:28:20]: Settings:
     232Info[20260313-22:28:20]: MODULE_LIST=anaconda3/2023.07
     233Info[20260313-22:28:20]: APP_CMD=python3 checkpoint_signal_iter.py
     234Info[20260313-22:28:20]: LAUNCH_MODE=direct
     235Info[20260313-22:28:20]: SRUN_ARGS=-n 1
     236Info[20260313-22:28:20]: TIME_LIMIT=00:03:00
     237Info[20260313-22:28:20]: MARGIN_SEC=60
     238Info[20260313-22:28:20]: CKPT_PATH=state_iter_py.txt
     239Info[20260313-22:28:20]: CHECKPOINT_EVERY=20
     240Info[20260313-22:28:20]: MAX_ITER=500
     241Info[20260313-22:28:20]: MAX_RESTARTS=10
     242=== BEGIN JOB SNAPSHOT (scontrol) ===
     243JobId=3300699 Name=ckpt_requeue_demo
     244   Priority=80808 Nice=0 Account=<groupID> QOS=normal
     245   JobState=RUNNING Reason=None Dependency=(null)
     246   Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
     247   RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
     248   StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
     249   Partition=centos7 AllocNode:Sid=cypress2:33768
     250=== END JOB SNAPSHOT (scontrol) ===
     251Resuming from i=480 (iteration-based every 20)
     252[periodic/iter] saved i=500
     253Reached i=501 > 500; exiting 0
     254Info[20260313-22:28:41]: Program exit code (from timeout wrapper): 0
     255Info[20260313-22:28:41]: Completed.
     256}}}
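The restart count in the log above can be checked with a short calculation. This is a sketch under one assumption: that the runner sends TERM at about TIME_LIMIT minus MARGIN_SEC (180 s - 60 s = 120 s), which is inferred from the log timestamps (start at 22:18:16, timeout at 22:20:17) and not from the contents of '''checkpoint_runner.sh'''. The function `expected_runs` is hypothetical.

```python
import math


def expected_runs(max_iter, time_limit_s, margin_s, sec_per_iter=1.0):
    """Estimate (iterations per run, job starts needed).

    Assumes TERM arrives at time_limit_s - margin_s and that every
    iteration costs sec_per_iter seconds (the example sleeps 1 s).
    """
    iters_per_run = int((time_limit_s - margin_s) // sec_per_iter)
    # The script exits once i > MAX_ITER, so it performs MAX_ITER + 1 iterations.
    total_iters = max_iter + 1
    return iters_per_run, math.ceil(total_iters / iters_per_run)


per_run, runs = expected_runs(max_iter=500, time_limit_s=180, margin_s=60)
print(per_run, runs, runs - 1)  # prints: 120 5 4
```

The result of 120 iterations per run, 5 job starts, and 4 restarts agrees with the log, where each run advances the checkpoint by 120 and the final run reports RESTARTS=4.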