[[PageOutline]]
= Python Checkpointing Example =

== Checkpoint Runner ==

See [wiki:Workshops/JobCheckpointing/Examples#CheckpointRunner Checkpoint Runner] for the contents of the job script file '''checkpoint_runner.sh'''.

== Python Checkpointing Application ==

=== checkpoint_signal_iter.py ===

{{{
#!/usr/bin/env python3
import os
import signal
import sys
import tempfile
import time

# Settings come from the environment so the job script can override them.
CKPT = os.environ.get("CKPT_PATH", "state_iter.json")
EVERY = int(os.environ.get("CHECKPOINT_EVERY", "20"))
MAX_ITER = int(os.environ.get("MAX_ITER", "500"))
_current = None  # latest iteration counter, shared with the SIGTERM handler


def atomic_save(path: str, data: bytes):
    # Atomic write: temp file in the same directory + fsync + replace.
    dirpath = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmppath, path)
    finally:
        # Clean up the temp file if the replace did not happen.
        try:
            if os.path.exists(tmppath):
                os.remove(tmppath)
        except Exception:
            pass


def save(i: int):
    # The checkpoint is just the iteration counter, stored as text.
    atomic_save(CKPT, f"{i}".encode("utf-8"))


def load() -> int:
    # Return the saved counter, or 0 if no usable checkpoint exists.
    try:
        with open(CKPT, "rb") as f:
            return int(f.read().decode("utf-8").strip())
    except Exception:
        return 0


def on_sigterm(signum, frame):
    # Save the most recent counter, then exit with a distinctive code.
    i = _current if _current is not None else load()
    print(f"SIGTERM: saving i={i} and exiting 99", flush=True)
    save(i)
    sys.exit(99)


signal.signal(signal.SIGTERM, on_sigterm)


def main():
    global _current
    i = load()
    print(f"Resuming from i={i} (iteration-based every {EVERY})", flush=True)
    while True:
        i += 1
        _current = i
        time.sleep(1)  # stands in for one iteration of real work

        # Periodic checkpoint every EVERY iterations.
        if i % EVERY == 0:
            save(i)
            print(f"[periodic/iter] saved i={i}", flush=True)

        # Finished: save the final state and exit successfully.
        if i > MAX_ITER:
            print(f"Reached i={i} > {MAX_ITER}; exiting 0", flush=True)
            save(i)
            sys.exit(0)


if __name__ == "__main__":
    main()
}}}
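
A real application usually has to checkpoint more than a single counter. The following is a minimal sketch, not part of the workshop files, of how the same temp-file, fsync, and replace pattern could store a small dictionary as JSON. The function names `atomic_save_json` and `load_json`, the file name `state_demo.json`, and the state fields `i` and `total` are hypothetical and chosen only for illustration.

{{{
import json
import os
import tempfile


def atomic_save_json(path, state):
    """Write `state` as JSON using temp file + fsync + replace (hypothetical helper)."""
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmppath = tempfile.mkstemp(prefix=".ckpt.", dir=dirpath)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # A reader sees either the old file or the new one, never a partial write.
        os.replace(tmppath, path)
    finally:
        # Only true if the replace did not happen.
        if os.path.exists(tmppath):
            os.remove(tmppath)


def load_json(path, default):
    """Return the saved state, or a copy of `default` if the file is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return dict(default)


if __name__ == "__main__":
    # Demonstration in a throwaway directory; the state fields are illustrative only.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state_demo.json")
        state = load_json(path, {"i": 0, "total": 0.0})
        state["i"] += 1
        state["total"] += 2.5
        atomic_save_json(path, state)
        print(load_json(path, {"i": 0, "total": 0.0}))
}}}

Creating the temporary file in the same directory as the checkpoint keeps both on the same file system, which is what allows the final replace to be atomic.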

== Running Python checkpointing example on Cypress ==

To run the Python checkpointing job example with its default settings (a checkpoint every 20 application iterations and 500 iterations in total), perform the following steps.

 1. Create or edit the files '''checkpoint_runner.sh''' and '''checkpoint_signal_iter.py''' in your current directory.
    For file editing with nano, etc., see [[https://wiki.hpc.tulane.edu/trac/wiki/cypress/FileEditingSoftware/Example|File Editing Example]].

 2. Submit the job via the following command.

{{{
[tulaneID@cypress1 ~]$ CKPT_PATH=state_iter_py.txt sbatch checkpoint_runner.sh
}}}

 3. Monitor the job's output via the following command, substituting the job ID for <jobID>.

{{{
[tulaneID@cypress1 ~]$ tail -f log_<jobID>.*
}}}

 4. The following are typical contents of the error and output files, '''log_<jobID>.err''' and '''log_<jobID>.out'''. Note that the job was cancelled and requeued four times before it completed. (Not all cancellations were captured in the error file.)

{{{
[tulaneID@cypress1 ~]$ cat log_3300699.err
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:22:57 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:25:27 ***
slurmstepd: *** JOB 3300699 CANCELLED AT 2026-03-13T22:27:57 ***
}}}

{{{
[tulaneID@cypress1 ~]$ cat log_3300699.out
Info[20260313-22:18:16]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=0
Info[20260313-22:18:16]: Settings:
Info[20260313-22:18:16]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:18:16]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:18:16]: LAUNCH_MODE=direct
Info[20260313-22:18:16]: SRUN_ARGS=-n 1
Info[20260313-22:18:16]: TIME_LIMIT=00:03:00
Info[20260313-22:18:16]: MARGIN_SEC=60
Info[20260313-22:18:16]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:18:16]: CHECKPOINT_EVERY=20
Info[20260313-22:18:16]: MAX_ITER=500
Info[20260313-22:18:16]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:18:16 EndTime=2026-03-13T22:21:16
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=0 (iteration-based every 20)
[periodic/iter] saved i=20
[periodic/iter] saved i=40
[periodic/iter] saved i=60
[periodic/iter] saved i=80
[periodic/iter] saved i=100
SIGTERM: saving i=120 and exiting 99
Info[20260313-22:20:17]: Program exit code (from timeout wrapper): 124
Info[20260313-22:20:17]: Timeout TERM observed; checkpoint advanced (0->120). Requeueing...
Info[20260313-22:20:17]: Requeued via scontrol.
Info[20260313-22:20:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=1
Info[20260313-22:20:57]: Settings:
Info[20260313-22:20:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:20:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:20:57]: LAUNCH_MODE=direct
Info[20260313-22:20:57]: SRUN_ARGS=-n 1
Info[20260313-22:20:57]: TIME_LIMIT=00:03:00
Info[20260313-22:20:57]: MARGIN_SEC=60
Info[20260313-22:20:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:20:57]: CHECKPOINT_EVERY=20
Info[20260313-22:20:57]: MAX_ITER=500
Info[20260313-22:20:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=1 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:20:57 EndTime=2026-03-13T22:23:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=120 (iteration-based every 20)
[periodic/iter] saved i=140
[periodic/iter] saved i=160
[periodic/iter] saved i=180
[periodic/iter] saved i=200
[periodic/iter] saved i=220
SIGTERM: saving i=240 and exiting 99
Info[20260313-22:22:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:22:57]: Timeout TERM observed; checkpoint advanced (120->240). Requeueing...
Info[20260313-22:22:57]: Requeued via scontrol.
Info[20260313-22:23:27]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=2
Info[20260313-22:23:27]: Settings:
Info[20260313-22:23:27]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:23:27]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:23:27]: LAUNCH_MODE=direct
Info[20260313-22:23:27]: SRUN_ARGS=-n 1
Info[20260313-22:23:27]: TIME_LIMIT=00:03:00
Info[20260313-22:23:27]: MARGIN_SEC=60
Info[20260313-22:23:27]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:23:27]: CHECKPOINT_EVERY=20
Info[20260313-22:23:27]: MAX_ITER=500
Info[20260313-22:23:27]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=2 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:23:27 EndTime=2026-03-13T22:26:27
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=240 (iteration-based every 20)
[periodic/iter] saved i=260
[periodic/iter] saved i=280
[periodic/iter] saved i=300
[periodic/iter] saved i=320
[periodic/iter] saved i=340
SIGTERM: saving i=360 and exiting 99
Info[20260313-22:25:27]: Program exit code (from timeout wrapper): 124
Info[20260313-22:25:27]: Timeout TERM observed; checkpoint advanced (240->360). Requeueing...
Info[20260313-22:25:27]: Requeued via scontrol.
Info[20260313-22:25:57]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=3
Info[20260313-22:25:57]: Settings:
Info[20260313-22:25:57]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:25:57]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:25:57]: LAUNCH_MODE=direct
Info[20260313-22:25:57]: SRUN_ARGS=-n 1
Info[20260313-22:25:57]: TIME_LIMIT=00:03:00
Info[20260313-22:25:57]: MARGIN_SEC=60
Info[20260313-22:25:57]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:25:57]: CHECKPOINT_EVERY=20
Info[20260313-22:25:57]: MAX_ITER=500
Info[20260313-22:25:57]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=3 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:25:57 EndTime=2026-03-13T22:28:57
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=360 (iteration-based every 20)
[periodic/iter] saved i=380
[periodic/iter] saved i=400
[periodic/iter] saved i=420
[periodic/iter] saved i=440
[periodic/iter] saved i=460
SIGTERM: saving i=480 and exiting 99
Info[20260313-22:27:57]: Program exit code (from timeout wrapper): 124
Info[20260313-22:27:57]: Timeout TERM observed; checkpoint advanced (360->480). Requeueing...
Info[20260313-22:27:57]: Requeued via scontrol.
Info[20260313-22:28:20]: Start on cypress01-066; JOB_ID=3300699; RESTARTS=4
Info[20260313-22:28:20]: Settings:
Info[20260313-22:28:20]: MODULE_LIST=anaconda3/2023.07
Info[20260313-22:28:20]: APP_CMD=python3 checkpoint_signal_iter.py
Info[20260313-22:28:20]: LAUNCH_MODE=direct
Info[20260313-22:28:20]: SRUN_ARGS=-n 1
Info[20260313-22:28:20]: TIME_LIMIT=00:03:00
Info[20260313-22:28:20]: MARGIN_SEC=60
Info[20260313-22:28:20]: CKPT_PATH=state_iter_py.txt
Info[20260313-22:28:20]: CHECKPOINT_EVERY=20
Info[20260313-22:28:20]: MAX_ITER=500
Info[20260313-22:28:20]: MAX_RESTARTS=10
=== BEGIN JOB SNAPSHOT (scontrol) ===
JobId=3300699 Name=ckpt_requeue_demo
Priority=80808 Nice=0 Account=<groupID> QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=4 BatchFlag=1 ExitCode=0:0
RunTime=00:00:02 TimeLimit=00:03:00 TimeMin=N/A
StartTime=2026-03-13T22:28:18 EndTime=2026-03-13T22:31:18
Partition=centos7 AllocNode:Sid=cypress2:33768
=== END JOB SNAPSHOT (scontrol) ===
Resuming from i=480 (iteration-based every 20)
[periodic/iter] saved i=500
Reached i=501 > 500; exiting 0
Info[20260313-22:28:41]: Program exit code (from timeout wrapper): 0
Info[20260313-22:28:41]: Completed.
}}}
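
The number of runs in the sample output above can be estimated from the settings it prints. The following rough sketch assumes that the runner's timeout wrapper sends SIGTERM after TIME_LIMIT minus MARGIN_SEC seconds. That is inferred from the timestamps in the log, not confirmed against '''checkpoint_runner.sh''', and the function name `estimate_runs` is hypothetical.

{{{
import math


def estimate_runs(max_iter, time_limit_sec, margin_sec, sec_per_iter=1.0):
    """Estimate how many job runs are needed to finish (hypothetical helper)."""
    # Assumption: each run gets (time limit - margin) seconds of work before SIGTERM.
    budget_sec = time_limit_sec - margin_sec
    iters_per_run = int(budget_sec // sec_per_iter)
    if iters_per_run <= 0:
        raise ValueError("margin leaves no time for any iteration")
    # The application exits once i exceeds max_iter, i.e. after max_iter + 1 iterations.
    total_iters = max_iter + 1
    return math.ceil(total_iters / iters_per_run)


if __name__ == "__main__":
    # Values from the sample output: TIME_LIMIT=00:03:00, MARGIN_SEC=60, MAX_ITER=500,
    # with one second of simulated work per iteration.
    runs = estimate_runs(max_iter=500, time_limit_sec=180, margin_sec=60)
    print(f"{runs} runs, {runs - 1} requeues")
}}}

With these values the estimate is 5 runs and 4 requeues, which agrees with the RESTARTS counter going from 0 to 4 in the log. If the estimate exceeded MAX_RESTARTS (10 in the sample settings), the time limit or the iteration count would presumably need adjusting.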