Changes between Version 1 and Version 2 of Workshops/JobParallelism/AfterYourJobHasCompleted


Timestamp: 01/19/26 22:03:40 (8 hours ago)
Author: Carl Baribault
Comment: Added ideal, actual, and summary

  • Workshops/JobParallelism/AfterYourJobHasCompleted

    v1 v2  
    77== Preliminary: tools available ==
    88
    9 === LONI clusters ===
     9=== On LONI clusters ===
    1010
    1111LONI clusters provide the self-contained commands '''seff''' and '''qshow'''.
     
    2424[loniID@qbd2 ~]$ seff -v
    2525seff Version 2.1
    26 
    2726 }}}
    2827
     
    4140 }}}
    4241
    43 === Cypress ===
     42=== On Cypress ===
    4443
    4544 In the following we'll use the '''sacct''' command to analyze completed jobs on Cypress. (Cypress runs an older version of SLURM, v14.03.0, with insufficient support for the '''seff''' command.)
    4645
    47  Here are the relevant outputs that we'll need from '''sacct'''.
     46 Here are the relevant outputs that we'll be using from '''sacct'''.
    4847
    49 ||='''sacct''' output column=||=Description=||=Format=||
    50 ||'''TotalCPU'''||Total core hours used||[DD-[hh:]]mm:ss||
    51 ||'''CPUTimeRAW'''||Total core hours allocated||Seconds||
     48||='''sacct''' output column=||='''Description'''=||='''Format'''=||='''Notes'''=||
     49||'''TotalCPU'''||Total core hours used||[DD-[hh:]]mm:ss||Needs conversion to seconds||
     50||'''CPUTimeRAW'''||Total core hours allocated||Seconds||No conversion needed||
     51||'''REQMEM'''||Requested memory||GB or MB (a trailing '''n''', as in '''128Gn''' below, means per node)||Defaults to 3200MB per core||
     52||'''MaxRSS'''||Maximum memory used||KB by default (e.g. '''3860640K''' below)||Sampled every 30 seconds on Cypress||
    5253
    53 == Cumulative core efficiency: (total core hours used) / (total core hours allocated ==
     54== Cumulative core efficiency: (total core hours used) / (total core hours allocated) ==
    5455
    55  foo
     56=== Ideal case ===
     57
     58 Ideally we have '''TotalCPU''' = '''CPUTimeRAW''', as in the following example.
     59
     60 * TotalCPU=20 hours, CPUTimeRAW=20 hours - using all 20 requested cores, full time for 1 hour
     61 * Core efficiency = (20 hours TotalCPU / 20 hours CPUTimeRAW) = 1
     62
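 For comparison with the actual-case calculation further below, the same '''bc''' arithmetic applied to these ideal-case numbers gives a core efficiency of exactly 1:

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2; 20 / 20"
1.00
}}}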
     63=== Actual case ===
     64
     65==== Using sacct ====
     66
     67 Here is the sacct command used for a completed job, where we've masked the job ID as XXXXXXX.
     68
     69{{{
     70[tulaneID@cypress1 ~]$sacct -P -n --format JobID,AllocCPUS,TotalCPU,CPUTimeRaw,REQMEM,MaxRSS -j XXXXXXX
     71XXXXXXX|10|11-04:18:08|1213660|128Gn|
     72XXXXXXX.batch|1|11-04:18:08|121366|128Gn|3860640K
     73}}}
     74
     75 In the following we'll use TotalCPU=11-04:18:08 (the same on both lines above) and CPUTimeRAW=1213660 from the 1st line, the job-level record, whose CPUTimeRAW covers all 10 allocated cores. (The '''XXXXXXX.batch''' step's CPUTimeRAW of 121366 covers only its single allocated core.)
     76
     77==== Converting TotalCPU to seconds ====
     78
     79 We'll use the following shell function to convert '''TotalCPU''' from the format [DD-[hh:]]mm:ss to seconds.
     80
     81
     82{{{
     83[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds() {
     84   seconds=$(echo "$1" | awk -F'[:-]' '{
     85      if (NF == 4) {
     86          # Format: D-HH:MM:SS
     87          total = ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4
     88      } else if (NF == 3) {
     89          # Format: HH:MM:SS or MM:SS (assumes HH:MM:SS)
     90          total = ($1 * 3600) + ($2 * 60) + $3
     91      } else if (NF == 2) {
     92          # Format: MM:SS
     93          total = ($1 * 60) + $2
     94      } else {
     95          total = $1 # Assume only seconds if no separators found
     96      }
     97      print total
     98   }')
     99
     100   echo "$seconds"
     101}
     102[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds 11-04:18:08
     103965888
     104}}}
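 As a quick sanity check on that conversion (plain arithmetic on the value above), 11 days, 4 hours, 18 minutes, and 8 seconds works out to the same number of seconds:

{{{
[tulaneID@cypress1 ~]$bc <<< "11*86400 + 4*3600 + 18*60 + 8"
965888
}}}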
     105
     106=== Compute cumulative core efficiency ===
     107
     108 Now that we have the job's '''TotalCPU''' in seconds, we can calculate the job's cumulative core efficiency.
     109
     110{{{
     111[tulaneID@cypress1 ~]$bc <<< "scale=2; 965888 / 1213660"
     112.79
     113}}}
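 The individual steps above can also be wrapped into a single helper. The following is only a hypothetical sketch: the function name '''core_efficiency''' is our own, it assumes the '''convert_totalcpu_to_seconds''' function from the previous section is already defined in the shell, and it reads the job-level record, whose CPUTimeRaw covers all allocated cores.

{{{
# Hypothetical helper; assumes convert_totalcpu_to_seconds (defined above) is available.
core_efficiency() {
   local record totalcpu cputimeraw used_seconds
   # Take the first record returned for the job: the job-level line, not the .batch step
   record=$(sacct -P -n --format JobID,TotalCPU,CPUTimeRaw -j "$1" | head -n 1)
   totalcpu=$(echo "$record" | cut -d'|' -f2)
   cputimeraw=$(echo "$record" | cut -d'|' -f3)
   used_seconds=$(convert_totalcpu_to_seconds "$totalcpu")
   bc <<< "scale=2; $used_seconds / $cputimeraw"
}
# For the example job above, this would print .79
core_efficiency XXXXXXX
}}}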
     114
     115=== Summary for this job ===
     116
     117==== Fewer requested resources = faster job queueing ====
     118
     119 In general, whenever a job can still run to completion in a comparable elapsed time while requesting less memory and/or fewer processors (cores and/or nodes), the resource manager SLURM will find it easier to locate an earlier time slot, if not an immediate one, in which to queue (start and run) the job.
     120
     121==== Suggestions for requested processor count and RAM ====
     122
     123 * With the above result of 0.79, we conclude that not all 10 requested cores were in use throughout the duration of the job.
     124   * We may be able to request fewer cores depending on the requirements of the parallel segments of the computation.
     125   * We should consult the software provider's information.
     126 * Also, the job used ~3.9GB ('''MaxRSS''') out of the requested 128GB ('''REQMEM''') of RAM.
     127   * We could reasonably expect the job to run in the same amount of time while requesting 10 cores and greatly reduced memory, say '''!--mem=32000''' (32GB), as in the sketch below.
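 A minimal sketch of how the reduced request might look among the job script's '''#SBATCH''' directives follows. The specific directives are assumptions for illustration only; the original script is not shown, and whether the 10 cores should be requested via '''!--ntasks''' or '''!--cpus-per-task''' depends on how the application is parallelized.

{{{
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10   # keep the 10 cores, in case the parallel segments still need them
#SBATCH --mem=32000          # 32GB instead of the original 128GB request
# ... remaining directives and application commands unchanged ...
}}}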