Changes between Version 1 and Version 2 of Workshops/JobParallelism/AfterYourJobHasCompleted


Timestamp: 01/19/26 22:03:40 (8 hours ago)
Author: Carl Baribault
Comment: Added ideal, actual, and summary

  • Workshops/JobParallelism/AfterYourJobHasCompleted

    v1 v2  
    77== Preliminary: tools available ==
    88
    9 === LONI clusters ===
     9=== On LONI clusters ===
    1010
    1111LONI clusters provide the self-contained commands '''seff''' and '''qshow'''.
     
    2424[loniID@qbd2 ~]$ seff -v
    2525seff Version 2.1
    26 
    2726 }}}
    2827
     
    4140 }}}
    4241
    43 === Cypress ===
     42=== On Cypress ===
    4443
    4544 In the following we'll use the '''sacct''' command to analyze completed jobs on Cypress. (Cypress runs an older version of SLURM, v14.03.0, with insufficient support for the '''seff''' command.)
    4645
    47  Here are the relevant outputs that we'll need from '''sacct'''.
     46 Here are the relevant outputs that we'll be using from '''sacct'''.
    4847
    49 ||='''sacct''' output column=||=Description=||=Format=||
    50 ||'''TotalCPU'''||Total core hours used||[DD-[hh:]]mm:ss||
    51 ||'''CPUTimeRAW'''||Total core hours allocated||Seconds||
     48||='''sacct''' output column=||='''Description'''=||='''Format'''=||='''Notes'''=||
     49||'''TotalCPU'''||Total core hours used||[DD-[hh:]]mm:ss||Needs conversion to seconds||
     50||'''CPUTimeRAW'''||Total core hours allocated||Seconds||No conversion needed||
     51||'''REQMEM'''||Requested memory||GB or MB (a trailing '''n''', as in '''128Gn''' below, means per node)||Defaults to 3200MB per core||
     52||'''MaxRSS'''||Maximum memory used||KB by default (e.g. '''3860640K''' below)||Sampled every 30 seconds on Cypress||
    5253
    53 == Cumulative core efficiency: (total core hours used) / (total core hours allocated ==
     54== Cumulative core efficiency: (total core hours used) / (total core hours allocated) ==
    5455
    55  foo
     56=== Ideal case ===
     57
     58 Ideally we have '''TotalCPU''' = '''CPUTimeRAW''', as in the following example.
     59
     60 * TotalCPU=20 hours, CPUTimeRAW=20 hours - using all 20 requested cores, full time for 1 hour
     61 * Core efficiency = (20 hours TotalCPU / 20 hours CPUTimeRAW) = 1
     62
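 For comparison with the actual-case calculation further below, the same '''bc''' arithmetic applied to these ideal-case numbers gives a core efficiency of exactly 1:

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2; 20 / 20"
1.00
}}}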
     63=== Actual case ===
     64
     65==== Using sacct ====
     66
     67 Here is the sacct command used for a completed job, where we've masked the job ID as XXXXXXX.
     68
     69{{{
     70[tulaneID@cypress1 ~]$sacct -P -n --format JobID,AllocCPUS,TotalCPU,CPUTimeRaw,REQMEM,MaxRSS -j XXXXXXX
     71XXXXXXX|10|11-04:18:08|1213660|128Gn|
     72XXXXXXX.batch|1|11-04:18:08|121366|128Gn|3860640K
     73}}}
     74
     75 In the following we'll use TotalCPU=11-04:18:08 (the same on both lines above) and CPUTimeRAW=1213660 from the 1st line, the job-level record, whose CPUTimeRAW covers all 10 allocated cores. (The '''XXXXXXX.batch''' step's CPUTimeRAW of 121366 covers only its single allocated core.)
     76
     77==== Converting TotalCPU to seconds ====
     78
     79 We'll use the following shell function to convert '''TotalCPU''' from the format [DD-[hh:]]mm:ss to seconds.
     80
     81
     82{{{
     83[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds() {
     84   seconds=$(echo "$1" | awk -F'[:-]' '{
     85      if (NF == 4) {
     86          # Format: D-HH:MM:SS
     87          total = ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4
     88      } else if (NF == 3) {
     89          # Format: HH:MM:SS or MM:SS (assumes HH:MM:SS)
     90          total = ($1 * 3600) + ($2 * 60) + $3
     91      } else if (NF == 2) {
     92          # Format: MM:SS
     93          total = ($1 * 60) + $2
     94      } else {
     95          total = $1 # Assume only seconds if no separators found
     96      }
     97      print total
     98   }')
     99
     100   echo "$seconds"
     101}
     102[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds 11-04:18:08
     103965888
     104}}}
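 As a quick sanity check on that conversion (plain arithmetic on the value above), 11 days, 4 hours, 18 minutes, and 8 seconds works out to the same number of seconds:

{{{
[tulaneID@cypress1 ~]$bc <<< "11*86400 + 4*3600 + 18*60 + 8"
965888
}}}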
     105
     106=== Compute cumulative core efficiency ===
     107
     108 Now that we have the job's '''TotalCPU''' in seconds, we can calculate the job's cumulative core efficiency.
     109
     110{{{
     111[tulaneID@cypress1 ~]$bc <<< "scale=2; 965888 / 1213660"
     112.79
     113}}}
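 The individual steps above can also be wrapped into a single helper. The following is only a hypothetical sketch: the function name '''core_efficiency''' is our own, it assumes the '''convert_totalcpu_to_seconds''' function from the previous section is already defined in the shell, and it reads the job-level record, whose CPUTimeRaw covers all allocated cores.

{{{
# Hypothetical helper; assumes convert_totalcpu_to_seconds (defined above) is available.
core_efficiency() {
   local record totalcpu cputimeraw used_seconds
   # Take the first record returned for the job: the job-level line, not the .batch step
   record=$(sacct -P -n --format JobID,TotalCPU,CPUTimeRaw -j "$1" | head -n 1)
   totalcpu=$(echo "$record" | cut -d'|' -f2)
   cputimeraw=$(echo "$record" | cut -d'|' -f3)
   used_seconds=$(convert_totalcpu_to_seconds "$totalcpu")
   bc <<< "scale=2; $used_seconds / $cputimeraw"
}
# For the example job above, this would print .79
core_efficiency XXXXXXX
}}}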
     114
     115=== Summary for this job ===
     116
     117==== Fewer requested resources = faster job queueing ====
     118
     119 In general, whenever a job can still run to completion in a comparable elapsed time while requesting less memory and/or fewer processors (cores and/or nodes), the resource manager SLURM will find it easier to locate an earlier time slot, if not an immediate one, in which to queue (start and run) the job.
     120
     121==== Suggestions for requested processor count and RAM ====
     122
     123 * With the above result of 0.79, we conclude that not all 10 requested cores were in use throughout the duration of the job.
     124   * We may be able to request fewer cores depending on the requirements of the parallel segments of the computation.
     125   * We should consult the software provider's information.
     126 * Also, the job used ~3.9GB ('''MaxRSS''') out of the requested 128GB ('''REQMEM''') of RAM.
     127   * We could reasonably expect the job to run in the same amount of time while requesting 10 cores and greatly reduced memory, say '''!--mem=32000''' (32GB), as in the sketch below.
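 A minimal sketch of how the reduced request might look among the job script's '''#SBATCH''' directives follows. The specific directives are assumptions for illustration only; the original script is not shown, and whether the 10 cores should be requested via '''!--ntasks''' or '''!--cpus-per-task''' depends on how the application is parallelized.

{{{
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10   # keep the 10 cores, in case the parallel segments still need them
#SBATCH --mem=32000          # 32GB instead of the original 128GB request
# ... remaining directives and application commands unchanged ...
}}}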