After your job has completed - determining cumulative core efficiency
Assumptions
See Assumptions - same as for running jobs.
Preliminary: tools available
On LONI clusters
LONI clusters provide the self-contained commands seff and qshow.
- seff (See seff on github.)
On LONI QB4 cluster:
[loniID@qbd2 ~]$ seff -h
Usage: seff [Options] <Jobid>
Options:
-h Help menu
-v Version
-d Debug mode: display raw Slurm data
[loniID@qbd2 ~]$ seff -v
seff Version 2.1
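For a completed job on a LONI cluster, seff is then run directly on the job ID (shown below as the placeholder from the usage message); it reports the job's CPU and memory usage in one step, without the manual sacct-based procedure described below for Cypress.
[loniID@qbd2 ~]$ seff <Jobid>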
- qshow (provided by LONI)
On LONI QB4 cluster:
[loniID@qbd2 ~]$ qshow -h
** usage: qshow -n <options> <base-name> <begin #> <end #> <command> ...
Show and optionally kill user processes on remote nodes or execute commands...
[loniID@qbd2 ~]$ qshow -v
qshow 2.74
On Cypress
In the following we'll use the sacct command to analyze completed jobs on Cypress. (Cypress runs an older version of SLURM, v14.03.0, which lacks sufficient support for the seff command.)
Here are the relevant sacct output columns that we'll be using.
| sacct output column | Description | Format | Notes |
|---|---|---|---|
| TotalCPU | Total core hours used | [DD-[hh:]]mm:ss | Needs conversion to seconds |
| CPUTimeRAW | Total core hours allocated | Seconds | No conversion needed |
| REQMEM | Requested memory | GB or MB | Defaults to 3200MB per core |
| MaxRSS | Maximum memory used | Per node, with a unit suffix (e.g. K) | Sampled every 30 seconds on Cypress |
Cumulative core efficiency: (total core hours used) / (total core hours allocated)
Ideal case
Ideally we have TotalCPU = CPUTimeRAW, as in the following example.
- TotalCPU=20 hours, CPUTimeRAW=20 hours - using all 20 requested cores, full time for 1 hour
- Core efficiency = (20 hours TotalCPU / 20 hours CPUTimeRAW) = 1
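As a sanity check, the same bc computation used later for the actual job confirms this ideal ratio (20 core-hours = 72000 core-seconds for both values):
[tulaneID@cypress1 ~]$bc <<< "scale=2; 72000 / 72000"
1.00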
Actual case
Using sacct
Here is the sacct command used for a completed job, where we've masked the job ID as XXXXXXX.
[tulaneID@cypress1 ~]$sacct -P -n --format JobID,AllocCPUS,TotalCPU,CPUTimeRaw,REQMEM,MaxRSS -j XXXXXXX
XXXXXXX|10|11-04:18:08|1213660|128Gn|
XXXXXXX.batch|1|11-04:18:08|121366|128Gn|3860640K
In the following we'll use TotalCPU=11-04:18:08 and CPUTimeRAW=1213660 from the first line (the job record) in the above; the XXXXXXX.batch step reports the same TotalCPU, but its CPUTimeRaw covers only its single allocated core.
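As an optional shortcut (a sketch, not required for what follows), the two values can be pulled from the pipe-delimited sacct output with awk; the field positions correspond to the --format list used above:
[tulaneID@cypress1 ~]$sacct -P -n --format JobID,AllocCPUS,TotalCPU,CPUTimeRaw -j XXXXXXX | awk -F'|' 'NR==1 {print "TotalCPU=" $3, "CPUTimeRAW=" $4}'
TotalCPU=11-04:18:08 CPUTimeRAW=1213660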
Converting TotalCPU to seconds
We'll use the following shell function to convert TotalCPU in the format [DD-[hh:]]mm:ss to seconds.
[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds() {
seconds=$(echo "$1" | awk -F'[:-]' '{
if (NF == 4) {
# Format: D-HH:MM:SS
total = ($1 * 86400) + ($2 * 3600) + ($3 * 60) + $4
} else if (NF == 3) {
# Format: HH:MM:SS
total = ($1 * 3600) + ($2 * 60) + $3
} else if (NF == 2) {
# Format: MM:SS
total = ($1 * 60) + $2
} else {
total = $1 # Assume only seconds if no separators found
}
print total
}')
echo "$seconds"
}
[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds 11-04:18:08
965888
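The same function handles the shorter formats; for example, a TotalCPU of 04:18:08 (no day component) converts as follows:
[tulaneID@cypress1 ~]$convert_totalcpu_to_seconds 04:18:08
15488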
Compute cumulative core efficiency
Now that we have the job's TotalCPU in seconds, we can calculate the job's cumulative core efficiency.
[tulaneID@cypress1 ~]$bc <<< "scale=2; 965888 / 1213660"
.79
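If you analyze completed jobs regularly, the two steps can be combined into a small helper function (a sketch of our own; the name core_efficiency and its use of the convert_totalcpu_to_seconds function defined above are not part of SLURM):
[tulaneID@cypress1 ~]$core_efficiency() {
  # Usage: core_efficiency <TotalCPU> <CPUTimeRAW>
  local used=$(convert_totalcpu_to_seconds "$1")
  bc <<< "scale=2; $used / $2"
}
[tulaneID@cypress1 ~]$core_efficiency 11-04:18:08 1213660
.79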
Summary for this job
Fewer requested resources = faster job queueing
In general, whenever a job can still run to completion in a comparable elapsed time while requesting less memory and/or fewer processors (cores and/or nodes), the SLURM resource manager will more easily find an earlier time slot - possibly immediately - in which to start and run the job.
Suggestions for requested processor count and RAM
- With the above result of 0.79, we conclude that not all 10 requested cores were in use throughout the duration of the job.
- We may be able to request fewer cores depending on the requirements of the parallel segments of the computation.
- We should consult the software provider's information.
- Also, the job used only ~3.9GB (MaxRSS) of the requested 128GB (REQMEM) of RAM.
- We could reasonably expect the job to run in the same amount of time while requesting 10 cores and greatly reduced memory, say --mem=32000 (32GB); see the sketch following this list.
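Here is a minimal sketch of the corresponding resource-request directives for a resubmitted job script. Only the core count and memory values come from the discussion above; whether the 10 cores are expressed via --ntasks or --cpus-per-task depends on how your application parallelizes (shown here with --cpus-per-task as an assumption), and any other directives in your actual script stay as they are.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10   # same 10 cores as before (or fewer, per the notes above)
#SBATCH --mem=32000          # 32GB requested instead of the original 128GB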
