WhileYourJobIsRunning

Timestamp:: 01/18/2026 08:16:26 PM (8 weeks ago)
Author:: Carl Baribault
Comment:: first iteration for the page

Workshops/JobParallelism/WhileYourJobIsRunning

               v1
+[[PageOutline]]
+= While your job is running =
+== Assumptions ==
+* Request sufficient processor resources
+ For running jobs let's assume that you've requested sufficient processor resources via the following. (See '''man sbatch'''.)
+ ||='''Description'''=||='''SBATCH options'''=||='''Default value'''=||='''Maximum value'''=||
+ || # of nodes || -N, --nodes || Subject to -n and -c options || See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)]||
+ || # of tasks || -n, --ntasks || 1 task per node || 20 * (# of nodes) ||
+ || # of cores/CPUs/processors per task || -c, --cpus-per-task || 1 core per task || 20 ||
+ || total Random Access Memory (RAM) || --mem || 1 core per task || 64/128/256GB ||
+ || RAM per core || --mem-per-cpu || 3200MB || " " " ||
+* Your job may need more memory than its proportional share of the node's RAM, (# of cores) * (total RAM) / 20.
+ * For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM.
+  ||--ntasks=1||
+  ||--cpus-per-task=10||
+  ||--mem=128G||
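The proportional share mentioned above is simple arithmetic. The following is a minimal sketch, assuming 20 cores per node as in the table; the helper name `proportional_mem_gb` is hypothetical and not part of SLURM. If your job needs more than the value it prints, request the memory explicitly with '''--mem'''.

```shell
#!/bin/bash
# Hypothetical helper (not a SLURM command): proportional RAM share in GB
# for a job using CORES cores on a node with NODE_RAM_GB of RAM.
# Assumes 20 cores per node, as in the table above.
proportional_mem_gb() {
    local cores=$1 node_ram_gb=$2
    awk -v c="$cores" -v r="$node_ram_gb" 'BEGIN { printf "%.1f\n", c * r / 20 }'
}

# 10 cores on a node with 128GB of RAM
proportional_mem_gb 10 128
```

For the example above this gives 64.0, that is, half of the node's RAM, whereas the job needs all of it.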
+== Example 1: an idev job's core efficiency: (actual core usage) / (requested core allocation) ==
+. Log in to Cypress.
+. Use the SLURM '''squeue''' command to determine your job's node list - in this case an idle interactive session.
+{{{
+[tulaneID@cypress1 ~]$squeue -u $USER
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+           3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
+}}}
+. For each node in the job, combine the '''ssh''' and '''top''' commands to determine your job's core usage on that node, as follows. (See '''man top'''.)
+ Here are the relevant output columns for the '''top''' command.
+ ||='''top''' output column=||=Description=||=Notes=||
+ ||%CPU||percentage of cores used per job process||sum(%CPU)/100=fractional # of cores in use on the node||
+ ||%MEM||percentage of RAM used per job process||sum(%MEM)=percentage of node's total RAM in use||
+ Here's the combined command and result.
+{{{
+[tulaneID@cypress1 ~]$ssh cypress01-059 top -b -n 1 -u $USER
+top - 00:34:01 up 75 days,  9:23,  1 user,  load average: 0.06, 0.05, 0.01
+Tasks: 730 total,   1 running, 729 sleeping,   0 stopped,   0 zombie
+Cpu(s): 52.7%us,  0.7%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
+Mem:  66013252k total,  3374764k used, 62638488k free,   137060k buffers
+Swap: 12582904k total,        0k used, 12582904k free,  1280248k cached
+   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
+tulaneID  20   0 27880 1772  956 R  3.8  0.0   0:00.09 top
+tulaneID  20   0  9200 1244 1044 S  0.0  0.0   0:00.00 slurm_script
+tulaneID  20   0  4072  544  464 S  0.0  0.0   0:00.00 sleep
+tulaneID  20   0  144m 2368 1164 S  0.0  0.0   0:00.00 sshd
+tulaneID  20   0 25092 3100 1516 S  0.0  0.0   0:00.06 bash
+tulaneID  20   0  144m 2316 1132 S  0.0  0.0   0:00.00 sshd
+}}}
+. Next we'll re-run the same combined '''ssh...top...''' command and pipe the output to '''awk''' to sum the values in the %CPU and %MEM columns.
+{{{
+[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
+   awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
+   END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
+Total %CPU: 3.8
+Total %MEM: 0
+}}}
+. If the idev session requested 20 cores (default=20), then the core efficiency of the idle idev session is
+{{{
+(3.8 / 100) / 20 = 0.0019
+}}}
+  This is far from the ideal value of 1: the idle session uses almost none of its 20 requested cores.
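The summing and dividing steps of Example 1 can be wrapped in two small shell functions. This is only a sketch: the names `sum_top_usage` and `core_efficiency` are hypothetical, and instead of the fixed `NR > 7` and `$9`/`$10` used above, it looks up the %CPU and %MEM columns by name in the header line, on the assumption that the header of '''top''' contains both labels.

```shell
#!/bin/bash
# Hypothetical helper: sum the %CPU and %MEM columns of `top -b -n 1`
# output read from stdin. The columns are located by name in the header
# line rather than by a fixed line count or field number.
sum_top_usage() {
    awk '
        !found {                          # still looking for the header line
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") cpu = i
                if ($i == "%MEM") mem = i
            }
            if (cpu && mem) found = 1
            next
        }
        NF { sum_cpu += $cpu; sum_mem += $mem }   # one process per row
        END { printf "%.1f %.1f\n", sum_cpu, sum_mem }
    '
}

# Hypothetical helper: (total %CPU / 100) / (requested cores)
core_efficiency() {
    local total_cpu=$1 requested_cores=$2
    awk -v t="$total_cpu" -v n="$requested_cores" \
        'BEGIN { printf "%.4f\n", (t / 100) / n }'
}

# The idle idev session from Example 1: 3.8 %CPU on 20 requested cores
core_efficiency 3.8 20
```

The first function would be used as `ssh cypress01-059 'top -b -n 1 -u $USER' | sum_top_usage`, and its first output value passed to `core_efficiency` together with the number of cores requested.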
+== Example 2: a running batch job's core efficiency ==
+ The following is an example of the core usage for the R sampling code (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]) requesting 16 cores.
+{{{
+[tulaneID@cypress1 R]$sbatch bootstrap.sh
+Submitted batch job 3289740
+[tulaneID@cypress1 R]$squeue -u $USER
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+           3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
+[tulaneID@cypress1 R]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
+awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
+END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
+Total %CPU: 1556.6
+Total %MEM: 3.3
+}}}
+ The resulting core efficiency is
+{{{
+(1556.6 / 100) / 16 = 0.97
+}}}
+ This is quite close to the ideal value, 1 - fairly good usage of the node's 16 requested cores.
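Both examples above involve a single node. For a job that spans several nodes, the same '''ssh...top...''' check can be repeated per node and the totals added. The sketch below assumes that you can ssh to each of your job's nodes as in the examples, and that the '''top''' output has the same layout as above (7 header lines, %CPU in field 9); the function names `sum_cpu_column` and `job_total_cpu` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum the %CPU column (field 9) of `top -b -n 1`
# output, skipping the 7 header lines as in the examples above.
sum_cpu_column() {
    awk 'NR > 7 { sum += $9 } END { printf "%.1f\n", sum }'
}

# Hypothetical helper: total %CPU of your processes across all nodes
# of the running job JOBID.
job_total_cpu() {
    local jobid=$1 node node_cpu total=0
    # squeue -h suppresses the header and -o %N prints the job's node list;
    # scontrol show hostnames expands it to one host name per line.
    for node in $(scontrol show hostnames "$(squeue -h -j "$jobid" -o %N)"); do
        node_cpu=$(ssh "$node" 'top -b -n 1 -u $USER' | sum_cpu_column)
        total=$(awk -v a="$total" -v b="$node_cpu" 'BEGIN { printf "%.1f", a + b }')
    done
    echo "$total"
}
```

Dividing the result of, say, `job_total_cpu 3289740` by 100 and then by the total number of cores requested gives the job's overall core efficiency.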