Changes between Initial Version and Version 1 of Workshops/JobParallelism/WhileYourJobIsRunning


Timestamp:
01/18/26 20:16:26 (2 days ago)
Author:
Carl Baribault
Comment:

first iteration for the page

  • Workshops/JobParallelism/WhileYourJobIsRunning

    v1 v1  
     1[[PageOutline]]
     2= While your job is running =
     3
     4== Assumptions ==
     5* Request sufficient processor resources
     6
     7 For running jobs let's assume that you've requested sufficient processor resources via the following. (See '''man sbatch'''.)
     8 ||='''Description'''=||='''SBATCH options'''=||='''Default value'''=||='''Maximum value'''=||
     9 || # of nodes || -N, --nodes || Subject to -n and -c options || See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)]||
     10 || # of tasks || -n, --ntasks || 1 task per node || 20 * (# of nodes) ||
     11 || # of cores/CPUs/processors per task || -c, --cpus-per-task || 1 core per task || 20 ||
     12 || total Random Access Memory (RAM) || --mem || (# of cores) * (default RAM per core) || 64/128/256GB ||
     13 || RAM per core || --mem-per-cpu || 3200MB || " " " ||
     14
     15* Your job's memory requirement may be greater than its proportional share of a node's RAM, (# cores) * (total RAM)/20.
     16 * For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM.
     17
     18  ||--ntasks=1||
     19  ||--cpus-per-task=10||
     20  ||--mem=128G||
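
 The options above could appear in a batch script header as follows. This is a minimal sketch, not a script from the workshop: the variable names are hypothetical, and the 20 cores per node and 128GB node size are taken from the table and example above. Note that '''--mem''' is read as megabytes unless a unit suffix such as G is given (see '''man sbatch''').

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=128G        # unit suffix G; a bare number would mean megabytes

# Compare the request with the proportional share (# cores) * (total RAM)/20.
# The values below are assumptions for illustration, matching the example above.
cores=${SLURM_CPUS_PER_TASK:-10}   # set by SLURM inside a job; 10 otherwise
node_ram_gb=128
cores_per_node=20
share_gb=$(( cores * node_ram_gb / cores_per_node ))

echo "Proportional RAM share for ${cores} cores: ${share_gb}GB of ${node_ram_gb}GB"
```

 With 10 cores the proportional share is 64GB, so a job that needs the whole node's RAM must ask for it explicitly with '''--mem'''.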
     21
     22== Example 1: an idev job's core efficiency: (actual core usage) / (requested core allocation) ==
     23 1. Log in to Cypress.
     24 2. Use the SLURM '''squeue''' command to determine your job's node list - in this case an idle interactive session.
     25 
     26{{{
     27[tulaneID@cypress1 ~]$squeue -u $USER
     28             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     29           3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
     30}}}
     31 3. For each node in your job's node list, combine the '''ssh''' and '''top''' commands to determine your job's core usage on that node, as in the following. (See '''man top'''.)
     32
     33 Here are the relevant output columns for the '''top''' command.
     34
     35 ||='''top''' output column=||=Description=||=Notes=||
     36 ||%CPU||percentage of a single core used by each job process (100 = one full core)||sum(%CPU)/100=fractional # of cores in use on the node||
     37 ||%MEM||percentage of RAM used per job process||sum(%MEM)=percentage of node's total RAM in use||
     38
     39 Here's the combined command and result.
     40
     41{{{
     42[tulaneID@cypress1 ~]$ssh cypress01-059 top -b -n 1 -u $USER
     43top - 00:34:01 up 75 days,  9:23,  1 user,  load average: 0.06, 0.05, 0.01
     44Tasks: 730 total,   1 running, 729 sleeping,   0 stopped,   0 zombie
     45Cpu(s): 52.7%us,  0.7%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
     46Mem:  66013252k total,  3374764k used, 62638488k free,   137060k buffers
     47Swap: 12582904k total,        0k used, 12582904k free,  1280248k cached
     48
     49   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     50 31502 tulaneID  20   0 27880 1772  956 R  3.8  0.0   0:00.09 top
     51 30927 tulaneID  20   0  9200 1244 1044 S  0.0  0.0   0:00.00 slurm_script
     52 30953 tulaneID  20   0  4072  544  464 S  0.0  0.0   0:00.00 sleep
     53 30970 tulaneID  20   0  144m 2368 1164 S  0.0  0.0   0:00.00 sshd
     54 30971 tulaneID  20   0 25092 3100 1516 S  0.0  0.0   0:00.06 bash
     55 31501 tulaneID  20   0  144m 2316 1132 S  0.0  0.0   0:00.00 sshd
     56
     57}}}
     58
     59 4. Next we'll re-run the same combined '''ssh...top...''' command and pipe the output to '''awk''' in order to sum the values in the %CPU and %MEM columns.
     60
     61{{{
     62[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
     63   awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
     64   END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
     65Total %CPU: 3.8
     66Total %MEM: 0
     67
     68}}}
     69 5. If the idev session requested 20 cores (default=20), then the core efficiency of the idle idev session is
     70{{{
     71(3.8 / 100) / 20 = 0.0019
     72}}}
     73  This is far from the ideal value of 1: the idle session uses almost none of the node's 20 requested cores.
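
 Steps 4 and 5 can be combined into two small helpers. The following is a minimal sketch rather than part of the original session: the function names '''sum_top_usage''' and '''core_efficiency''' are hypothetical, and the input is a few rows copied from the '''top''' output in step 3 instead of a live node. It finds the %CPU and %MEM columns from the header row rather than assuming a fixed number of header lines.

```shell
#!/bin/bash
# Sum the %CPU and %MEM columns of `top -b -n 1` output read from stdin.
# Prints two numbers on one line: total %CPU and total %MEM.
sum_top_usage() {
  awk '
    $1 == "PID" {                 # header row: record the column positions
      for (i = 1; i <= NF; i++) {
        if ($i == "%CPU") c = i
        if ($i == "%MEM") m = i
      }
      next
    }
    c && m && NF >= m { cpu += $c; mem += $m }   # process rows after the header
    END { printf "%.1f %.1f\n", cpu, mem }
  '
}

# core efficiency = (total %CPU / 100) / (requested cores)
core_efficiency() {
  awk -v cpu="$1" -v cores="$2" 'BEGIN { printf "%.4f\n", (cpu / 100) / cores }'
}

# Sample rows copied from the top output in step 3 (not a live query).
sample='   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 31502 tulaneID  20   0 27880 1772  956 R  3.8  0.0   0:00.09 top
 30927 tulaneID  20   0  9200 1244 1044 S  0.0  0.0   0:00.00 slurm_script
 30971 tulaneID  20   0 25092 3100 1516 S  0.0  0.0   0:00.06 bash'

totals=$(printf '%s\n' "$sample" | sum_top_usage)
total_cpu=${totals% *}    # text before the space
total_mem=${totals#* }    # text after the space

echo "Total %CPU: $total_cpu, Total %MEM: $total_mem"
echo "Core efficiency: $(core_efficiency "$total_cpu" 20)"
```

 On a live node the sample would be replaced by the piped command from step 4, for example '''ssh cypress01-059 'top -b -n 1 -u $USER' | sum_top_usage'''.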
     74
     75== Example 2: a running batch job's core efficiency ==
     76
     77 The following is an example of the core usage for the R sampling code (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]) requesting 16 cores.
     78
     79{{{
     80[tulaneID@cypress1 R]$sbatch bootstrap.sh
     81Submitted batch job 3289740
     82[tulaneID@cypress1 R]$squeue -u $USER
     83             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     84           3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
     85[tulaneID@cypress1 R]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
     86awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
     87END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
     88Total %CPU: 1556.6
     89Total %MEM: 3.3
     90
     91}}}
     92
     93 The resulting core efficiency is
     94{{{
     95(1556.6 / 100) / 16 = 0.97
     96}}}
     97 This is quite close to the ideal value, 1 - fairly good usage of the node's 16 requested cores.