[[PageOutline]]
= While your job is running =

== Assumptions ==
* Request sufficient processor resources

For running jobs, let's assume that you've requested sufficient processor resources via the following. (See '''man sbatch'''.)
||='''Description'''=||='''SBATCH options'''=||='''Default value'''=||='''Maximum value'''=||
|| # of nodes || -N, --nodes || Subject to -n and -c options || See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)]||
|| # of tasks || -n, --ntasks || 1 task per node || 20 * (# of nodes) ||
|| # of cores/CPUs/processors per task || -c, --cpus-per-task || 1 core per task || 20 ||
|| total Random Access Memory (RAM) || --mem || 3200MB per allocated core || 64/128/256GB ||
|| RAM per core || --mem-per-cpu || 3200MB || 64/128/256GB ||

* Your job's memory requirement may be greater than (# of requested cores) * (total node RAM) / 20.
* For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM. In that case you could request:

||--ntasks=1||
||--cpus-per-task=10||
||--mem=128G||
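
Putting these options together, a job script for this scenario might start as follows. This is a minimal sketch: the job name is hypothetical and the partition is assumed from the '''squeue''' output later on this page, so adjust both for your own job. Note that '''--mem''' takes megabytes when no unit suffix is given, so the '''G''' suffix is needed to request 128GB.

{{{
#!/bin/bash
# Hypothetical job name; adjust for your own job.
#SBATCH --job-name=bigmem_example
# Assumed partition; adjust for your allocation.
#SBATCH --partition=centos7
# One node, one task, and 10 cores for that task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
# All of the node's 128GB of RAM; without a suffix, --mem is in MB.
#SBATCH --mem=128G

# ... your commands here ...
}}}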
== Example 1: an idev job's core efficiency: (actual core usage) / (requested core allocation) ==
1. Log in to Cypress.
2. Use the SLURM '''squeue''' command to determine your job's node list - in this case an idle interactive session.

{{{
[tulaneID@cypress1 ~]$squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
}}}
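
As a convenience for the next step, you can capture the node name in a shell variable. A minimal sketch, assuming the job ID from the output above; the '''squeue''' option -h suppresses the header and -o "%N" prints only the node list.

{{{
[tulaneID@cypress1 ~]$node=$(squeue -h -o "%N" -j 3272175)
[tulaneID@cypress1 ~]$echo $node
cypress01-059
}}}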
3. For each job node, use the '''ssh''' and '''top''' commands in combination to determine your job's core usage on that node, as shown below. (See '''man top'''.)

Here are the relevant output columns for the '''top''' command.

||='''top''' output column=||=Description=||=Notes=||
||%CPU||percentage of a single core used, per job process||sum(%CPU)/100 = fractional # of cores in use on the node||
||%MEM||percentage of the node's RAM used, per job process||sum(%MEM) = percentage of the node's total RAM in use||

Here's the combined command and result.

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 top -b -n 1 -u $USER
top - 00:34:01 up 75 days,  9:23,  1 user,  load average: 0.06, 0.05, 0.01
Tasks: 730 total,   1 running, 729 sleeping,   0 stopped,   0 zombie
Cpu(s): 52.7%us,  0.7%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66013252k total,  3374764k used, 62638488k free,   137060k buffers
Swap: 12582904k total,        0k used, 12582904k free,  1280248k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31502 tulaneID  20   0 27880 1772  956 R  3.8  0.0  0:00.09 top
30927 tulaneID  20   0  9200 1244 1044 S  0.0  0.0  0:00.00 slurm_script
30953 tulaneID  20   0  4072  544  464 S  0.0  0.0  0:00.00 sleep
30970 tulaneID  20   0  144m 2368 1164 S  0.0  0.0  0:00.00 sshd
30971 tulaneID  20   0 25092 3100 1516 S  0.0  0.0  0:00.06 bash
31501 tulaneID  20   0  144m 2316 1132 S  0.0  0.0  0:00.00 sshd

}}}

4. Next we'll re-run the same combined '''ssh...top...''' command and pipe the output to '''awk''' to sum the values in the %CPU and %MEM columns.

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 3.8
Total %MEM: 0

}}}
5. If the idev session requested 20 cores (default=20), then the core efficiency of the idle idev session is
{{{
(3.8 / 100) / 20 = 0.0019
}}}
This is quite far from the ideal value of 1, indicating poor usage of the 20 requested cores.
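
The steps above can be wrapped in a small shell helper so that any job node can be checked with one command. Below is a minimal sketch (e.g., pasted into your shell or ~/.bashrc) using the same '''ssh'''/'''top'''/'''awk''' pipeline from steps 3 and 4; the function name core_efficiency is hypothetical.

{{{
# Usage:   core_efficiency <node> <requested cores>
# Example: core_efficiency cypress01-059 20
core_efficiency () {
    local node=$1 cores=$2
    # Sum the %CPU column for your processes on the node (step 4 above).
    local pct_cpu=$(ssh "$node" 'top -b -n 1 -u $USER' | awk 'NR > 7 { sum += $9 } END { print sum }')
    # Core efficiency = (fractional cores in use) / (requested cores).
    echo "$pct_cpu" | awk -v c="$cores" '{ printf "core efficiency: %.4f\n", ($1 / 100) / c }'
}
}}}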

== Example 2: a running batch job's core efficiency ==

The following is an example of the core usage for the R sampling code (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]) requesting 16 cores.

{{{
[tulaneID@cypress1 R]$sbatch bootstrap.sh
Submitted batch job 3289740
[tulaneID@cypress1 R]$squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
[tulaneID@cypress1 R]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1556.6
Total %MEM: 3.3

}}}

The resulting core efficiency is
{{{
(1556.6 / 100) / 16 = 0.97
}}}
This is quite close to the ideal value of 1, indicating good usage of the 16 requested cores.
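
To watch how the usage evolves while the job runs, the same pipeline can be re-run periodically. A minimal sketch, assuming the node from the example above and a 60-second interval (interrupt with Ctrl-C):

{{{
[tulaneID@cypress1 R]$while true; do
>     ssh cypress01-009 'top -b -n 1 -u $USER' | \
>     awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
>     END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
>     sleep 60
> done
}}}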