[[PageOutline]]

= While your job is running - determining current core efficiency =

== Assumptions ==

* Request sufficient processor resources.

For running jobs, let's assume that you've requested sufficient processor resources via the following. (See '''man sbatch'''.)

||='''Description''' =||='''SBATCH options''' =||='''Default value''' =||='''Maximum value''' =||
|| # of nodes || -N, --nodes || Subject to -n and -c options || See [wiki:cypress/about#SLURMresourcemanager SLURM (resource manager)] ||
|| # of tasks || -n, --ntasks || 1 task per node || 20 * (# of nodes) ||
|| # of cores/CPUs/processors per task || -c, --cpus-per-task || 1 core per task || 20 ||
|| total Random Access Memory (RAM) || --mem || 3200MB * (# of requested cores) || 64/128/256GB ||
|| RAM per core || --mem-per-cpu || 3200MB || (same as above) ||

* Your job's memory requirement may be greater than (# of requested cores) * (total node RAM)/20, i.e., more than a proportional share of the node's RAM.
* For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM, in which case you could request the following (see the sketch below):

||!--ntasks=1||
||!--cpus-per-task=10||
||!--mem=128G||
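To put these options in context, here is a minimal sketch of a job script for that 10-core, 128GB example. The job name, time limit, and program are placeholders, and the partition is simply the one used in the examples below; adjust all of them for your own work (see '''man sbatch''').
{{{
#!/bin/bash
#SBATCH --job-name=big_memory_job   # placeholder job name
#SBATCH --partition=centos7         # partition used in the examples below
#SBATCH --nodes=1                   # 1 node
#SBATCH --ntasks=1                  # 1 task
#SBATCH --cpus-per-task=10          # 10 cores for that task
#SBATCH --mem=128G                  # all of the RAM on a 128GB node
#SBATCH --time=01:00:00             # placeholder wall-time limit

./your_program                      # placeholder for the actual workload
}}}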
== Current core efficiency for running jobs: (actual core usage) / (requested core allocation) ==

=== Example 1: an idev job for an idle interactive session ===

1. Start an idev interactive session.
{{{
[tulaneID@cypress1 ~]$idev --partition=centos7
Requesting 1 node(s) task(s) to normal queue of centos7 partition
1 task(s)/node, 20 cpu(s)/task, 0 MIC device(s)/node
Time: 0 (hr) 60 (min). 0d 0h 60m
Submitted batch job 3272175
JOBID=3289908 begin on cypress01-066
--> Creating interactive terminal session (login) on node cypress01-066.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Thu Jan 15 14:42:40 2026 from cypress2.cm.cluster
}}}
'''For the workshop''', using only 2 requested cores:
{{{
[tulaneID@cypress1 ~]$idev --partition=workshop7 -c 2
}}}
2. Log in to Cypress in a separate terminal session, and use the SLURM '''squeue''' command to determine the job's node list - in this case for an idle interactive session.
{{{
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
}}}
3. For each job node, use the '''ssh''' and '''top''' commands in combination to determine the job's core usage on that node. (See '''man top'''.)

Here are the relevant output columns of the '''top''' command.

||='''top''' output column =||='''Description''' =||='''Notes''' =||
|| %CPU || percentage of a core used per job process (100% per full core used) || sum(%CPU)/100 = fractional # of cores in use on the node ||
|| %MEM || percentage of RAM used per job process || sum(%MEM) = percentage of the node's total RAM in use ||

Here's the combined command and result.
{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 top -b -n 1 -u $USER
top - 00:34:01 up 75 days,  9:23,  1 user,  load average: 0.06, 0.05, 0.01
Tasks: 730 total,   1 running, 729 sleeping,   0 stopped,   0 zombie
Cpu(s): 52.7%us,  0.7%sy,  0.0%ni, 46.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66013252k total,  3374764k used, 62638488k free,   137060k buffers
Swap: 12582904k total,        0k used, 12582904k free,  1280248k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31502 tulaneID  20   0 27880 1772  956 R  3.8  0.0   0:00.09 top
30927 tulaneID  20   0  9200 1244 1044 S  0.0  0.0   0:00.00 slurm_script
30953 tulaneID  20   0  4072  544  464 S  0.0  0.0   0:00.00 sleep
30970 tulaneID  20   0  144m 2368 1164 S  0.0  0.0   0:00.00 sshd
30971 tulaneID  20   0 25092 3100 1516 S  0.0  0.0   0:00.06 bash
31501 tulaneID  20   0  144m 2316 1132 S  0.0  0.0   0:00.00 sshd
}}}
4. Next, re-run the same combined '''ssh ... top ...''' command and pipe the output to '''awk''' in order to sum the values in the %CPU and %MEM columns.
{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 3.8
Total %MEM: 0
}}}
5. If the idev session requested 20 cores (the default), then the core efficiency of the idle idev session is
{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 20"
.0019
}}}
This is quite far from the ideal value of 1 - not very good usage of the node's 20 requested cores.

'''For the workshop''', using only 2 requested cores:
{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 2"
.0190
}}}
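Steps 2 through 5 can be wrapped into a small helper script for convenience. The following is a sketch only, not an installed Cypress tool: it reuses the same '''ssh''', '''top''', '''awk''', and '''bc''' commands shown above, and its name and two arguments (a node from the NODELIST column of '''squeue''', and the number of cores you requested on that node) are illustrative.
{{{
#!/bin/bash
# core_efficiency.sh - sketch: report (actual core usage) / (requested core allocation)
# Usage: ./core_efficiency.sh <node> <requested cores>, e.g. ./core_efficiency.sh cypress01-059 20

node=$1     # a node taken from squeue's NODELIST column
cores=$2    # the number of cores you requested on that node

# Sum the %CPU and %MEM columns of top for your processes on the node.
read sum_cpu sum_mem < <(ssh "$node" 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } END { print sum_cpu, sum_mem }')

echo "Node $node: total %CPU = $sum_cpu, total %MEM = $sum_mem"
echo "Core efficiency: $(bc <<< "scale=4; ($sum_cpu / 100) / $cores")"
}}}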
=== Example 2: a running batch job using R requesting 1 node ===

==== Prepare sample R code ====

For this and the following example, download and make a copy of the sample R code via the following.
{{{
[tulaneID@cypress1 ~]$git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git
[tulaneID@cypress1 ~]$cp -r hpc-workshop/R/* .
[tulaneID@cypress1 ~]$ls
bootstrap.R  bootstrap.sh  bootstrapWargs.R  bootstrapWargs.sh  myRscript.R  slurmscript1  slurmscript2
}}}
For demonstration purposes, the downloaded and copied R script, '''bootstrap.R''' (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), has been modified to run 1000000 (1M) samples rather than the original 10000 (10K) samples.
{{{
[tulaneID@cypress1 ~]$diff bootstrap.R hpc-workshop/R/bootstrap.R
10c10
< iterations <- 1000000# Number of iterations to run
---
> iterations <- 10000# Number of iterations to run
}}}

==== Submit the test batch job ====

The job script, '''bootstrap.sh''', requests 1 node and 16 cores.
{{{
[tulaneID@cypress1 ~]$grep cpus-per-task bootstrap.sh
#SBATCH --cpus-per-task=16   # Number of threads per task (OMP threads)

[tulaneID@cypress1 ~]$sbatch bootstrap.sh
Submitted batch job 3289740

[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009

# note: this result came after several attempts; right after submission, top showed only %CPU ~= 100.0
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1556.6
Total %MEM: 3.3
}}}

==== Calculate core efficiency ====

The resulting core efficiency is
{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2;(1556.6 / 100) / 16"
.97
}}}
This is quite close to the ideal value of 1 - fairly good usage of the node's 16 requested cores.

=== Example 3: same R code requesting 2 nodes - 1 node unused ===

The following uses the same R sampling code as above (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), requesting 16 cores and 2 nodes (--nodes=2), one of which is unused.
{{{
[tulaneID@cypress1 ~]$diff bootstrap.sh bootstrap2nodes.sh
7c7
< #SBATCH --nodes=1   # Number of Nodes
---
> #SBATCH --nodes=2   # Number of Nodes

[tulaneID@cypress1 ~]$sbatch bootstrap2nodes.sh
Submitted batch job 3289779

[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289779 workshop7        R tulaneID  R       0:03      2 cypress01-[009-010]

# use the following to list the job's nodes separately
[tulaneID@cypress1 ~]$scontrol show hostname cypress01-[009-010]
cypress01-009
cypress01-010

[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1587.6
Total %MEM: 3.3

[tulaneID@cypress1 ~]$ssh cypress01-010 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 13.3
Total %MEM: 0
}}}
The resulting core efficiency for each of the two requested nodes is
{{{
[tulaneID@cypress1 ~]$bc <<< "scale=3; (1587.6 / 100) / 16"
.992
[tulaneID@cypress1 ~]$bc <<< "scale=3; (13.3 / 100) / 16"
.008
}}}
Result:
* On the first node, cypress01-009, usage is '''nearly ideal''' (.992 ~= 1.0).
* On the second node, cypress01-010, usage is '''nearly non-existent''' (.008 ~= 0.0).

=== Running R on multiple nodes ===

For information on how to run R code on multiple nodes on a SLURM cluster, see [wiki:/cypress/R#RunningRonmultiplenodes Running R on multiple nodes].
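Finally, the per-node checks in Examples 1 through 3 can be looped over every node of a job. The sketch below is illustrative rather than an installed tool: it assumes you pass the job ID and the per-node core request as arguments, and it combines '''squeue''' (its standard -j, -h, and -o options; see '''man squeue'''), '''scontrol show hostname''', and the same '''ssh'''/'''top'''/'''awk'''/'''bc''' pipeline used above.
{{{
#!/bin/bash
# job_core_efficiency.sh - sketch: check every node of a running job
# Usage: ./job_core_efficiency.sh <JOBID> <requested cores per node>

jobid=$1
cores=$2    # e.g. the --cpus-per-task value from the job script

# Get the job's compressed node list (e.g. cypress01-[009-010]) from squeue.
nodelist=$(squeue -j "$jobid" -h -o %N)

# Expand it into one hostname per line and check each node in turn.
for node in $(scontrol show hostname "$nodelist"); do
    sum_cpu=$(ssh "$node" 'top -b -n 1 -u $USER' | awk 'NR > 7 { s += $9 } END { print s }')
    echo "$node: total %CPU = $sum_cpu, core efficiency = $(bc <<< "scale=3; ($sum_cpu / 100) / $cores")"
done
}}}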