= Workshops/JobParallelism/WhileYourJobIsRunning =
 * For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM.

||!--ntasks=1||
||!--cpus-per-task=10||
||!--mem=128G||

== Current core efficiency for running jobs: (actual core usage) / (requested core allocation) ==

=== Example 1: an idev job for an idle interactive session ===

 1. Start an idev interactive session.

{{{
[tulaneID@cypress1 ~]$idev --partition=centos7
Requesting 1 node(s) task(s) to normal queue of centos7 partition
1 task(s)/node, 20 cpu(s)/task, 0 MIC device(s)/node
Time: 0 (hr) 60 (min).
0d 0h 60m
Submitted batch job 3272175
JOBID=3289908 begin on cypress01-066
--> Creating interactive terminal session (login) on node cypress01-066.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Thu Jan 15 14:42:40 2026 from cypress2.cm.cluster
}}}

'''For the workshop''', request only 2 cores on the workshop partition:

{{{
[tulaneID@cypress1 ~]$idev --partition=workshop7 -c 2
}}}

 2. Log in to Cypress in a separate terminal session, and use the SLURM '''squeue''' command to determine the job's node list - in this case, an idle interactive session.

{{{
[tulaneID@cypress1 ~]$squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3272175 centos7 idv38428 tulaneID R 3:54 1 cypress01-059
}}}

 3. For each job node, use the '''ssh''' and '''top''' commands in combination to determine the job's core usage on that node, as in the following. (See '''man top'''.)

Here are the relevant output columns for the '''top''' command.

||='''top''' command output column=||='''Description'''=||='''Notes'''=||
||%CPU||percentage of cores used per job process (100% per full core used)||sum(%CPU)/100 = fractional # of cores in use on the node||
||%MEM||percentage of RAM used per job process||sum(%MEM) = percentage of node's total RAM in use||

 4. Run '''top''' on the job's node, cypress01-059, via '''ssh''' and sum the '''%CPU''' values of the job's processes - for this idle session the total is 3.8.

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER'
...
}}}

 5. If the idev session requested 20 cores (the default is 20), then the core efficiency of the idle idev session is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 20"
.0019
}}}

This is quite far from the ideal value of 1 - not very good usage of the node's 20 requested cores.

'''For the workshop''' session with only 2 requested cores, the core efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 2"
.0190
}}}

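Rather than adding up the '''%CPU''' values by hand, steps 4 and 5 can be combined in the shell. The following is only a sketch of the same calculation, not part of SLURM or the workshop materials; it assumes, as in step 3, that you can '''ssh''' to the job's node, and it reuses the node name cypress01-059 and the 20 requested cores from the session above.

{{{
# Sum the %CPU column (column 9) of your processes on the node, skipping
# top's 7 header lines, then divide the core usage by the requested cores.
cpu=$(ssh cypress01-059 'top -b -n 1 -u $USER' | awk 'NR > 7 { s += $9 } END { print s + 0 }')
bc <<< "scale=4; ($cpu / 100) / 20"
}}}
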
=== Example 2: a running batch job using R requesting 1 node ===

==== Prepare sample R code ====

For this and the following example, download and make a copy of the sample R code as follows.

{{{
[tulaneID@cypress1 ~]$git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git
[tulaneID@cypress1 ~]$cp -r hpc-workshop/R/* .
[tulaneID@cypress1 ~]$ls
bootstrap.R bootstrap.sh bootstrapWargs.R bootstrapWargs.sh myRscript.R slurmscript1 slurmscript2
}}}

For demonstration purposes, the copied R script, '''bootstrap.R''' (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), has been modified to run 1000000 (1M) samples rather than the original 10000 (10K) samples, as the following diff against the repository copy shows.

{{{
[tulaneID@cypress1 ~]$diff bootstrap.R hpc-workshop/R/bootstrap.R
10c10
< iterations <- 1000000 # Number of iterations to run
---
> iterations <- 10000 # Number of iterations to run
}}}

==== Submit the test batch job ====

The job script, '''bootstrap.sh''', requests 1 node and 16 cores.

{{{
[tulaneID@cypress1 ~]$grep cpus-per-task bootstrap.sh
#SBATCH --cpus-per-task=16   # Number of threads per task (OMP threads)
[tulaneID@cypress1 ~]$sbatch bootstrap.sh
Submitted batch job 3289740
[tulaneID@cypress1 ~]$squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3289740 workshop7 R tulaneID R 0:00 1 cypress01-009
# note - the following result was obtained after several attempts; earlier runs of top showed only %CPU ~= 100.0
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
         END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1556.3
Total %MEM: ...
}}}

==== Calculate core efficiency ====

The resulting core efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2;(1556.3 / 100) / 16"
.97
}}}

This is quite close to the ideal value of 1 - fairly good usage of the node's 16 requested cores.

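As the note in the listing above indicates, a single '''top''' snapshot taken right after submission may show only about 100 %CPU before all of the R threads have started. The following is a minimal sketch of repeating the measurement; it is not part of the workshop materials, the node name cypress01-009 comes from the squeue output above, and the 5 samples at 10-second intervals are arbitrary choices.

{{{
# Take 5 samples of the job's total %CPU on the node, 10 seconds apart.
for i in 1 2 3 4 5; do
    echo "sample $i:"
    ssh cypress01-009 'top -b -n 1 -u $USER' \
        | awk 'NR > 7 { s += $9 } END { print "Total %CPU:", s + 0 }'
    sleep 10
done
}}}
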
=== Example 3: same R code requesting 2 nodes - 1 node unused ===

The job script, '''bootstrap2nodes.sh''', is identical to '''bootstrap.sh''' except that it requests 2 nodes.

{{{
[tulaneID@cypress1 ~]$diff bootstrap.sh bootstrap2nodes.sh
7c7
< #SBATCH --nodes=1   # Number of Nodes
---
> #SBATCH --nodes=2   # Number of Nodes
[tulaneID@cypress1 ~]$sbatch bootstrap2nodes.sh
Submitted batch job 3289779
[tulaneID@cypress1 ~]$squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3289779 workshop7 R tulaneID R 0:03 2 cypress01-[009-010]
# use the following to list job nodes separately
[tulaneID@cypress1 ~]$scontrol show hostname cypress01-[009-010]
cypress01-009
cypress01-010
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
         END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1587.6
Total %MEM: 3.3
[tulaneID@cypress1 ~]$ssh cypress01-010 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
         END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 13.3
Total %MEM: ...
}}}

The resulting core efficiency for each of the two requested nodes is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=3; (1587.6 / 100) / 16"
.992
[tulaneID@cypress1 ~]$bc <<< "scale=3; (13.3 / 100) / 16"
.008
}}}

 * On the first node, cypress01-009, usage is '''very good''' (.992 ~= 1).
 * On the second node, cypress01-010, usage is '''nearly non-existent''' (.008 ~= 0.0).

=== Running R on multiple nodes ===

For information on how to run R code on multiple nodes on a SLURM cluster, see [wiki:/cypress/R#RunningRonmultiplenodes Running R on multiple nodes].
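
=== Checking all of a job's nodes at once ===

The per-node checks in Example 3 can be wrapped in a small script for any of your running jobs. This is only a sketch, not part of SLURM or the workshop materials: the script name '''job_eff.sh''' is made up, it assumes you can '''ssh''' to the job's nodes, and it assumes the job's allocated CPUs are split evenly across its nodes (as with --cpus-per-task in the examples above).

{{{
#!/bin/bash
# job_eff.sh (hypothetical): per-node core usage and efficiency of a running job.
# Usage: ./job_eff.sh <jobid>
jobid=$1
cpus=$(squeue -j "$jobid" -h -o "%C")       # total CPUs allocated to the job
nnodes=$(squeue -j "$jobid" -h -o "%D")     # number of allocated nodes
nodelist=$(squeue -j "$jobid" -h -o "%N")   # compressed node list, e.g. cypress01-[009-010]
per_node=$(( cpus / nnodes ))               # requested cores per node (assumes an even split)

for node in $(scontrol show hostname "$nodelist"); do
    cpu=$(ssh "$node" 'top -b -n 1 -u $USER' \
          | awk 'NR > 7 { s += $9 } END { print s + 0 }')
    eff=$(bc <<< "scale=3; ($cpu / 100) / $per_node")
    echo "$node: total %CPU = $cpu, core efficiency = $eff"
done
}}}

For the job in Example 3, this would report roughly .992 for cypress01-009 and .008 for cypress01-010.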
