Changes between Version 6 and Version 7 of Workshops/JobParallelism/WhileYourJobIsRunning


Timestamp: 01/19/26 20:47:50
Author: Carl Baribault
Comment: multiple clarifications

 * For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM. (Note that SLURM interprets a bare `--mem=128` as 128MB; use the `G` suffix to request GB.)

 ||!--ntasks=1||
 ||!--cpus-per-task=10||
 ||!--mem=128G||
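 A minimal job-script header using these settings might look like the following (a sketch - the partition name is illustrative, and some of a node's RAM is reserved for the OS, so requesting the full 128GB may be rejected on some systems):

{{{
#!/bin/bash
#SBATCH --ntasks=1              # a single task
#SBATCH --cpus-per-task=10      # 10 cores for that task
#SBATCH --mem=128G              # all of the RAM on a 128GB node
#SBATCH --partition=centos7     # illustrative partition name
}}}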
== Current core efficiency for running jobs: (actual core usage) / (requested core allocation) ==
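 For instance, a job actually using 8 cores' worth of CPU while holding 16 requested cores runs at a core efficiency of 0.5 (the numbers here are illustrative, not from a real job):

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2; 8 / 16"
.50
}}}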
=== Example 1: an idev job for an idle interactive session ===
 1. Start an idev interactive session.

{{{
[tulaneID@cypress1 ~]$idev --partition=centos7
Requesting 1 node(s)  task(s) to normal queue of centos7 partition
1 task(s)/node, 20 cpu(s)/task, 0 MIC device(s)/node
Time: 0 (hr) 60 (min).
0d 0h 60m
Submitted batch job 3272175
JOBID=3289908 begin on cypress01-066
--> Creating interactive terminal session (login) on node cypress01-066.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Thu Jan 15 14:42:40 2026 from cypress2.cm.cluster
}}}
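 The assigned host list file reported by '''idev''' above can also be inspected from inside the session (a sketch - the file is assumed to contain one hostname per line):

{{{
[tulaneID@cypress01-066 ~]$cat /tmp/idev_nodes_file_tulaneID
cypress01-066
}}}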
 '''For workshop''': request only 2 cores.

{{{
[tulaneID@cypress1 ~]$idev --partition=workshop7 -c 2
}}}
 2. Log in to Cypress in a separate terminal session, and use the SLURM '''squeue''' command to determine the job's node list - in this case for an idle interactive session.

{{{
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
}}}
 3. For each job node, use the '''ssh''' and '''top''' commands in combination to determine the job's core usage on that node, as shown below. (See '''man top'''.)

 Here are the relevant output columns for the '''top''' command.

 ||='''top''' command output column=||='''Description'''=||='''Notes'''=||
 ||%CPU||percentage of cores used per job process (100% per full core used)||sum(%CPU)/100 = fractional # of cores in use on the node||
 ||%MEM||percentage of RAM used per job process||sum(%MEM) = percentage of the node's total RAM in use||
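 To convert %MEM into GB, check the node's total RAM, for example with the standard '''free''' command (a sketch - the node name is taken from the squeue output above):

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 free -h
}}}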
 4. Sum the '''%CPU''' column for your processes on the node - for example by piping the batch-mode '''top''' output through '''awk''' (NR > 7 skips top's summary header lines):

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9 } END { print "Total %CPU:", sum_cpu }'
Total %CPU: 3.8
}}}
 5. If the idev session requested 20 cores (default=20), then the core efficiency of the idle idev session is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 20"
.0019
}}}

 This is quite far from the ideal value of 1 - not very good usage of the node's 20 requested cores.
 '''For workshop''': with only 2 requested cores, the corresponding efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 2"
.0190
}}}
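 These steps can be wrapped in a small shell helper (a sketch built only from the commands above; '''core_eff''' is a hypothetical name, not an installed utility):

{{{
# usage: core_eff <node> <requested_cores>
core_eff() {
    local node=$1 cores=$2
    # sum top's %CPU column for your processes on the given node
    local cpu=$(ssh "$node" 'top -b -n 1 -u $USER' | awk 'NR > 7 { s += $9 } END { print s }')
    bc <<< "scale=4; ($cpu / 100) / $cores"
}
core_eff cypress01-059 20    # prints .0019 for the idle session above
}}}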
=== Example 2: a running batch job using R requesting 1 node ===

==== Prepare sample R code ====
 For this and the following example, download and make a copy of the sample R code via the following.

{{{
[tulaneID@cypress1 ~]$git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git
[tulaneID@cypress1 ~]$cp -r hpc-workshop/R/* .
[tulaneID@cypress1 ~]$ls
bootstrap.R  bootstrap.sh  bootstrapWargs.R  bootstrapWargs.sh  myRscript.R  slurmscript1  slurmscript2
}}}
 For demonstration purposes, the downloaded and copied R script, '''bootstrap.R''' (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), has been modified to run 1000000 (1M) iterations rather than the original 10000 (10K).

{{{
[tulaneID@cypress1 ~]$diff bootstrap.R hpc-workshop/R/bootstrap.R
10c10
< iterations <- 1000000# Number of iterations to run
---
> iterations <- 10000# Number of iterations to run
}}}
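 One way to apply the same edit yourself (a sketch - the in-place '''-i''' flag as used here assumes GNU sed):

{{{
[tulaneID@cypress1 ~]$sed -i 's/iterations <- 10000/iterations <- 1000000/' bootstrap.R
}}}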
==== Submit the test batch job ====
 The job script, '''bootstrap.sh''', requests 1 node and 16 cores.

{{{
[tulaneID@cypress1 ~]$grep cpus-per-task bootstrap.sh
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)
[tulaneID@cypress1 ~]$sbatch bootstrap.sh
Submitted batch job 3289740
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
# note: this result appeared only after several attempts - earlier runs of top reported only %CPU ~= 100.0
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1556.3
}}}
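 For reference, the resource-request lines of '''bootstrap.sh''' presumably look like the following (a sketch - only the --cpus-per-task line is confirmed by the grep above and the --nodes line by the diff in Example 3; the rest is illustrative):

{{{
#!/bin/bash
#SBATCH --job-name=R            # job name as shown by squeue
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)
}}}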
==== Calculate core efficiency ====

 The resulting core efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2;(1556.3 / 100) / 16"
.97
}}}

 This is quite close to the ideal value of 1 - fairly good usage of the node's 16 requested cores.
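 After the job completes, SLURM's optional '''seff''' utility (from the slurm contribs package, if installed on your cluster) reports a comparable CPU-efficiency figure directly from the accounting data:

{{{
[tulaneID@cypress1 ~]$seff 3289740
}}}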
=== Example 3: same R code requesting 2 nodes - 1 node unused ===

 The job script, '''bootstrap2nodes.sh''', differs from '''bootstrap.sh''' only in requesting 2 nodes, as the following diff shows.

{{{
[tulaneID@cypress1 ~]$diff bootstrap.sh bootstrap2nodes.sh
7c7
< #SBATCH --nodes=1               # Number of Nodes
---
> #SBATCH --nodes=2               # Number of Nodes
[tulaneID@cypress1 ~]$sbatch bootstrap2nodes.sh
Submitted batch job 3289779
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289779 workshop7        R tulaneID  R       0:03      2 cypress01-[009-010]
# use the following to list job nodes separately
[tulaneID@cypress1 ~]$scontrol show hostname cypress01-[009-010]
cypress01-009
cypress01-010
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1587.6
Total %MEM: 3.3
[tulaneID@cypress1 ~]$ssh cypress01-010 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 13.3
}}}
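 For jobs on many nodes, the per-node check can be wrapped in a loop over the expanded host list (a sketch combining the '''scontrol''' and '''ssh'''/'''top'''/'''awk''' commands above):

{{{
for node in $(scontrol show hostname 'cypress01-[009-010]'); do
    echo "== $node =="
    ssh "$node" 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { cpu += $9; mem += $10 } \
    END { print "Total %CPU:", cpu; print "Total %MEM:", mem }'
done
}}}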
 The resulting core efficiency for each of the two requested nodes is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=3; (1587.6 / 100) / 16"
.992
[tulaneID@cypress1 ~]$bc <<< "scale=3; (13.3 / 100) / 16"
.008
}}}

 * On the first node, cypress01-009, usage is '''nearly ideal''' (.992 ~= 1.0).
 * On the second node, cypress01-010, usage is '''nearly non-existent''' (.008 ~= 0.0).
=== Running R on multiple nodes ===

 For information on how to run R code on multiple nodes on a SLURM cluster, see [wiki:/cypress/R#RunningRonmultiplenodes Running R on multiple nodes].