Changes between Version 6 and Version 7 of Workshops/JobParallelism/WhileYourJobIsRunning


Timestamp: 01/19/26 20:47:50
Author: Carl Baribault
Comment: multiple clarifications

 * For example, your job may require only 10 cores but all of the RAM available on a node with 128GB of RAM. (Note that SLURM interprets a bare `--mem=128` as 128MB; use the `G` suffix to request GB.)

 ||!--ntasks=1||
 ||!--cpus-per-task=10||
 ||!--mem=128G||
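 A minimal job-script header using these settings might look like the following (a sketch - the partition name is illustrative, and some of a node's RAM is reserved for the OS, so requesting the full 128GB may be rejected on some systems):

{{{
#!/bin/bash
#SBATCH --ntasks=1              # a single task
#SBATCH --cpus-per-task=10      # 10 cores for that task
#SBATCH --mem=128G              # all of the RAM on a 128GB node
#SBATCH --partition=centos7     # illustrative partition name
}}}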
== Current core efficiency for running jobs: (actual core usage) / (requested core allocation) ==
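 For instance, a job actually using 8 cores' worth of CPU while holding 16 requested cores runs at a core efficiency of 0.5 (the numbers here are illustrative, not from a real job):

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2; 8 / 16"
.50
}}}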
=== Example 1: an idev job for an idle interactive session ===
 1. Start an idev interactive session.

{{{
[tulaneID@cypress1 ~]$idev --partition=centos7
Requesting 1 node(s)  task(s) to normal queue of centos7 partition
1 task(s)/node, 20 cpu(s)/task, 0 MIC device(s)/node
Time: 0 (hr) 60 (min).
0d 0h 60m
Submitted batch job 3272175
JOBID=3289908 begin on cypress01-066
--> Creating interactive terminal session (login) on node cypress01-066.
--> You have 0 (hr) 60 (min).
--> Assigned Host List : /tmp/idev_nodes_file_tulaneID
Last login: Thu Jan 15 14:42:40 2026 from cypress2.cm.cluster
}}}
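 The assigned host list file reported by '''idev''' above can also be inspected from inside the session (a sketch - the file is assumed to contain one hostname per line):

{{{
[tulaneID@cypress01-066 ~]$cat /tmp/idev_nodes_file_tulaneID
cypress01-066
}}}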
 '''For workshop''': request only 2 cores.

{{{
[tulaneID@cypress1 ~]$idev --partition=workshop7 -c 2
}}}
 2. Log in to Cypress in a separate terminal session, and use the SLURM '''squeue''' command to determine the job's node list - in this case for an idle interactive session.

{{{
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3272175   centos7 idv38428 tulaneID  R       3:54      1 cypress01-059
}}}
 3. For each job node, use the '''ssh''' and '''top''' commands in combination to determine the job's core usage on that node, as shown below. (See '''man top'''.)

 Here are the relevant output columns for the '''top''' command.

 ||='''top''' command output column=||='''Description'''=||='''Notes'''=||
 ||%CPU||percentage of cores used per job process (100% per full core used)||sum(%CPU)/100 = fractional # of cores in use on the node||
 ||%MEM||percentage of RAM used per job process||sum(%MEM) = percentage of the node's total RAM in use||
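 To convert %MEM into GB, check the node's total RAM, for example with the standard '''free''' command (a sketch - the node name is taken from the squeue output above):

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 free -h
}}}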
 4. Sum the '''%CPU''' column for your processes on the node - for example by piping the batch-mode '''top''' output through '''awk''' (NR > 7 skips top's summary header lines):

{{{
[tulaneID@cypress1 ~]$ssh cypress01-059 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9 } END { print "Total %CPU:", sum_cpu }'
Total %CPU: 3.8
}}}
 5. If the idev session requested 20 cores (default=20), then the core efficiency of the idle idev session is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 20"
.0019
}}}

 This is quite far from the ideal value of 1 - not very good usage of the node's 20 requested cores.
 '''For workshop''': with only 2 requested cores, the corresponding efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=4;(3.8 / 100) / 2"
.0190
}}}
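 These steps can be wrapped in a small shell helper (a sketch built only from the commands above; '''core_eff''' is a hypothetical name, not an installed utility):

{{{
# usage: core_eff <node> <requested_cores>
core_eff() {
    local node=$1 cores=$2
    # sum top's %CPU column for your processes on the given node
    local cpu=$(ssh "$node" 'top -b -n 1 -u $USER' | awk 'NR > 7 { s += $9 } END { print s }')
    bc <<< "scale=4; ($cpu / 100) / $cores"
}
core_eff cypress01-059 20    # prints .0019 for the idle session above
}}}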
=== Example 2: a running batch job using R requesting 1 node ===

==== Prepare sample R code ====
 For this and the following example, download and make a copy of the sample R code via the following.

{{{
[tulaneID@cypress1 ~]$git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git
[tulaneID@cypress1 ~]$cp -r hpc-workshop/R/* .
[tulaneID@cypress1 ~]$ls
bootstrap.R  bootstrap.sh  bootstrapWargs.R  bootstrapWargs.sh  myRscript.R  slurmscript1  slurmscript2
}}}
 For demonstration purposes, the downloaded and copied R script, '''bootstrap.R''' (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), has been modified to run 1000000 (1M) iterations rather than the original 10000 (10K).

{{{
[tulaneID@cypress1 ~]$diff bootstrap.R hpc-workshop/R/bootstrap.R
10c10
< iterations <- 1000000# Number of iterations to run
---
> iterations <- 10000# Number of iterations to run
}}}
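 One way to apply the same edit yourself (a sketch - the in-place '''-i''' flag as used here assumes GNU sed):

{{{
[tulaneID@cypress1 ~]$sed -i 's/iterations <- 10000/iterations <- 1000000/' bootstrap.R
}}}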
==== Submit the test batch job ====
 The job script, '''bootstrap.sh''', requests 1 node and 16 cores.

{{{
[tulaneID@cypress1 ~]$grep cpus-per-task bootstrap.sh
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)
[tulaneID@cypress1 ~]$sbatch bootstrap.sh
Submitted batch job 3289740
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
# note: this result appeared only after several attempts - earlier runs of top reported only %CPU ~= 100.0
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1556.3
}}}
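 For reference, the resource-request lines of '''bootstrap.sh''' presumably look like the following (a sketch - only the --cpus-per-task line is confirmed by the grep above and the --nodes line by the diff in Example 3; the rest is illustrative):

{{{
#!/bin/bash
#SBATCH --job-name=R            # job name as shown by squeue
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --cpus-per-task=16      # Number of threads per task (OMP threads)
}}}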
==== Calculate core efficiency ====

 The resulting core efficiency is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=2;(1556.3 / 100) / 16"
.97
}}}

 This is quite close to the ideal value of 1 - fairly good usage of the node's 16 requested cores.
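 After the job completes, SLURM's optional '''seff''' utility (from the slurm contribs package, if installed on your cluster) reports a comparable CPU-efficiency figure directly from the accounting data:

{{{
[tulaneID@cypress1 ~]$seff 3289740
}}}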
=== Example 3: same R code requesting 2 nodes - 1 node unused ===

 The job script, '''bootstrap2nodes.sh''', differs from '''bootstrap.sh''' only in requesting 2 nodes, as the following diff shows.

{{{
[tulaneID@cypress1 ~]$diff bootstrap.sh bootstrap2nodes.sh
7c7
< #SBATCH --nodes=1               # Number of Nodes
---
> #SBATCH --nodes=2               # Number of Nodes
[tulaneID@cypress1 ~]$sbatch bootstrap2nodes.sh
Submitted batch job 3289779
[tulaneID@cypress1 ~]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289779 workshop7        R tulaneID  R       0:03      2 cypress01-[009-010]
# use the following to list job nodes separately
[tulaneID@cypress1 ~]$scontrol show hostname cypress01-[009-010]
cypress01-009
cypress01-010
[tulaneID@cypress1 ~]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1587.6
Total %MEM: 3.3
[tulaneID@cypress1 ~]$ssh cypress01-010 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 13.3
}}}
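 For jobs on many nodes, the per-node check can be wrapped in a loop over the expanded host list (a sketch combining the '''scontrol''' and '''ssh'''/'''top'''/'''awk''' commands above):

{{{
for node in $(scontrol show hostname 'cypress01-[009-010]'); do
    echo "== $node =="
    ssh "$node" 'top -b -n 1 -u $USER' | \
    awk 'NR > 7 { cpu += $9; mem += $10 } \
    END { print "Total %CPU:", cpu; print "Total %MEM:", mem }'
done
}}}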
 The resulting core efficiency for each of the two requested nodes is

{{{
[tulaneID@cypress1 ~]$bc <<< "scale=3; (1587.6 / 100) / 16"
.992
[tulaneID@cypress1 ~]$bc <<< "scale=3; (13.3 / 100) / 16"
.008
}}}

 * On the first node, cypress01-009, usage is '''nearly ideal''' (.992 ~= 1.0).
 * On the second node, cypress01-010, usage is '''nearly non-existent''' (.008 ~= 0.0).
=== Running R on multiple nodes ===

 For information on how to run R code on multiple nodes on a SLURM cluster, see [wiki:/cypress/R#RunningRonmultiplenodes Running R on multiple nodes].