Changes between Version 2 and Version 3 of Workshops/JobParallelism/WhileYourJobIsRunning


Timestamp: 01/19/26 12:50:07
Author: Carl Baribault
Comment: added Example 3 case for unused node

  • Workshops/JobParallelism/WhileYourJobIsRunning

  ||--mem=128||

- == Example 1: an idev job's core efficiency: (actual core usage) / (requested core allocation)
+ == Core efficiency for running jobs: (actual core usage) / (requested core allocation)
+
+ === Example 1: an idev job for an idle interactive session
  1. Log in to Cypress.
  2. Use the SLURM '''squeue''' command to determine your job's node list - in this case an idle interactive session.
     
  This is quite far from the ideal value, 1 - not very good usage of the node's 20 requested cores.

- == Example 2: a running batch job's core efficiency ==
+ === Example 2: a running batch job using R requesting 1 node ===

  The following is an example of the core usage for the R sampling code (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]) requesting 16 cores.
     
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             3289740 workshop7        R tulaneID  R       0:00      1 cypress01-009
+ # note - the result below appeared only after several attempts; earlier runs of top reported just %CPU ~= 100.0
  [tulaneID@cypress1 R]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
  awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \

  Total %CPU: 1556.6
  Total %MEM: 3.3
  }}}

  }}}
  This is quite close to the ideal value, 1 - fairly good usage of the node's 16 requested cores.
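The top/awk pipeline above can be tried without access to a compute node. The sketch below feeds two made-up process lines (the PIDs and figures are illustrative, not taken from the session above) through the same awk script, mimicking the `NR > 7` skip over the header lines that `top -b` prints before its process table; on a real node the input would instead come from `ssh <node> 'top -b -n 1 -u $USER'`.

```shell
# Self-contained sketch of the summing step used in the examples.
# The two process lines are fabricated for illustration only.
{
  # stand-ins for the 7 summary/header lines that 'top -b' prints first
  for i in 1 2 3 4 5 6 7; do echo "top header line $i"; done
  # PID  USER     PR NI VIRT RES  SHR S %CPU %MEM TIME+ COMMAND
  echo "1001 tulaneID 20  0 1.2g 100m 10m R 99.9  0.2  0:10 R"
  echo "1002 tulaneID 20  0 1.2g 100m 10m R 98.7  0.2  0:10 R"
} | awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
# prints: Total %CPU: 198.6
#         Total %MEM: 0.4
```

Fields 9 and 10 of each process row are %CPU and %MEM, which is why the awk script sums `$9` and `$10`.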


=== Example 3: same R code requesting 2 nodes - 1 node unused ===

The following uses the same R sampling code as above (see [wiki:cypress/R#PassingSLURMEnvironmentVariables here]), requesting 16 cores and 2 nodes ('''--nodes=2'''), one of which is left unused.

{{{
[tulaneID@cypress1 R]$diff bootstrap.sh bootstrap2nodes.sh
7c7
< #SBATCH --nodes=1               # Number of Nodes
---
> #SBATCH --nodes=2               # Number of Nodes
[tulaneID@cypress1 R]$sbatch bootstrap2nodes.sh
Submitted batch job 3289779
[tulaneID@cypress1 R]$squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3289779 workshop7        R tulaneID  R       0:03      2 cypress01-[009-010]
[tulaneID@cypress1 R]$ssh cypress01-009 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 1587.6
Total %MEM: 3.3
[tulaneID@cypress1 R]$ssh cypress01-010 'top -b -n 1 -u $USER' | \
awk 'NR > 7 { sum_cpu += $9; sum_mem += $10 } \
END { print "Total %CPU:", sum_cpu; print "Total %MEM:", sum_mem }'
Total %CPU: 13.3
Total %MEM: 0
}}}

The resulting core efficiency is
{{{
[tulaneID@cypress1 R]$bc <<< "scale=3; (1587.6 / 100) / 16"
.992
[tulaneID@cypress1 R]$bc <<< "scale=3; (13.3 / 100) / 16"
.008
}}}
Result:
 * On the first node, cypress01-009, usage is '''nearly ideal''' (.992 ~= 1.0).
 * On the second node, cypress01-010, usage is '''nearly non-existent''' (.008 ~= 0.0).
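Where '''bc''' is not available, the same per-node efficiency figures can be computed with awk; this is just an alternative phrasing of the calculation above, using the measured %CPU totals and the 16 requested cores.

```shell
# Core efficiency = (summed %CPU / 100) / (requested cores),
# with the same numbers as the bc calls above.
awk -v cpu=1587.6 -v cores=16 'BEGIN { printf "%.3f\n", (cpu / 100) / cores }'  # prints 0.992
awk -v cpu=13.3   -v cores=16 'BEGIN { printf "%.3f\n", (cpu / 100) / cores }'  # prints 0.008
```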

== Running R on multiple nodes ==
See also [wiki:/cypress/R#RunningRonmultiplenodes Running R on multiple nodes].