= Job Array =
If you haven't done so yet, download the sample files by:

{{{svn co file:///home/fuji/repos/workshop ./workshop}}}

To check out the sample files onto your local machine (Linux shell):

{{{svn co svn+ssh://USERID@cypress1.tulane.edu/home/fuji/repos/workshop ./workshop}}}


----

If you are running a large number of serial jobs, it is recommended to submit them as a job array to make the best use of your allocated resources.

=== Use Array Task ID as a Parameter to the code ===

Get into the '''JobArray1''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py  slurmscript
}}}

There is a Python script, '''hello2.py''':
{{{
[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket

if (len(sys.argv) < 2):
        print '%s [taskID]' % (sys.argv[0])
        sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID
}}}

This Python script takes a number from the command line and prints it. You can run it on the login node as,

{{{
[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123
}}}

Look at '''slurmscript''',
{{{
[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID
}}}

==== array ====
This line
{{{
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9
}}}
submits the job as an array of ten tasks with IDs 0 through 9. The argument can be specific array index values, a range of index values, or a range with a step size.
For example,
{{{
#SBATCH --array=0-31  # Submit a job array with index values between 0 and 31
}}}

{{{
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
}}}

{{{
#SBATCH --array=1-7:2 # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)
}}}
Note that the minimum index value is zero.
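Slurm can also limit how many of the array tasks run at the same time, using a "%" separator after the index range. The line below is only an illustration (it is not used in the workshop scripts, and the limit of 4 is an arbitrary choice):
{{{
#SBATCH --array=0-31%4          # 32 tasks, but at most 4 running at any one time
}}}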

==== Job ID and Environment Variables ====
Job arrays will have additional environment variables set.
'''SLURM_ARRAY_JOB_ID''' will be set to the first job ID of the array.
'''SLURM_ARRAY_TASK_ID''' will be set to the job array index value.
'''SLURM_ARRAY_TASK_COUNT''' will be set to the number of tasks in the job array.
'''SLURM_ARRAY_TASK_MAX''' will be set to the highest job array index value.
'''SLURM_ARRAY_TASK_MIN''' will be set to the lowest job array index value.

In '''slurmscript''' above, SLURM_ARRAY_TASK_ID is given to the Python code as,
{{{
python hello2.py $SLURM_ARRAY_TASK_ID
}}}
so that '''hello2.py''' prints the task ID.

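These variables are useful when each task must work on its own slice of a larger problem. The job script below is only a sketch (it is not one of the workshop files; the 1000-item workload and the job name are made up), showing how '''SLURM_ARRAY_TASK_ID''' and '''SLURM_ARRAY_TASK_COUNT''' could be combined to give each task a distinct range of items:
{{{
#!/bin/bash
#SBATCH --job-name=chunks       # Job Name (hypothetical example)
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

NITEMS=1000                                   # total number of items (made-up value)
CHUNK=$(( NITEMS / SLURM_ARRAY_TASK_COUNT ))  # items per task (assumes an even split)
START=$(( SLURM_ARRAY_TASK_ID * CHUNK ))      # first item handled by this task
END=$(( START + CHUNK - 1 ))                  # last item handled by this task
echo "Task $SLURM_ARRAY_TASK_ID handles items $START to $END"
}}}
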
==== Submit a Job ====
{{{
[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958
}}}

After the job completes, you will see 10 new files (it may take a while):
{{{
[fuji@cypress1 JobArray1]$ ls
hello2.py           slurm-773958_1.out  slurm-773958_3.out  slurm-773958_5.out  slurm-773958_7.out  slurm-773958_9.out
slurm-773958_0.out  slurm-773958_2.out  slurm-773958_4.out  slurm-773958_6.out  slurm-773958_8.out  slurmscript
}}}

The default log file name is '''slurm-[Job ID]_[Task ID].out'''.

{{{
[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
}}}

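The log file name can be changed with the '''--output''' option in the batch script; in the file-name pattern, '''%A''' is replaced by the array's master job ID and '''%a''' by the task ID. The line below is an illustration only, not part of the workshop scripts:
{{{
#SBATCH --output=hello_%A_%a.out   # e.g. hello_773958_9.out (name chosen for this example)
}}}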

=== Use Array Task ID to define the script file name ===

Get into the '''JobArray2''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py    script01.sh  script03.sh  script05.sh  script07.sh  script09.sh   slurmscript2
script00.sh  script02.sh  script04.sh  script06.sh  script08.sh  slurmscript1
[fuji@cypress1 JobArray2]$
}}}

Look at '''slurmscript1''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
}}}

The last line, '''script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh''', builds the script file name from '''$SLURM_ARRAY_TASK_ID''',
so it will run script00.sh, script01.sh, script02.sh, ..., and script09.sh.

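To see how the zero-padded file name is built, you can try the '''printf''' expression by itself in a shell; the task ID below is just an example value:
{{{
TASKID=3                                   # example task ID
echo ./script$(printf "%02d" $TASKID).sh   # prints ./script03.sh
}}}
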
Submit the job,
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970
}}}

=== Cancel Jobs in a Job Array ===
Look at '''slurmscript2''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh

sleep 60
}}}

There is a '''sleep 60''' at the end, so each task runs for at least 60 seconds.
Submit the job and look at the jobs running/queued.
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_1 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_5 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_6 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_7 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_8 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:04  1 cypress01-117
}}}
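Each array task appears in the queue with its own '''[Job ID]_[Task ID]''' entry. On systems where squeue collapses pending array tasks into a single line, the '''-r''' ('''--array''') option lists one line per task (a general squeue option, not something specific to this workshop):
{{{
squeue -r -u fuji    # one line per array task
}}}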

To cancel '''773980_1''' only,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_5 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_6 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_7 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_8 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:18  1 cypress01-117
}}}

To cancel tasks 5-8,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:30  1 cypress01-117
}}}

To cancel all tasks,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
}}}