= Job Array =
If you haven't done so yet, download the sample files on Cypress with:

{{{svn co file:///home/fuji/repos/workshop ./workshop}}}

To check out the sample files onto your local machine instead (Linux shell):

{{{svn co svn+ssh://USERID@cypress1.tulane.edu/home/fuji/repos/workshop ./workshop}}}


----

If you are running a large number of serial jobs, it is recommended to submit them as a job array to make the best use of your allocated resources.
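
As an illustration, a single array submission replaces a loop of individual {{{sbatch}}} calls. A minimal sketch, assuming a hypothetical per-case script '''run_case.sh''' that takes the case number as its only argument:
{{{
#!/bin/bash
#SBATCH --job-name=serial-array # Job Name
#SBATCH --time=00:10:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-99 # 100 serial tasks, one per array index

# Every array task runs the same command; only the index differs.
sh ./run_case.sh $SLURM_ARRAY_TASK_ID
}}}

The sections below walk through this pattern using the workshop sample files.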

=== Use Array Task ID as a Parameter to the code ===

Get into the '''JobArray1''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py slurmscript
}}}

There is a Python script, '''hello2.py''':
{{{
[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket

if (len(sys.argv) < 2):
    print '%s [taskID]' % (sys.argv[0])
    sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID
}}}

This Python script takes a number from the command line and prints it. You can run it on the login node as,

{{{
[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123
}}}

Look at '''slurmscript''',
{{{
[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID
}}}
==== array ====
This line defines the array of task IDs:
{{{
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9
}}}
The argument can be a list of specific index values, a range of index values, or a range with a step size.
For example,
{{{
#SBATCH --array=0-31 # Submit a job array with index values between 0 and 31
}}}

{{{
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
}}}

{{{
#SBATCH --array=1-7:2 # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)
}}}
Note that the minimum index value is zero.
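
Slurm also accepts a {{{%}}} separator after the index specification to limit how many array tasks may run at the same time (see {{{man sbatch}}} for details). For example, with an arbitrarily chosen range:
{{{
#SBATCH --array=0-31%4 # 32 tasks, but at most 4 running simultaneously
}}}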

==== Job ID and Environment Variables ====
Job arrays will have additional environment variables set:
 * '''SLURM_ARRAY_JOB_ID''' will be set to the first job ID of the array.
 * '''SLURM_ARRAY_TASK_ID''' will be set to the job array index value.
 * '''SLURM_ARRAY_TASK_COUNT''' will be set to the number of tasks in the job array.
 * '''SLURM_ARRAY_TASK_MAX''' will be set to the highest job array index value.
 * '''SLURM_ARRAY_TASK_MIN''' will be set to the lowest job array index value.
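
A minimal sketch (not part of the workshop files) that simply echoes these variables, so each task's log shows its own values:
{{{
#!/bin/bash
#SBATCH --job-name=array-env # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-3 # Array of IDs=0,1,2,3

echo "array job ID : $SLURM_ARRAY_JOB_ID"
echo "task ID      : $SLURM_ARRAY_TASK_ID"
echo "task count   : $SLURM_ARRAY_TASK_COUNT"
echo "max index    : $SLURM_ARRAY_TASK_MAX"
echo "min index    : $SLURM_ARRAY_TASK_MIN"
}}}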

In '''slurmscript''' above, '''SLURM_ARRAY_TASK_ID''' is passed to the Python code as,
{{{
python hello2.py $SLURM_ARRAY_TASK_ID
}}}
so that each task of '''hello2.py''' prints its own task ID.
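
The same pattern works for a parameter sweep: the task ID can index into a list of values defined in the batch script. A minimal sketch, assuming a hypothetical list of integer parameters:
{{{
#!/bin/bash
#SBATCH --job-name=param-sweep # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-3 # one task per entry in PARAMS below

# Hypothetical parameter values; the task ID selects one of them.
PARAMS=(10 20 40 80)

module load anaconda
python hello2.py ${PARAMS[$SLURM_ARRAY_TASK_ID]}
}}}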

==== Submit a Job ====
{{{
[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958
}}}

After the job completes, you will see 10 new output files (it may take a while):
{{{
[fuji@cypress1 JobArray1]$ ls
hello2.py slurm-773958_1.out slurm-773958_3.out slurm-773958_5.out slurm-773958_7.out slurm-773958_9.out
slurm-773958_0.out slurm-773958_2.out slurm-773958_4.out slurm-773958_6.out slurm-773958_8.out slurmscript
}}}

The default log file name is '''slurm-[Job ID]_[Task ID].out'''.
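
The log file name can be customized with the {{{--output}}} option; in the file-name pattern, {{{%A}}} expands to the array's master job ID and {{{%a}}} to the task index (see {{{man sbatch}}} for the full list of patterns). For example:
{{{
#SBATCH --output=hello_%A_%a.out # e.g. hello_773958_9.out
}}}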

{{{
[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
}}}


=== Use Array Task ID to define the script file name ===

Get into the '''JobArray2''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py script01.sh script03.sh script05.sh script07.sh script09.sh slurmscript2
script00.sh script02.sh script04.sh script06.sh script08.sh slurmscript1
[fuji@cypress1 JobArray2]$
}}}

Look at '''slurmscript1''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
}}}

The last line, '''script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh''', constructs the script file name from '''$SLURM_ARRAY_TASK_ID''',
so it will run script00.sh, script01.sh, script02.sh, ... and script09.sh.
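
A related pattern, useful when the per-task work fits on one line, is to use the task ID to pick a line from a plain text file instead of keeping separate scripts. A minimal sketch, assuming a hypothetical file '''params.txt''' with one argument per line:
{{{
#!/bin/bash
#SBATCH --job-name=from-file # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=1-10 # task IDs correspond to line numbers 1..10 of params.txt

# Read line number $SLURM_ARRAY_TASK_ID from params.txt (hypothetical file)
ARG=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

module load anaconda
python hello2.py $ARG
}}}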

Submit the job,
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970
}}}

=== Cancel Jobs in Job Array ===
Look at '''slurmscript2''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh

sleep 60
}}}

There is a '''sleep 60''' at the end, so each task runs for about 60 seconds.
Submit the job and look at the running/queued jobs,
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:04 1 cypress01-117
773980_1 worksh python fuji R 0:04 1 cypress01-117
773980_2 worksh python fuji R 0:04 1 cypress01-117
773980_3 worksh python fuji R 0:04 1 cypress01-117
773980_4 worksh python fuji R 0:04 1 cypress01-117
773980_5 worksh python fuji R 0:04 1 cypress01-117
773980_6 worksh python fuji R 0:04 1 cypress01-117
773980_7 worksh python fuji R 0:04 1 cypress01-117
773980_8 worksh python fuji R 0:04 1 cypress01-117
773980_9 worksh python fuji R 0:04 1 cypress01-117
}}}

To cancel '''773980_1''' only,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:18 1 cypress01-117
773980_2 worksh python fuji R 0:18 1 cypress01-117
773980_3 worksh python fuji R 0:18 1 cypress01-117
773980_4 worksh python fuji R 0:18 1 cypress01-117
773980_5 worksh python fuji R 0:18 1 cypress01-117
773980_6 worksh python fuji R 0:18 1 cypress01-117
773980_7 worksh python fuji R 0:18 1 cypress01-117
773980_8 worksh python fuji R 0:18 1 cypress01-117
773980_9 worksh python fuji R 0:18 1 cypress01-117
}}}

To cancel tasks 5-8,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:30 1 cypress01-117
773980_2 worksh python fuji R 0:30 1 cypress01-117
773980_3 worksh python fuji R 0:30 1 cypress01-117
773980_4 worksh python fuji R 0:30 1 cypress01-117
773980_9 worksh python fuji R 0:30 1 cypress01-117
}}}

To cancel all tasks,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
}}}
