= Job Array =
If you haven't done so yet, download the sample files (on Cypress) by:

{{{svn co file:///home/fuji/repos/workshop ./workshop}}}

To check out the sample files onto your local machine (Linux shell):

{{{svn co svn+ssh://USERID@cypress1.tulane.edu/home/fuji/repos/workshop ./workshop}}}


----

If you are running a large number of serial jobs, it is recommended to submit them as a job array to make the best use of your allocated resources.
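For example, instead of submitting each serial job separately in a shell loop like the hypothetical sketch below, a job array achieves the same result with a single submission:
{{{
# Hypothetical alternative WITHOUT a job array: 10 separate submissions
for i in $(seq 0 9); do
    sbatch --job-name=python --time=00:01:00 --wrap="python hello2.py $i"
done
}}}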

=== Use Array Task ID as a Parameter to the Code ===

Go to the '''JobArray1''' directory under '''workshop''':
{{{
[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py slurmscript
}}}

There is a Python script in it:
{{{
[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket

if (len(sys.argv) < 2):
    print '%s [taskID]' % (sys.argv[0])
    sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID
}}}

This Python script takes a task ID from the command line and prints it along with a greeting, the current time, and the hostname (note that it uses Python 2 print syntax). You can run it on the login node:

{{{
[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123
}}}

Look at '''slurmscript''':
{{{
[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID
}}}

==== The --array Option ====
This line defines the array of task IDs:
{{{
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9
}}}
The argument can be specific array index values, a range of index values, or a range with a step size.
For example,
{{{
#SBATCH --array=0-31 # Submit a job array with index values between 0 and 31
}}}

{{{
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
}}}

{{{
#SBATCH --array=1-7:2 # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)
}}}
Note that the minimum index value is zero.
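Slurm also accepts a '''%''' separator in the array specification to limit how many tasks run at the same time; for example (not part of the workshop files):
{{{
#SBATCH --array=0-31%4 # Submit 32 tasks, but allow at most 4 of them to run simultaneously
}}}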

==== Job ID and Environment Variables ====
Job arrays have several additional environment variables set:
 * '''SLURM_ARRAY_JOB_ID''' is set to the first job ID of the array.
 * '''SLURM_ARRAY_TASK_ID''' is set to the job array index value.
 * '''SLURM_ARRAY_TASK_COUNT''' is set to the number of tasks in the job array.
 * '''SLURM_ARRAY_TASK_MAX''' is set to the highest job array index value.
 * '''SLURM_ARRAY_TASK_MIN''' is set to the lowest job array index value.

In '''slurmscript''' above, '''SLURM_ARRAY_TASK_ID''' is passed to the Python code:
{{{
python hello2.py $SLURM_ARRAY_TASK_ID
}}}
so that '''hello2.py''' prints its task ID.
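The task ID can also be used to pick per-task input. As a rough sketch (the file '''filelist.txt''' and the script '''process.py''' are hypothetical, not part of the workshop materials):
{{{
#!/bin/bash
#SBATCH --array=1-10

# Select the line of filelist.txt whose line number equals this task's ID
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
python process.py "$INPUT"
}}}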

==== Submit a Job ====
{{{
[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958
}}}

After the job completes, you will see 10 new output files (it may take a while):
{{{
[fuji@cypress1 JobArray1]$ ls
hello2.py slurm-773958_1.out slurm-773958_3.out slurm-773958_5.out slurm-773958_7.out slurm-773958_9.out
slurm-773958_0.out slurm-773958_2.out slurm-773958_4.out slurm-773958_6.out slurm-773958_8.out slurmscript
}}}

The default log file name is '''slurm-[Array Job ID]_[Task ID].out'''.

{{{
[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
}}}

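If you prefer a different log file naming scheme, you can set one with the '''--output''' option; in Slurm output file name patterns, '''%A''' expands to the array job ID and '''%a''' to the array task ID. For example:
{{{
#SBATCH --output=hello_%A_%a.out # e.g. hello_773958_3.out for task 3
}}}
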
=== Use Array Task ID to Define the Script File Name ===

Go to the '''JobArray2''' directory under '''workshop''':
{{{
[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py script01.sh script03.sh script05.sh script07.sh script09.sh slurmscript2
script00.sh script02.sh script04.sh script06.sh script08.sh slurmscript1
[fuji@cypress1 JobArray2]$
}}}

Look at '''slurmscript1''':
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
}}}

The last line, '''script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh''', builds the script file name from '''$SLURM_ARRAY_TASK_ID''',
so it runs script00.sh, script01.sh, script02.sh, ... and script09.sh.

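If you want to generate a set of numbered scripts like these yourself, a simple loop works; here is a minimal sketch (the echoed command is only a placeholder for whatever each case should run):
{{{
for i in $(seq 0 9); do
    n=$(printf "%02d" $i)   # zero-padded index: 00, 01, ..., 09
    echo "echo \"Running case $n\"" > script${n}.sh
done
}}}
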
Submit the job:
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970
}}}

=== Cancel Jobs in a Job Array ===
Look at '''slurmscript2''':
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop # Quality of Service
#SBATCH --partition=workshop # partition
#SBATCH --job-name=python # Job Name
#SBATCH --time=00:01:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh

sleep 60
}}}

There is a '''sleep 60''' at the end, so each task runs for about 60 seconds.
Submit the job and look at the jobs running/queued:
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:04 1 cypress01-117
773980_1 worksh python fuji R 0:04 1 cypress01-117
773980_2 worksh python fuji R 0:04 1 cypress01-117
773980_3 worksh python fuji R 0:04 1 cypress01-117
773980_4 worksh python fuji R 0:04 1 cypress01-117
773980_5 worksh python fuji R 0:04 1 cypress01-117
773980_6 worksh python fuji R 0:04 1 cypress01-117
773980_7 worksh python fuji R 0:04 1 cypress01-117
773980_8 worksh python fuji R 0:04 1 cypress01-117
773980_9 worksh python fuji R 0:04 1 cypress01-117
}}}

To cancel '''773980_1''' only:
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:18 1 cypress01-117
773980_2 worksh python fuji R 0:18 1 cypress01-117
773980_3 worksh python fuji R 0:18 1 cypress01-117
773980_4 worksh python fuji R 0:18 1 cypress01-117
773980_5 worksh python fuji R 0:18 1 cypress01-117
773980_6 worksh python fuji R 0:18 1 cypress01-117
773980_7 worksh python fuji R 0:18 1 cypress01-117
773980_8 worksh python fuji R 0:18 1 cypress01-117
773980_9 worksh python fuji R 0:18 1 cypress01-117
}}}

To cancel tasks 5 through 8:
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
773980_0 worksh python fuji R 0:30 1 cypress01-117
773980_2 worksh python fuji R 0:30 1 cypress01-117
773980_3 worksh python fuji R 0:30 1 cypress01-117
773980_4 worksh python fuji R 0:30 1 cypress01-117
773980_9 worksh python fuji R 0:30 1 cypress01-117
}}}

To cancel all tasks:
{{{
[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
}}}
