= Job Array =

If you haven't done so yet, download the samples by:
{{{
git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git
}}}

----

If you are running a large number of serial/parallel jobs that are independent of each other, it is recommended to submit them as a job array to make the best use of your allocated resources.

=== Use Array Task ID as a Parameter to the code ===
Get into the '''JobArray1''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py  slurmscript
}}}

There is a Python script (written in Python 2 syntax):
{{{
[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket
if (len(sys.argv) < 2):
    print '%s [taskID]' % (sys.argv[0])
    sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID
}}}

This script takes a number from the command line and prints a greeting, the current time, the hostname, and the given task ID. You can run it on the login node as,
{{{
[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123
}}}

Look at '''slurmscript''',
{{{
[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop          # Quality of Service
#SBATCH --partition=workshop    # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID
}}}

==== array ====
The line
{{{
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9
}}}
defines the job array. The argument can be specific array index values, a range of index values, and an optional step size.
For example,
{{{
#SBATCH --array=0-31    # Submit a job array with index values between 0 and 31
}}}
{{{
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
}}}
{{{
#SBATCH --array=1-7:2   # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)
}}}
Note that the minimum index value is zero.

==== Job ID and Environment Variables ====
Job arrays will have two additional environment variables set. '''SLURM_ARRAY_JOB_ID''' will be set to the first job ID of the array. '''SLURM_ARRAY_TASK_ID''' will be set to the job array index value.

~~'''SLURM_ARRAY_TASK_COUNT''' will be set to the number of tasks in the job array.~~
~~'''SLURM_ARRAY_TASK_MAX''' will be set to the highest job array index value.~~
~~'''SLURM_ARRAY_TASK_MIN''' will be set to the lowest job array index value.~~

In '''slurmscript''' above, SLURM_ARRAY_TASK_ID is given to the Python code as,
{{{
python hello2.py $SLURM_ARRAY_TASK_ID
}}}
so that '''hello2.py''' prints the task ID.

==== Submit a Job ====
{{{
[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958
}}}

After the job has completed, you will see 10 new files (it may take a while):
{{{
[fuji@cypress1 JobArray1]$ ls
hello2.py           slurm-773958_1.out  slurm-773958_3.out  slurm-773958_5.out  slurm-773958_7.out  slurm-773958_9.out
slurm-773958_0.out  slurm-773958_2.out  slurm-773958_4.out  slurm-773958_6.out  slurm-773958_8.out  slurmscript
}}}

The default log file name is '''slurm-[Job ID]_[Task ID].out'''.
{{{
[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
}}}

=== Use Array Task ID to define the script file name ===
Get into the '''JobArray2''' directory under '''workshop''',
{{{
[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py    script01.sh  script03.sh  script05.sh  script07.sh  script09.sh  slurmscript2
script00.sh  script02.sh  script04.sh  script06.sh  script08.sh  slurmscript1
[fuji@cypress1 JobArray2]$
}}}

Look at '''slurmscript1''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop          # Quality of Service
#SBATCH --partition=workshop    # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
}}}

The last line, '''sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh''', constructs the script file name from '''$SLURM_ARRAY_TASK_ID''' (zero-padded to two digits), so it will run script00.sh, script01.sh, script02.sh, ..., and script09.sh. Submit the job,
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970
}}}

=== Cancel Jobs in Job Array ===
Look at '''slurmscript2''',
{{{
[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop          # Quality of Service
#SBATCH --partition=workshop    # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
sleep 60
}}}

There is a '''sleep 60''' so each task runs for 60 seconds. Submit the job and look at the jobs running/queued.
{{{
[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS   NAME   USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh python   fuji  R  0:04  1 cypress01-117
  773980_1 worksh python   fuji  R  0:04  1 cypress01-117
  773980_2 worksh python   fuji  R  0:04  1 cypress01-117
  773980_3 worksh python   fuji  R  0:04  1 cypress01-117
  773980_4 worksh python   fuji  R  0:04  1 cypress01-117
  773980_5 worksh python   fuji  R  0:04  1 cypress01-117
  773980_6 worksh python   fuji  R  0:04  1 cypress01-117
  773980_7 worksh python   fuji  R  0:04  1 cypress01-117
  773980_8 worksh python   fuji  R  0:04  1 cypress01-117
  773980_9 worksh python   fuji  R  0:04  1 cypress01-117
}}}

To cancel '''773980_1''' only,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS   NAME   USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh python   fuji  R  0:18  1 cypress01-117
  773980_2 worksh python   fuji  R  0:18  1 cypress01-117
  773980_3 worksh python   fuji  R  0:18  1 cypress01-117
  773980_4 worksh python   fuji  R  0:18  1 cypress01-117
  773980_5 worksh python   fuji  R  0:18  1 cypress01-117
  773980_6 worksh python   fuji  R  0:18  1 cypress01-117
  773980_7 worksh python   fuji  R  0:18  1 cypress01-117
  773980_8 worksh python   fuji  R  0:18  1 cypress01-117
  773980_9 worksh python   fuji  R  0:18  1 cypress01-117
}}}

To cancel tasks 5-8,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS   NAME   USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh python   fuji  R  0:30  1 cypress01-117
  773980_2 worksh python   fuji  R  0:30  1 cypress01-117
  773980_3 worksh python   fuji  R  0:30  1 cypress01-117
  773980_4 worksh python   fuji  R  0:30  1 cypress01-117
  773980_9 worksh python   fuji  R  0:30  1 cypress01-117
}}}

To cancel all tasks,
{{{
[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS   NAME   USER ST  TIME NO NODELIST(REASON)
}}}
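Besides passing '''$SLURM_ARRAY_TASK_ID''' on the command line, a task can read it directly from its environment, since SLURM exports it to every task in the array. Below is a minimal sketch of this pattern in Python; the function name '''task_input_file''' and the '''inputNN.txt''' naming scheme are hypothetical, chosen by analogy with the '''scriptNN.sh''' scheme above, and the fallback to index 0 when run outside a job array is an assumption for convenient testing, not SLURM behavior.

```python
import os

def task_input_file(prefix="input", ext=".txt"):
    # SLURM sets SLURM_ARRAY_TASK_ID in each array task's environment,
    # so the script needs no command-line argument to know its index.
    # Outside a job array the variable is unset; default to 0 here.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    # Zero-pad to two digits, matching the script00.sh ... script09.sh scheme.
    return "%s%02d%s" % (prefix, task_id, ext)

if __name__ == "__main__":
    print(task_input_file())
```

Each task of the array would then open a different input file without any per-task script, which is often simpler than generating one shell script per index.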