Job Array

If you have not done so yet, download the samples with:

git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git


If you are running a large number of serial or parallel jobs that are independent of each other, submit them as a job array to make the best use of your allocated resources.

Use Array Task ID as a Parameter to the code

Go to the JobArray1 directory under workshop:

[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py  slurmscript

The directory contains a Python script, hello2.py:

[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket

if (len(sys.argv) < 2):
    print '%s [taskID]' % (sys.argv[0])
    sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID

This Python script takes a number from the command line and prints it. You can run it on the login node:

[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123

Look at slurmscript,

[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID

array

This line in slurmscript defines the job array:

#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

The argument can be specific array index values, a range of index values, and an optional step size. For example,

#SBATCH --array=0-31  # Submit a job array with index values between 0 and 31
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
#SBATCH --array=1-7:2 # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)

Note that the minimum index value is zero.
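
To make the three index forms concrete, the following Python 3 sketch expands an index string into the list of task IDs it describes. The function name expand_array_spec is hypothetical and is not part of Slurm; the sketch handles only the forms listed above (single values, ranges, and a range with a step size).

```python
def expand_array_spec(spec):
    """Expand an --array style index string into a sorted list of task IDs.

    Handles only comma-separated values, ranges (first-last) and
    ranges with a step size (first-last:step).
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        step = 1
        if ":" in part:
            # "1-7:2" -> range text "1-7" and step 2
            part, step_text = part.split(":", 1)
            step = int(step_text)
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
            # The last index is inclusive, hence last + 1
            indices.update(range(first, last + 1, step))
        else:
            indices.add(int(part))
    return sorted(indices)


if __name__ == "__main__":
    for spec in ("0-31", "1,3,5,7", "1-7:2"):
        print(spec, "->", expand_array_spec(spec))
```

The second and third index strings expand to the same task IDs, which is why the step form is convenient for long, regularly spaced lists.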

Job ID and Environment Variables

Job arrays have several additional environment variables set. SLURM_ARRAY_JOB_ID is set to the first job ID of the array. SLURM_ARRAY_TASK_ID is set to the job array index value. SLURM_ARRAY_TASK_COUNT is set to the number of tasks in the job array. SLURM_ARRAY_TASK_MAX is set to the highest job array index value. SLURM_ARRAY_TASK_MIN is set to the lowest job array index value.
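
A program can also read these variables itself instead of receiving them as arguments. The Python 3 sketch below collects them into a dictionary; the function name read_array_environment is hypothetical, and the sketch assumes that every variable, when set, holds an integer. Outside a job array (for example on the login node) the variables are unset and the values are None.

```python
import os


def read_array_environment(env=None):
    """Return the job array variables as a dict of ints (None if unset)."""
    # Use the real environment unless a mapping is supplied (handy for testing)
    env = os.environ if env is None else env
    names = (
        "SLURM_ARRAY_JOB_ID",
        "SLURM_ARRAY_TASK_ID",
        "SLURM_ARRAY_TASK_COUNT",
        "SLURM_ARRAY_TASK_MAX",
        "SLURM_ARRAY_TASK_MIN",
    )
    values = {}
    for name in names:
        text = env.get(name)
        values[name] = int(text) if text is not None else None
    return values


if __name__ == "__main__":
    for name, value in read_array_environment().items():
        print(name, "=", value)
```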

In slurmscript above, SLURM_ARRAY_TASK_ID is passed to the Python code as a command-line argument,

python hello2.py $SLURM_ARRAY_TASK_ID

so that hello2.py prints the task ID.
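
In practice the task ID is usually used to pick a different input for each task rather than just being printed. The Python 3 sketch below shows one way this might look; the TEMPERATURES list and the select_parameter function are hypothetical and stand in for whatever parameters your own code sweeps over. It takes the task ID from the command line as hello2.py does, and falls back to SLURM_ARRAY_TASK_ID (or 0) when no argument is given.

```python
import os
import sys

# Hypothetical parameter values, one per array task (indices 0-9).
TEMPERATURES = [250.0 + 10.0 * i for i in range(10)]


def select_parameter(task_id, parameters):
    """Return the parameter assigned to the given array task ID."""
    if not 0 <= task_id < len(parameters):
        raise IndexError("task ID %d has no parameter" % task_id)
    return parameters[task_id]


def main(argv):
    if len(argv) > 1:
        # Task ID given on the command line, as in slurmscript
        task_id = int(argv[1])
    else:
        # Fall back to the environment variable; 0 when run outside a job array
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    temperature = select_parameter(task_id, TEMPERATURES)
    print("Task %d uses temperature %.1f" % (task_id, temperature))


if __name__ == "__main__":
    main(sys.argv)
```

With --array=0-9 each of the ten tasks would then work on a different value from the list.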

Submit a Job

[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958

After the job has completed, you will see 10 new files (this may take a while):

[fuji@cypress1 JobArray1]$ ls
hello2.py           slurm-773958_1.out  slurm-773958_3.out  slurm-773958_5.out  slurm-773958_7.out  slurm-773958_9.out
slurm-773958_0.out  slurm-773958_2.out  slurm-773958_4.out  slurm-773958_6.out  slurm-773958_8.out  slurmscript

The default log file name is slurm-[Job ID]_[Task_ID].out.

[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
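
Because the log file names follow a fixed pattern, it is easy to check whether every task has produced its output. The Python 3 sketch below builds the expected names and reports the ones that are absent; the function names log_file_name and missing_logs are hypothetical, and the sketch assumes the default log file name is in use.

```python
import os


def log_file_name(job_id, task_id):
    """Build the default log file name slurm-[Job ID]_[Task_ID].out."""
    return "slurm-%d_%d.out" % (job_id, task_id)


def missing_logs(job_id, task_ids, existing_names):
    """Return the expected log file names that are not in existing_names."""
    present = set(existing_names)
    expected = [log_file_name(job_id, task_id) for task_id in task_ids]
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Job ID 773958 and indices 0-9 are taken from the example run above.
    missing = missing_logs(773958, range(10), os.listdir("."))
    print("missing logs:", missing)
```

An empty list means that all ten tasks have written their log files.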

Use Array Task ID to define the script file name

Go to the JobArray2 directory under workshop:

[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py    script01.sh  script03.sh  script05.sh  script07.sh  script09.sh   slurmscript2
script00.sh  script02.sh  script04.sh  script06.sh  script08.sh  slurmscript1
[fuji@cypress1 JobArray2]$

Look at slurmscript1

[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh

The last line, script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh, builds the file name from $SLURM_ARRAY_TASK_ID, so the tasks run script00.sh, script01.sh, script02.sh, … and script09.sh.
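
The same zero-padded naming can be reproduced in Python, for example when generating the per-task scripts or input files beforehand. This is a minimal Python 3 sketch; the function name script_name is hypothetical. It uses the same %02d format as the printf call, which pads the task ID with zeros to a width of two digits.

```python
def script_name(task_id):
    """Return the script file name for a task, e.g. 3 -> script03.sh."""
    # %02d: decimal integer, zero-padded to at least two digits
    return "script%02d.sh" % task_id


if __name__ == "__main__":
    for task_id in range(10):
        print(task_id, "->", script_name(task_id))
```

Note that a width of two covers indices 0 to 99; a larger array would need a wider format (and matching file names) to keep the names aligned.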

Submit the job:

[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970

Cancel Jobs in Job Array

Look at slurmscript2

[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)
#SBATCH --array=0-9             # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh

sleep 60

The script ends with sleep 60, so each task keeps running for about 60 seconds. Submit the job and look at the running and queued jobs.

[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_1 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_5 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_6 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_7 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_8 worksh             python     fuji  R       0:04  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:04  1 cypress01-117

To cancel 773980_1 only,

[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_5 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_6 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_7 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_8 worksh             python     fuji  R       0:18  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:18  1 cypress01-117

To cancel tasks 5-8,

[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
  773980_0 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_2 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_3 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_4 worksh             python     fuji  R       0:30  1 cypress01-117
  773980_9 worksh             python     fuji  R       0:30  1 cypress01-117

To cancel all tasks,

[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)

Last modified on Aug 21, 2019 11:23:22 AM