Job Array
If you haven't done so yet, download the samples with:
git clone https://hidekiCCS@bitbucket.org/hidekiCCS/hpc-workshop.git
If you are running a large number of serial or parallel jobs that are independent of each other, submit them as a job array to make the best use of your allocated resources.
Use Array Task ID as a Parameter to the code
Go into the JobArray1 directory under workshop,
[fuji@cypress1 ~]$ cd workshop
[fuji@cypress1 workshop]$ cd JobArray1/
[fuji@cypress1 JobArray1]$ ls
hello2.py  slurmscript
There is a Python script, hello2.py:
[fuji@cypress1 JobArray1]$ cat hello2.py
# HELLO PYTHON
import sys
import datetime
import socket

if (len(sys.argv) < 2):
    print '%s [taskID]' % (sys.argv[0])
    sys.exit()
#
taskID = int(sys.argv[1])
now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
print 'My Task ID = %d' % taskID
This Python script takes a number from the command line and prints it, along with the current time and the host name. You can run it on the login node as,
[fuji@cypress1 JobArray1]$ python ./hello2.py 123
Hello, world!
2018-08-22T13:47:28.572176
cypress1
My Task ID = 123
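The hello2.py listing uses Python 2 print statements. As a rough sketch only, the same logic could be written for Python 3 as follows. The names build_report and main are illustrative and are not part of the workshop repository, and the now and hostname parameters are added here purely so the function can be checked without depending on the clock or the machine.

```python
import datetime
import socket
import sys


def build_report(task_id, now=None, hostname=None):
    """Return the four lines that hello2.py prints, as a list of strings."""
    # now and hostname can be supplied by the caller; otherwise they are
    # taken from the system, as in the original script.
    if now is None:
        now = datetime.datetime.now()
    if hostname is None:
        hostname = socket.gethostname()
    return [
        "Hello, world!",
        now.isoformat(),
        hostname,
        "My Task ID = %d" % task_id,
    ]


def main(argv):
    """Print the report for the task ID in argv[1]; return an exit status."""
    if len(argv) < 2:
        print("%s [taskID]" % argv[0])
        return 1
    for line in build_report(int(argv[1])):
        print(line)
    return 0
```

Ending the file with sys.exit(main(sys.argv)) would let it be run from the command line in the same way as hello2.py.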
Look at slurmscript,
[fuji@cypress1 JobArray1]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python         # Job Name
#SBATCH --time=00:01:00           # WallTime
#SBATCH --nodes=1                 # Number of Nodes
#SBATCH --ntasks-per-node=1       # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1         # Number of threads per task (OMP threads)
#SBATCH --array=0-9               # Array of IDs=0,1,2,3,4,5,6,7,8,9

module load anaconda
python hello2.py $SLURM_ARRAY_TASK_ID
array
This line in slurmscript defines the job array:
#SBATCH --array=0-9 # Array of IDs=0,1,2,3,4,5,6,7,8,9
The argument can be a list of specific index values, a range of index values, or a range with an optional step size. For example,
#SBATCH --array=0-31 # Submit a job array with index values between 0 and 31
#SBATCH --array=1,3,5,7 # Submit a job array with index values of 1, 3, 5 and 7
#SBATCH --array=1-7:2 # Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7)
Note that the minimum index value is zero.
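To check which task IDs a given --array argument produces before submitting, the three forms shown above can be expanded with a small helper. This is a minimal sketch, not Slurm's own parser: the function name expand_array_spec is made up for illustration, and it handles only comma-separated values, ranges, and ranges with a step.

```python
def expand_array_spec(spec):
    """Expand an --array argument such as "0-9", "1,3,5,7" or "1-7:2"."""
    indices = []
    for part in spec.split(","):
        # An optional ":step" suffix applies to a range.
        rng, _, step_text = part.partition(":")
        step = int(step_text) if step_text else 1
        # A part is either a single index ("3") or a range ("1-7").
        first_text, dash, last_text = rng.partition("-")
        first = int(first_text)
        last = int(last_text) if dash else first
        # The minimum index value is zero, and ranges must be ascending.
        if first < 0 or last < first or step < 1:
            raise ValueError("invalid array spec: %r" % part)
        indices.extend(range(first, last + 1, step))
    return indices
```

For instance, expand_array_spec("1-7:2") and expand_array_spec("1,3,5,7") give the same list of indices.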
Job ID and Environment Variables
Job arrays will have five additional environment variables set:
SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
SLURM_ARRAY_TASK_ID will be set to the job array index value.
SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array.
SLURM_ARRAY_TASK_MAX will be set to the highest job array index value.
SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value.
In slurmscript above, SLURM_ARRAY_TASK_ID is passed to the Python code as a command-line argument,
python hello2.py $SLURM_ARRAY_TASK_ID
so that hello2.py prints the task ID.
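As an alternative to passing the task ID on the command line, a script can read the job array variables directly from its environment. The following is a sketch under that assumption. The helper name read_array_env is hypothetical, and the variables are only set inside a job array task, so unset ones are reported as None.

```python
import os

# The five job array variables described above.
ARRAY_VARIABLES = (
    "SLURM_ARRAY_JOB_ID",
    "SLURM_ARRAY_TASK_ID",
    "SLURM_ARRAY_TASK_COUNT",
    "SLURM_ARRAY_TASK_MAX",
    "SLURM_ARRAY_TASK_MIN",
)


def read_array_env(environ=None):
    """Return the job array variables as integers, or None where unset."""
    if environ is None:
        environ = os.environ
    values = {}
    for name in ARRAY_VARIABLES:
        raw = environ.get(name)
        values[name] = int(raw) if raw is not None else None
    return values
```

Passing a plain dictionary instead of os.environ makes it possible to try the function on the login node, where the variables are not defined.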
Submit a Job
[fuji@cypress1 JobArray1]$ sbatch slurmscript
Submitted batch job 773958
After the job completes (it may take a while), you will see 10 new files:
[fuji@cypress1 JobArray1]$ ls
hello2.py           slurm-773958_1.out  slurm-773958_3.out  slurm-773958_5.out  slurm-773958_7.out  slurm-773958_9.out
slurm-773958_0.out  slurm-773958_2.out  slurm-773958_4.out  slurm-773958_6.out  slurm-773958_8.out  slurmscript
The default log file name is slurm-[Job ID]_[Task_ID].out.
[fuji@cypress1 JobArray1]$ cat slurm-773958_9.out
Hello, world!
2018-08-22T14:01:44.492109
cypress01-117
My Task ID = 9
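Because the default log file name follows a fixed pattern, it is easy to check which tasks have not yet written a log. The sketch below assumes the default naming described above and that the logs are written to the directory being inspected. The function names log_name and missing_logs are illustrative.

```python
import os


def log_name(job_id, task_id):
    """Default log file name for one task of a job array."""
    return "slurm-%d_%d.out" % (job_id, task_id)


def missing_logs(job_id, task_ids, directory="."):
    """Return the task IDs whose default log file is not in directory."""
    present = set(os.listdir(directory))
    return [t for t in task_ids if log_name(job_id, t) not in present]
```

Called as missing_logs(773958, range(10)) in the JobArray1 directory, it returns an empty list once all ten log files are present.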
Use Array Task ID to define the script file name
Go into the JobArray2 directory under workshop,
[fuji@cypress1 ~]$ cd workshop/JobArray2/
[fuji@cypress1 JobArray2]$ ls
hello2.py    script01.sh  script03.sh  script05.sh  script07.sh  script09.sh   slurmscript2
script00.sh  script02.sh  script04.sh  script06.sh  script08.sh  slurmscript1
[fuji@cypress1 JobArray2]$
Look at slurmscript1
[fuji@cypress1 JobArray2]$ cat slurmscript1
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python         # Job Name
#SBATCH --time=00:01:00           # WallTime
#SBATCH --nodes=1                 # Number of Nodes
#SBATCH --ntasks-per-node=1       # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1         # Number of threads per task (OMP threads)
#SBATCH --array=0-9               # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
The last line, script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh, builds the file name from $SLURM_ARRAY_TASK_ID by zero-padding it to two digits, so the ten tasks run script00.sh, script01.sh, script02.sh, … and script09.sh.
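The "%02d" format used with printf in the shell behaves the same way in Python's % formatting, which makes it easy to preview the file names a given range of task IDs will produce. This is a small sketch; script_name is a hypothetical helper, not something in the workshop samples.

```python
def script_name(task_id):
    """Build the per-task script name used in slurmscript1."""
    # "%02d" pads the number with leading zeros to at least two digits;
    # numbers that already have two or more digits are left unchanged.
    return "script%02d.sh" % task_id
```

Note that the padding is a minimum width, so a task ID of 10 or more would give names such as script10.sh, which must exist if the array range is extended.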
Submit the job,
[fuji@cypress1 JobArray2]$ sbatch slurmscript1
Submitted batch job 773970
Cancel Jobs in Job Array
Look at slurmscript2
[fuji@cypress1 JobArray2]$ cat slurmscript2
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python         # Job Name
#SBATCH --time=00:01:00           # WallTime
#SBATCH --nodes=1                 # Number of Nodes
#SBATCH --ntasks-per-node=1       # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1         # Number of threads per task (OMP threads)
#SBATCH --array=0-9               # Array of IDs=0,1,2,3,4,5,6,7,8,9

sh ./script$(printf "%02d" $SLURM_ARRAY_TASK_ID).sh
sleep 60
The script ends with sleep 60, so each task runs for at least 60 seconds. Submit the job and look at the running and queued jobs.
[fuji@cypress1 JobArray2]$ sbatch slurmscript2
Submitted batch job 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS    NAME  USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_1 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_2 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_3 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_4 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_5 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_6 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_7 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_8 worksh  python  fuji  R  0:04  1 cypress01-117
  773980_9 worksh  python  fuji  R  0:04  1 cypress01-117
To cancel 773980_1 only,
[fuji@cypress1 JobArray2]$ scancel 773980_1
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS    NAME  USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_2 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_3 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_4 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_5 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_6 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_7 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_8 worksh  python  fuji  R  0:18  1 cypress01-117
  773980_9 worksh  python  fuji  R  0:18  1 cypress01-117
To cancel tasks 5-8,
[fuji@cypress1 JobArray2]$ scancel 773980_[5-8]
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS    NAME  USER ST  TIME NO NODELIST(REASON)
  773980_0 worksh  python  fuji  R  0:30  1 cypress01-117
  773980_2 worksh  python  fuji  R  0:30  1 cypress01-117
  773980_3 worksh  python  fuji  R  0:30  1 cypress01-117
  773980_4 worksh  python  fuji  R  0:30  1 cypress01-117
  773980_9 worksh  python  fuji  R  0:30  1 cypress01-117
To cancel all tasks,
[fuji@cypress1 JobArray2]$ scancel 773980
[fuji@cypress1 JobArray2]$ squeue -u fuji
     JOBID    QOS    NAME  USER ST  TIME NO NODELIST(REASON)
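The three scancel forms used above (one task, a range of tasks, the whole array) differ only in the job identifier string. As an illustration, the identifier could be built as in the sketch below. The function scancel_target is hypothetical and only formats the argument; it does not run scancel.

```python
def scancel_target(job_id, first=None, last=None):
    """Build the job identifier to pass to scancel for a job array.

    With no task given, the whole array is addressed; with one task,
    a single element; with first and last, an inclusive range.
    """
    if first is None:
        return str(job_id)
    if last is None or last == first:
        return "%d_%d" % (job_id, first)
    return "%d_[%d-%d]" % (job_id, first, last)
```

For the job in this section, scancel_target(773980, 5, 8) yields the string used in the range example. In an interactive shell, quoting an argument that contains square brackets keeps the shell from treating it as a file name pattern.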