
Job Dependency

If you have not done so yet, download the samples with:

git clone https://hidekiCCS:@bitbucket.org/hidekiCCS/hpc-workshop.git


Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the --dependency option of the sbatch command.

sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...

Dependency types:

  • after:jobid[:jobid…] job can begin after the specified jobs have started
  • afterany:jobid[:jobid…] job can begin after the specified jobs have terminated
  • afternotok:jobid[:jobid…] job can begin after the specified jobs have failed
  • afterok:jobid[:jobid…] job can begin after the specified jobs have run to completion with an exit code of zero (see the user guide for caveats).
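
As a small illustration of the syntax above, the sketch below assembles a --dependency option from a dependency type and a list of job IDs. The function name dependency_flag is a hypothetical helper, not part of Slurm, and the job IDs are only examples.

```shell
#!/bin/bash
# Build a --dependency option from a dependency type and one or more job IDs.
# dependency_flag is a hypothetical helper name, not a Slurm command.
dependency_flag() {
    local type=$1
    shift
    local ids
    # Join the remaining arguments with ':' as the sbatch syntax requires.
    ids=$(IFS=:; echo "$*")
    echo "--dependency=${type}:${ids}"
}

# Example: wait until both jobs have finished with an exit code of zero.
dependency_flag afterok 773997 773998
```

The printed string can be placed directly on an sbatch command line in front of the script name.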

Submitting Dependent Jobs

Change to the JobDependencies directory under workshop,

[fuji@cypress1 ~]$ cd workshop/
[fuji@cypress1 workshop]$ cd JobDependencies
[fuji@cypress1 JobDependencies]$ ls
addOne.py  number.dat  script.sh  slurmscript  SubmitDependentJobs.sh

The Python code addOne.py reads an integer from number.dat, adds one, and stores the result back to number.dat.

[fuji@cypress1 JobDependencies]$ cat addOne.py
# HELLO PYTHON
import datetime
import socket

now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
#
with open('number.dat','r') as f:
	data = f.readline()
	number = int(data)
#
print "Number = %d" % number
with open('number.dat','w') as f:
	f.write(str(number + 1))
#

slurmscript just runs the code,

[fuji@cypress1 JobDependencies]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python       # Job Name
#SBATCH --time=00:01:00         # WallTime
#SBATCH --nodes=1               # Number of Nodes
#SBATCH --ntasks-per-node=1     # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1       # Number of threads per task (OMP threads)

module load anaconda
python addOne.py

sleep 10

Let's submit one job and then submit another job that depends on the first job,

[fuji@cypress1 JobDependencies]$ sbatch slurmscript
Submitted batch job 773997
[fuji@cypress1 JobDependencies]$ sbatch --dependency=afterok:773997 slurmscript
Submitted batch job 773998
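
Typing the job ID by hand can be avoided by capturing it from the sbatch output. The following is a minimal sketch, assuming sbatch prints a line of the form "Submitted batch job <id>" as in the session above; job_id_from_output is a hypothetical helper name.

```shell
#!/bin/bash
# Extract the job ID from the line that sbatch prints on success.
# job_id_from_output is a hypothetical helper name.
job_id_from_output() {
    # "Submitted batch job 773997" -> "773997" (fourth whitespace-separated field)
    awk '{print $4}' <<< "$1"
}

# Submit two chained jobs only where sbatch is available (i.e. on the cluster).
if command -v sbatch > /dev/null; then
    first=$(job_id_from_output "$(sbatch slurmscript)")
    sbatch --dependency=afterok:"$first" slurmscript
fi
```

sbatch also has a --parsable option that prints only the job ID, which may remove the need for the awk step; check the sbatch manual page on your system before relying on it.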

List the jobs,

[fuji@cypress1 JobDependencies]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
    773998 worksh             python     fuji PD       0:00  1 (Dependency)
    773997 worksh             python     fuji  R       0:00  1 cypress01-117

After the first job completes, the second job begins to run,

[fuji@cypress1 JobDependencies]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
    773998 worksh             python     fuji  R       0:05  1 cypress01-117

The results are:

[fuji@cypress1 JobDependencies]$ ls
addOne.py  number.dat  script.sh  slurm-773997.out  slurm-773998.out  slurmscript  SubmitDependentJobs.sh
[fuji@cypress1 JobDependencies]$ cat slurm-773997.out
Hello, world!
2018-08-22T14:55:37.421310
cypress01-117
Number = 2
[fuji@cypress1 JobDependencies]$ cat slurm-773998.out
Hello, world!
2018-08-22T14:55:47.619183
cypress01-117
Number = 3
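
The two outputs show why the ordering matters: the second job reads the value that the first job wrote. The sketch below imitates that sequence locally with a bash stand-in for addOne.py; the function add_one and the scratch file are assumptions made for illustration and are not part of the workshop samples.

```shell
#!/bin/bash
# Bash stand-in for addOne.py: print the stored number, then store number + 1.
# add_one is a hypothetical helper, not part of the workshop samples.
add_one() {
    local file=$1 number
    number=$(< "$file")
    echo "Number = $number"
    echo $((number + 1)) > "$file"
}

# Run twice in sequence on a scratch file, as the two chained jobs do.
scratch=$(mktemp)
echo 2 > "$scratch"
add_one "$scratch"   # prints "Number = 2"
add_one "$scratch"   # prints "Number = 3"
rm -f "$scratch"
```

Without the dependency, both jobs could start at the same time and read the same value from number.dat.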

Submitting Many Dependent Jobs with Bash Script

Look at SubmitDependentJobs.sh

[fuji@cypress1 JobDependencies]$ cat SubmitDependentJobs.sh
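#!/bin/bash
# Submit each script named on the command line as a job that depends on the previous one.
EMAIL=$USER@tulane.edu
WALLTIME_LIMIT=1:00:00
export WORKDIR=$(pwd)
#
QUEUE='--partition=workshop --qos=workshop'
WALLTIME="--time=$WALLTIME_LIMIT"
RESOURCE="--nodes=1 --ntasks-per-node=1 --cpus-per-task=1"
OTHERS="--export=ALL --mail-type=END --mail-user=$EMAIL"
#
JOB_SETTING="$QUEUE $WALLTIME $RESOURCE $OTHERS"

# The first job has no dependency.
DEPENDENCY=""

while [[ $# -gt 0 ]]
do
	# sbatch prints "Submitted batch job <id>"; the job ID is the fourth field.
	# $DEPENDENCY and $JOB_SETTING are left unquoted so they split into separate options.
	JOB=$(sbatch --job-name="$1" $DEPENDENCY $JOB_SETTING "./$1" | awk '{print $4}')
	echo "$JOB submitted"
	# The next job starts only after this one exits with code zero.
	DEPENDENCY="--dependency=afterok:$JOB"
	shift
done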

This bash script takes script names as command-line arguments and submits a sequence of dependent jobs with those scripts, each job starting only after the previous one has completed successfully.
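
To see the chaining without submitting anything, the loop can be tried as a dry run. This is only a sketch: chain_dry_run is a hypothetical function, the script names are placeholders, and the job IDs are made up (starting at 1000) because no job is actually submitted.

```shell
#!/bin/bash
# Dry run of the chaining loop: print each sbatch command instead of running it.
# The job IDs are made up (they start at 1000) because nothing is submitted.
chain_dry_run() {
    local dependency=""
    local fake_id=1000
    while [[ $# -gt 0 ]]; do
        # The first command has no dependency option; later ones depend on the previous ID.
        echo "sbatch --job-name=$1 ${dependency:+$dependency }./$1"
        dependency="--dependency=afterok:$fake_id"
        fake_id=$((fake_id + 1))
        shift
    done
}

chain_dry_run first.sh second.sh third.sh
```

Each printed command after the first carries an afterok dependency on the ID assigned to the command before it, which is the same pattern the submit script produces with real job IDs.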

The bash script script.sh just runs addOne.py,

[fuji@cypress1 JobDependencies]$ cat script.sh
#!/bin/bash

module load anaconda
python addOne.py

sleep 1

Let's submit script.sh 10 times,

[fuji@cypress1 JobDependencies]$ ./SubmitDependentJobs.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh
774001 submitted
774002 submitted
774003 submitted
774004 submitted
774005 submitted
774006 submitted
774007 submitted
774008 submitted
774009 submitted
774010 submitted
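
Writing the same script name ten times is error-prone. One possible shortcut is to generate the argument list; repeat_name below is a hypothetical helper, shown as a sketch rather than as part of the workshop samples.

```shell
#!/bin/bash
# Print the same script name a given number of times, one per line.
# repeat_name is a hypothetical helper name.
repeat_name() {
    local name=$1 count=$2 i
    for ((i = 0; i < count; i++)); do
        echo "$name"
    done
}

# On the cluster, the list can be passed to the submit script, for example:
#   ./SubmitDependentJobs.sh $(repeat_name script.sh 10)
repeat_name script.sh 3
```

The command substitution is left unquoted in the commented usage line so that each name becomes a separate argument.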

List jobs,

[fuji@cypress1 JobDependencies]$ squeue -u fuji
     JOBID    QOS               NAME     USER ST       TIME NO NODELIST(REASON)
    774005 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774006 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774007 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774008 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774009 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774010 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774004 worksh          script.sh     fuji  R       0:01  1 cypress01-117
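
To check how many jobs are still waiting in the chain, the squeue output can be filtered for the (Dependency) reason. The sketch below reads squeue-style text from standard input so that it can be tried without a cluster; count_waiting is a hypothetical helper, and the sample lines are copied from the listing above.

```shell
#!/bin/bash
# Count jobs that squeue reports as waiting on a dependency.
# Reads squeue-style text on stdin so it can be tried without a cluster.
count_waiting() {
    grep -c '(Dependency)'
}

# Sample lines copied from the squeue listing; on the cluster use:
#   squeue -u "$USER" | count_waiting
count_waiting <<'EOF'
    774009 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774010 worksh          script.sh     fuji PD       0:00  1 (Dependency)
    774004 worksh          script.sh     fuji  R       0:01  1 cypress01-117
EOF
```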
Last modified on Aug 21, 2019 11:23:46 AM