Job Dependency
If you have not done so yet, download the sample files with:
svn co file:///home/fuji/repos/workshop ./workshop
To check out the sample files onto a local machine (Linux shell):
svn co svn+ssh://USERID@cypress1.tulane.edu/home/fuji/repos/workshop ./workshop
Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the --dependency option of the sbatch command.
sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...
Dependency types:
- after:jobid[:jobid…] job can begin after the specified jobs have started
- afterany:jobid[:jobid…] job can begin after the specified jobs have terminated
- afternotok:jobid[:jobid…] job can begin after the specified jobs have failed
- afterok:jobid[:jobid…] job can begin after the specified jobs have run to completion with an exit code of zero (see the user guide for caveats).
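The option syntax above can also be assembled programmatically. The following is a minimal sketch in Python; the helper name `dependency_option` is hypothetical (it is not part of Slurm), it only knows the four dependency types listed here, and the job IDs in the second call are placeholders.

```python
# Hypothetical helper (not part of Slurm): builds the string passed to
# sbatch as --dependency from (type, job IDs) pairs.
VALID_TYPES = ("after", "afterany", "afternotok", "afterok")


def dependency_option(*clauses):
    """Return a '--dependency=...' string.

    Each clause is a (dep_type, job_ids) pair, e.g. ("afterok", [773997]).
    """
    if not clauses:
        raise ValueError("at least one dependency clause is required")
    parts = []
    for dep_type, job_ids in clauses:
        if dep_type not in VALID_TYPES:
            raise ValueError("unknown dependency type: %s" % dep_type)
        if not job_ids:
            raise ValueError("no job IDs given for %s" % dep_type)
        # One clause has the form type:job_id[:job_id...]
        parts.append(":".join([dep_type] + [str(j) for j in job_ids]))
    # Clauses are separated by commas, as in the syntax line above.
    return "--dependency=" + ",".join(parts)


if __name__ == "__main__":
    print(dependency_option(("afterok", [773997])))
    # Placeholder job IDs, for illustration only.
    print(dependency_option(("afterok", [1001, 1002]), ("afterany", [1003])))
```

The returned string can be placed directly on the sbatch command line in front of the script name.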
Submitting Dependent Jobs
Change into the JobDependencies directory under workshop:
[fuji@cypress1 ~]$ cd workshop/
[fuji@cypress1 workshop]$ cd JobDependencies
[fuji@cypress1 JobDependencies]$ ls
addOne.py  number.dat  script.sh  slurmscript  SubmitDependentJobs.sh
The Python code addOne.py reads an integer from number.dat, prints it, adds one, and stores the result back in number.dat.
[fuji@cypress1 JobDependencies]$ cat addOne.py
# HELLO PYTHON
import datetime
import socket

now = datetime.datetime.now()
print 'Hello, world!'
print now.isoformat()
print socket.gethostname()
#
with open('number.dat','r') as f:
    data = f.readline()
    number = int(data)
#
print "Number = %d" % number
with open('number.dat','w') as f:
    f.write(str(number + 1))
#
slurmscript just runs the code:
[fuji@cypress1 JobDependencies]$ cat slurmscript
#!/bin/bash
#SBATCH --qos=workshop            # Quality of Service
#SBATCH --partition=workshop      # partition
#SBATCH --job-name=python         # Job Name
#SBATCH --time=00:01:00           # WallTime
#SBATCH --nodes=1                 # Number of Nodes
#SBATCH --ntasks-per-node=1       # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1         # Number of threads per task (OMP threads)

module load anaconda
python addOne.py
sleep 10
Let's submit one job and then submit another job that depends on the first one:
[fuji@cypress1 JobDependencies]$ sbatch slurmscript
Submitted batch job 773997
[fuji@cypress1 JobDependencies]$ sbatch --dependency=afterok:773997 slurmscript
Submitted batch job 773998
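The job ID needed for the dependency comes from the line that sbatch prints. The following is a minimal sketch in Python, assuming the output has the form "Submitted batch job <id>" as shown above; the function names `parse_job_id` and `dependent_command` are hypothetical. Nothing is submitted here: the second command is only built as an argument list.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from a 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("unexpected sbatch output: %r" % sbatch_output)
    return int(match.group(1))


def dependent_command(first_job_output, script, dep_type="afterok"):
    """Build the sbatch argument list for a job that depends on the first."""
    job_id = parse_job_id(first_job_output)
    return ["sbatch", "--dependency=%s:%d" % (dep_type, job_id), script]


if __name__ == "__main__":
    cmd = dependent_command("Submitted batch job 773997", "slurmscript")
    print(" ".join(cmd))
```

On a machine where sbatch is available, a list built this way could be passed to `subprocess.run`. sbatch also has a `--parsable` option that prints the job ID without the surrounding text, which would make the regular expression unnecessary.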
List the jobs:
[fuji@cypress1 JobDependencies]$ squeue -u fuji
 JOBID    QOS     NAME    USER  ST  TIME  NO  NODELIST(REASON)
773998  worksh  python    fuji  PD  0:00   1  (Dependency)
773997  worksh  python    fuji   R  0:00   1  cypress01-117
After the first job has completed, the second job begins to run:
[fuji@cypress1 JobDependencies]$ squeue -u fuji
 JOBID    QOS     NAME    USER  ST  TIME  NO  NODELIST(REASON)
773998  worksh  python    fuji   R  0:05   1  cypress01-117
The results are:
[fuji@cypress1 JobDependencies]$ ls
addOne.py  number.dat  script.sh  slurm-773997.out  slurm-773998.out  slurmscript  SubmitDependentJobs.sh
[fuji@cypress1 JobDependencies]$ cat slurm-773997.out
Hello, world!
2018-08-22T14:55:37.421310
cypress01-117
Number = 2
[fuji@cypress1 JobDependencies]$ cat slurm-773998.out
Hello, world!
2018-08-22T14:55:47.619183
cypress01-117
Number = 3
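These outputs show why the second job has to wait for the first: each run reads the value that the previous run wrote. The following is a minimal sketch of that read-increment-write step as a Python 3 function operating on a temporary file. The function name `add_one` and the temporary file are for illustration only; the workshop's own addOne.py is the listing shown earlier.

```python
import os
import tempfile


def add_one(path):
    """Read an integer from path, write back integer + 1, return the value read."""
    with open(path, "r") as f:
        number = int(f.readline())
    with open(path, "w") as f:
        f.write(str(number + 1))
    return number


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "number.dat")
        with open(path, "w") as f:
            f.write("2")  # starting value matching the first output above
        # Two runs in sequence, like jobs 773997 and 773998.
        print("Number = %d" % add_one(path))
        print("Number = %d" % add_one(path))
        with open(path) as f:
            print("number.dat now holds", f.read())
```

If the two runs overlapped instead of running one after the other, both could read the same starting value, which is what the afterok dependency prevents.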
Submitting Many Dependent Jobs with Bash Script
Look at SubmitDependentJobs.sh
[fuji@cypress1 JobDependencies]$ cat SubmitDependentJobs.sh
#!/bin/bash
EMAIL=$USER@tulane.edu
WALLTIME_LIMIT=1:00:00
export WORKDIR=`pwd`
#
QUEUE='--partition=workshop --qos=workshop'
WALLTIME="--time=$WALLTIME_LIMIT"
RESORCE="--nodes=1 --ntasks-per-node=1 --cpus-per-task=1"
OTHERS="--export=ALL --mail-type=END --mail-user=$EMAIL"
#
JOB_SETTING="$QUEUE $WALLTIME $RESORCE $OTHERS"
DEPENDENCY=""
while [[ $# > 0 ]]
do
    JOB=`sbatch --job-name=$DIRNAME$1 $DEPENDENCY $JOB_SETTING ./$1 | awk '{print $4}'`;
    echo $JOB submitted;
    DEPENDENCY="--dependency=afterok:$JOB" ;
    shift
done
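This bash script takes script names as command-line arguments and submits them as a sequence of dependent jobs: the first job has no dependency, and each later job is submitted with --dependency=afterok on the job submitted just before it.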
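The same chaining logic can be sketched in Python. This is a minimal sketch, not a replacement for the workshop script: the function names `sbatch_submit` and `submit_chain` are hypothetical, the queue, walltime, and mail settings are left out for brevity, and it assumes sbatch prints "Submitted batch job <id>" so that the ID is the fourth field (the same field the awk command above extracts). The submit function is passed in as a parameter so the chain can be tried without Slurm.

```python
import subprocess


def sbatch_submit(args):
    """Run sbatch with the given arguments and return the new job ID.

    Assumes the output is 'Submitted batch job <id>', so the ID is the
    fourth whitespace-separated field.
    """
    result = subprocess.run(
        ["sbatch"] + args, check=True, capture_output=True, text=True
    )
    return result.stdout.split()[3]


def submit_chain(scripts, submit=sbatch_submit, dep_type="afterok"):
    """Submit scripts in order; each job waits for the previous one."""
    job_ids = []
    dependency = []  # the first job has no dependency
    for script in scripts:
        job_id = submit(dependency + ["--job-name=" + script, "./" + script])
        job_ids.append(job_id)
        dependency = ["--dependency=%s:%s" % (dep_type, job_id)]
    return job_ids


if __name__ == "__main__":
    # Dry run with a stand-in submit function (no Slurm needed): it records
    # the arguments and hands out made-up job IDs.
    calls = []

    def fake_submit(args):
        calls.append(args)
        return str(100 + len(calls))

    print(submit_chain(["script.sh"] * 3, submit=fake_submit))
    for args in calls:
        print("sbatch " + " ".join(args))
```

Calling `submit_chain` without the `submit` argument would invoke the real sbatch, so that form only makes sense on a login node where Slurm is installed.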
The bash script script.sh is:
[fuji@cypress1 JobDependencies]$ cat script.sh
#!/bin/bash
module load anaconda
python addOne.py
sleep 1
It loads the anaconda module, runs addOne.py, and then sleeps for one second.
Let's submit script.sh 10 times:
[fuji@cypress1 JobDependencies]$ ./SubmitDependentJobs.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh script.sh
774001 submitted
774002 submitted
774003 submitted
774004 submitted
774005 submitted
774006 submitted
774007 submitted
774008 submitted
774009 submitted
774010 submitted
List the jobs:
[fuji@cypress1 JobDependencies]$ squeue -u fuji
 JOBID    QOS     NAME       USER  ST  TIME  NO  NODELIST(REASON)
774005  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774006  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774007  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774008  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774009  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774010  worksh  script.sh    fuji  PD  0:00   1  (Dependency)
774004  worksh  script.sh    fuji   R  0:01   1  cypress01-117
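To see how many jobs in the chain are still waiting, the listing can be filtered for the PD state with the (Dependency) reason. The following is a minimal sketch in Python that assumes the eight-column layout shown above, which depends on how squeue is configured on the cluster and may differ elsewhere; the function name `pending_on_dependency` is hypothetical, and the sample text is an excerpt of the listing above.

```python
def pending_on_dependency(squeue_text):
    """Return job IDs whose state is PD and whose reason is (Dependency)."""
    job_ids = []
    for line in squeue_text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 8 or fields[0] == "JOBID":
            continue
        job_id, state, reason = fields[0], fields[4], fields[7]
        if state == "PD" and reason == "(Dependency)":
            job_ids.append(job_id)
    return job_ids


# Excerpt of the squeue listing shown above.
SAMPLE = """\
JOBID QOS NAME USER ST TIME NO NODELIST(REASON)
774005 worksh script.sh fuji PD 0:00 1 (Dependency)
774006 worksh script.sh fuji PD 0:00 1 (Dependency)
774004 worksh script.sh fuji R 0:01 1 cypress01-117
"""

if __name__ == "__main__":
    print(pending_on_dependency(SAMPLE))
```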