Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs
Here we consider a pipeline that performs the following steps.
- Transfer data files from Box to the Cypress Lustre directory.
- Perform computation using the data.
- Transfer the results to Box.
- Delete files in Cypress Lustre.
Scripts
Job Submission Script
submitJob.sh
#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"

# Set paths
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"

# Submit a job to transfer data from Box to Cypress
JOB1=$(sbatch --job-name=${JOB_NAME}_DL ./transferData.sh DOWNLOAD KEEP | awk '{print $4}')
echo $JOB1 "Submitted"

# Submit a job to process data on Cypress
JOB2=$(sbatch --job-name=${JOB_NAME} --dependency=afterok:$JOB1 ./computing.sh | awk '{print $4}')
echo $JOB2 "Submitted"

# Submit a job to transfer results from Cypress to Box
JOB3=$(sbatch --job-name=${JOB_NAME}_UL --dependency=afterok:$JOB2 ./transferData.sh UPLOAD DELETE | awk '{print $4}')
echo $JOB3 "Submitted"
JOB_NAME is the job name.
BOX_DATA_DIR is the directory in Box where the source data is stored.
CYPRESS_WORK_DIR is the directory where the downloaded data is stored.
BOX_RESULT_DIR is the directory in Box to which the results are uploaded.
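The job chaining in submitJob.sh works because sbatch prints a line of the form "Submitted batch job <ID>", so awk '{print $4}' picks out the numeric job ID for use in --dependency=afterok. A minimal sketch using a canned sbatch-style line (the job number 12345 is invented for illustration):

```shell
# Simulate sbatch's output; a real run would be: sbatch job.sh
SBATCH_OUTPUT="Submitted batch job 12345"

# The job ID is the fourth whitespace-separated field
JOB_ID=$(echo "$SBATCH_OUTPUT" | awk '{print $4}')
echo "$JOB_ID"
```

The later jobs then pass this ID to --dependency=afterok:$JOB_ID, so they start only if the earlier job completed with exit code 0.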
Data Transfer Script
transferData.sh
#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# Check options
if [ $# -ne 2 ]; then
    echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
    exit 1
fi

# Check paths
if [[ -z "${BOX_DATA_DIR}" ]]; then
    echo "ERROR! BOX_DATA_DIR isn't set."
    exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
    echo "ERROR! CYPRESS_WORK_DIR isn't set."
    exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
    echo "ERROR! BOX_RESULT_DIR isn't set."
    exit 1
fi
# TULANE_BOX (the UUID of the Box endpoint) must also be set in the environment
if [[ -z "${TULANE_BOX}" ]]; then
    echo "ERROR! TULANE_BOX isn't set."
    exit 1
fi

# Start Globus Connect Personal
module load globusconnectpersonal/3.2.5
globusconnect -start &

# Set up the Globus CLI environment
source activate globus-cli

# Obtain the local endpoint UUID
MY_UUID=$(globus endpoint local-id)
if [ $? -ne 0 ]; then
    echo "ERROR! Globus Connect isn't activated."
    globusconnect -stop
    exit 1
fi

# Build the source and destination paths
if [[ "$1" == "DOWNLOAD" ]]; then
    SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
    DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
    SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
    DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi

# Check that we are logged in to Globus
globus whoami >/dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "ERROR! Not logged in to Globus."
    globusconnect -stop
    exit 1
fi

# Start the transfer and capture the task ID
task_id=$(globus transfer "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME" | tail -1 | awk '{print $3}')
if [ $? -ne 0 ] || [ -z "$task_id" ]; then
    echo "ERROR! The data transfer could not be started."
    globusconnect -stop
    exit 1
fi

# Wait until the task is done
globus task wait "$task_id"
if [ $? -ne 0 ]; then
    echo "ERROR! The data transfer failed."
    globus task cancel "$task_id"
    globusconnect -stop
    exit 1
fi

# If the DELETE option is set, remove the source files
if [[ "$2" == "DELETE" ]]; then
    task_id=$(globus rm --recursive "$SOURCE_EP" |& awk '{print $6}' | sed -e "s/\"//g")
    globus task wait "$task_id"
fi

# Done successfully
source deactivate
globusconnect -stop
exit 0
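The task ID in transferData.sh is scraped from the CLI's human-readable output: globus transfer ends with a line of the form "Task ID: <uuid>", so tail -1 | awk '{print $3}' yields the UUID. A sketch of that parsing step using a canned output sample (the UUID below is invented):

```shell
# Simulated 'globus transfer' output; a real run prints something like this
GLOBUS_OUTPUT="Message: The transfer has been accepted and a task has been created and queued for execution
Task ID: 0fb7f219-aaaa-bbbb-cccc-000000000000"

# Last line, third field -> the task UUID
TASK_ID=$(echo "$GLOBUS_OUTPUT" | tail -1 | awk '{print $3}')
echo "$TASK_ID"
```

If your Globus CLI version supports the --jmespath and --format unix output options, requesting the task_id field directly is more robust than screen-scraping.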
Computing Script
computing.sh
#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# cd to the working directory
cd ${CYPRESS_WORK_DIR}
pwd

# module load ...
# ... computing something
touch RES
sleep 5

# Done
exit 0
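The touch RES / sleep 5 body in computing.sh is only a placeholder. A real job would load its modules and launch the actual program, typically matching the program's thread count to the --cpus-per-task allocation, which SLURM exposes as SLURM_CPUS_PER_TASK. A minimal sketch (myprog and its options are hypothetical, not part of the original pipeline):

```shell
# SLURM exports SLURM_CPUS_PER_TASK inside a batch job;
# fall back to 1 when running outside SLURM
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# module load myprog                                            # hypothetical module
# ./myprog --input data.in --threads "$OMP_NUM_THREADS" > RES   # hypothetical program
echo "Running with $OMP_NUM_THREADS thread(s)"
```

Writing the output into CYPRESS_WORK_DIR (as the placeholder does with RES) is what lets the final UPLOAD job pick up the results.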
How to submit a job
sh ./submitJob.sh