Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs
Here we consider a pipeline that performs the following steps; the job-chaining pattern used to tie them together is sketched right after the list.
- Transfer data files from Box to the Cypress Lustre directory.
- Perform computation using the data.
- Transfer the results to Box.
- Delete files in Cypress Lustre.
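The steps are implemented as three separate Slurm jobs chained with --dependency=afterok, so each job starts only if the previous one finished successfully. A minimal sketch of that pattern, with hypothetical script names step1.sh and step2.sh, looks like this:
# Submit the first job and capture its job ID (--parsable prints only the ID)
JOB1=$(sbatch --parsable ./step1.sh)
# The second job stays pending until JOB1 finishes with exit code 0
sbatch --dependency=afterok:$JOB1 ./step2.sh
The scripts below use the same idea, except that they extract the job ID from the normal sbatch output with awk.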
Scripts
Job Submission Script
submitJob.sh
#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"
# Set path
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"
#
# Submit a job to transfer data from Box to Cypress
JOB1=`sbatch --job-name=${JOB_NAME}_DL ./transferData.sh DOWNLOAD KEEP | awk '{print $4}'`;
echo $JOB1 "Submitted"
# Submit a job to process data on Cypress
JOB2=`sbatch --job-name=${JOB_NAME} --dependency=afterok:$JOB1 ./computing.sh | awk '{print $4}'`;
echo $JOB2 "Submitted"
# Submit a job to transfer data from Cypress to Box
JOB3=`sbatch --job-name=${JOB_NAME}_UL --dependency=afterok:$JOB2 ./transferData.sh UPLOAD DELETE | awk '{print $4}'`;
echo $JOB3 "Submitted"
JOB_NAME is the base name used for the three jobs.
BOX_DATA_DIR is the Box directory where the source data is stored.
CYPRESS_WORK_DIR is the Cypress Lustre directory where the downloaded data is stored and the computation runs.
BOX_RESULT_DIR is the Box directory to which the results are uploaded.
In addition, transferData.sh expects the environment variable TULANE_BOX to hold the UUID of the Box Globus endpoint.
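Before submitting, you can sanity-check the Box paths from a login node with globus ls; this assumes you are already logged in to Globus and that TULANE_BOX is set (a one-time setup sketch follows transferData.sh below).
source activate globus-cli
globus ls "$TULANE_BOX:$BOX_DATA_DIR"      # the source data should be listed here
globus ls "$TULANE_BOX:$BOX_RESULT_DIR"    # the results directory should already exist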
Data Transfer Script
transferData.sh
#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
# Check options
if [ $# -ne 2 ]; then
echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
exit 1
fi
# Check path
if [[ -z "${BOX_DATA_DIR}" ]]; then
echo "ERROR! BOX_DATA_DIR isn't set."
exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
echo "ERROR! CYPRESS_WORK_DIR isn't set."
exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
echo "ERROR! BOX_RESULT_DIR isn't set."
exit 1
fi
# Start Globus Connect
module load globusconnectpersonal/3.2.5
globusconnect -start &
# Set up CLI environment
source activate globus-cli
# Obtain local UUID
MY_UUID=$(globus endpoint local-id)
uuid_code=$?
if [ $uuid_code -ne 0 ]; then
echo "ERROR! Globus Connect isn't activated."
globusconnect -stop
exit 1
fi
# Build the source and destination endpoint paths
if [[ "$1" == "DOWNLOAD" ]]; then
SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi
# Check that we are logged in to Globus
globus whoami >/dev/null 2>&1
output_code=$?
if [ $output_code -ne 0 ]; then
echo "ERROR! Not logged in to Globus"
globusconnect -stop
exit 1
fi
# Submit the transfer; --recursive is needed because the paths are directories
transfer_output=$(globus transfer --recursive "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME")
output_code=$?
task_id=$(echo "$transfer_output" | tail -1 | awk '{print $3}')
if [ $output_code -ne 0 ]; then
echo "ERROR! The data transfer could not be started."
globusconnect -stop
exit 1
fi
# Wait until the task is done
output=$(globus task wait $task_id)
output_code=$?
if [ $output_code -ne 0 ]; then
echo "ERROR! The transfer of data was failed."
globus task cancel $task_id
globusconnect -stop
exit 1
fi
# Check if the delete option is set
if [[ "$2" == "DELETE" ]]; then
task_id=$(globus rm --recursive $SOURCE_EP |& awk '{print $6}' | sed -e "s/\"//g")
globus task wait $task_id
fi
# done successfully
source deactivate globus-cli
globusconnect -stop
exit 0
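transferData.sh assumes that Globus Connect Personal is already set up on Cypress, that you are logged in to Globus, and that the environment variable TULANE_BOX holds the UUID of the Box endpoint. A sketch of the one-time interactive setup on a login node follows; the search text is only an example, and the UUID placeholder must be replaced with the real value.
module load globusconnectpersonal/3.2.5
source activate globus-cli
# Log in to Globus (prints a URL to open in a browser)
globus login
# Find the UUID of the Box endpoint/collection (the search text is an example)
globus endpoint search "Tulane Box"
# transferData.sh reads the UUID from TULANE_BOX, e.g. set it in ~/.bashrc
export TULANE_BOX="<uuid-from-the-search-above>"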
Computing Script
computing.sh
#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
# cd to working directory
cd ${CYPRESS_WORK_DIR} || exit 1
pwd
# module load ... and run the actual computation here (the lines below are a placeholder)
touch RES
sleep 5
#done
exit 0
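The touch and sleep lines in computing.sh are only a placeholder. The body of a real computing step might look like the sketch below; myprog, input.dat, results.dat, and the module name are hypothetical and should be replaced with your own application.
cd ${CYPRESS_WORK_DIR} || exit 1
# Load the software the computation needs (the module name is only an example)
module load anaconda3
# Use the CPUs requested in the #SBATCH header
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Results written to CYPRESS_WORK_DIR are uploaded to Box (and then deleted) by the next job
./myprog --input input.dat --output results.dat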
How to submit a job
sh ./submitJob.sh
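The three jobs can then be monitored with the usual Slurm tools; the dependent jobs stay pending until their predecessor completes successfully.
# List the three jobs of the pipeline (the names follow JOB_NAME in submitJob.sh)
squeue -u $USER --name=COMPUTING1_DL,COMPUTING1,COMPUTING1_UL
# Each job writes its output to the default Slurm log file slurm-<jobid>.out
ls slurm-*.out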