= Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs =

Here we consider a pipeline that performs the following steps.

 1. Transfer data files from Box to the Cypress Lustre directory.
 2. Perform computation using the data.
 3. Transfer the results to Box.
 4. Delete the files in Cypress Lustre.

To do this, the user must first '''log in to Globus with the command-line tool once''', see [https://wiki.hpc.tulane.edu/trac/wiki/cypress/Globus#FileTransferwithGlobusCommandlineTools here].

== Scripts ==

There are three scripts.

=== Job Submission Script ===

'''submitJob.sh'''
{{{
#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"

# Set paths
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"

# Submit a job to transfer data from Box to Cypress
JOB1=`sbatch --job-name=${JOB_NAME}_DL ./transferData.sh DOWNLOAD KEEP | awk '{print $4}'`; echo $JOB1 "Submitted"

# Submit a job to process the data on Cypress
JOB2=`sbatch --job-name=${JOB_NAME} --dependency=afterok:$JOB1 ./computing.sh | awk '{print $4}'`; echo $JOB2 "Submitted"

# Submit a job to transfer the results from Cypress to Box
JOB3=`sbatch --job-name=${JOB_NAME}_UL --dependency=afterok:$JOB2 ./transferData.sh UPLOAD DELETE | awk '{print $4}'`; echo $JOB3 "Submitted"
}}}

The user must set the following entries.
 * '''JOB_NAME''' is the job name.
 * '''BOX_DATA_DIR''' is the Box directory where the source data is stored.
 * '''CYPRESS_WORK_DIR''' is the Cypress directory where the downloaded data is stored.
 * '''BOX_RESULT_DIR''' is the Box directory to which the results are uploaded.

For each transfer, the user must also choose whether to keep or delete the source directory (see below).

Note that the directory on Cypress must be made writable to Globus via '''~/.globusonline/lta/config-paths'''.
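The dependency chain in '''submitJob.sh''' relies on capturing each job ID from the output of '''sbatch''', which prints a line of the form ''Submitted batch job <id>''. A minimal sketch of that parsing, using a simulated sbatch message (the job ID shown is hypothetical):

```shell
# sbatch prints "Submitted batch job <id>"; awk's fourth
# whitespace-separated field is therefore the numeric job ID.
msg="Submitted batch job 123456"            # simulated sbatch output
job_id=$(echo "$msg" | awk '{print $4}')
echo "$job_id"                              # the bare job ID
```

This is why each submission line pipes sbatch through `awk '{print $4}'`: the bare ID is what `--dependency=afterok:` expects for the next job in the chain.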
=== Data Transfer Script ===

'''transferData.sh'''
{{{
#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# Check options
if [ $# -ne 2 ]; then
    echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
    exit 1
fi

# Check paths
if [[ -z "${BOX_DATA_DIR}" ]]; then
    echo "ERROR! BOX_DATA_DIR isn't set."
    exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
    echo "ERROR! CYPRESS_WORK_DIR isn't set."
    exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
    echo "ERROR! BOX_RESULT_DIR isn't set."
    exit 1
fi

# Start Globus Connect
module load globusconnectpersonal/3.2.5
globusconnect -start &

# Set up the CLI environment
source activate globus-cli

# Obtain the local endpoint UUID
MY_UUID=$(globus endpoint local-id)
uuid_code=$?
if [ $uuid_code -ne 0 ]; then
    echo "ERROR! Globus Connect isn't activated."
    globusconnect -stop
    exit 1
fi

# Build the source and destination paths
if [[ "$1" == "DOWNLOAD" ]]; then
    SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
    DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
    SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
    DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi

# Check that the user is logged in to Globus
output=$(globus whoami >/dev/null 2>&1)
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR! Not logged in to Globus."
    globusconnect -stop
    exit 1
fi

# Start the transfer; the last line of the output carries the task ID.
# Note: $? after a command substitution with a pipeline reports awk's
# status, not the transfer's, so check that a task ID was captured.
task_id=$(globus transfer "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME" | tail -1 | awk '{print $3}')
if [ -z "$task_id" ]; then
    echo "ERROR! The transfer could not be started."
    globusconnect -stop
    exit 1
fi

# Wait until the task is done
output=$(globus task wait $task_id)
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR! The transfer of data failed."
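The task ID that the script passes to `globus task wait` is scraped from the transfer command's output, whose final line has the form ''Task ID: <uuid>''. A minimal sketch of that extraction, using simulated `globus transfer` output (the UUID shown is hypothetical):

```shell
# Simulated `globus transfer` output: the final line carries the task ID.
out="Message: The transfer has been accepted and a task has been created
Task ID: 2f1b3c4d-aaaa-bbbb-cccc-1234567890ab"

# tail -1 keeps only the last line; awk's third field is the UUID itself.
task_id=$(echo "$out" | tail -1 | awk '{print $3}')
echo "$task_id"
```

If the transfer command fails and prints nothing usable, the extracted ID is empty, which is why the script treats an empty `$task_id` as a startup failure.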
    globus task cancel $task_id
    globusconnect -stop
    exit 1
fi

# Check whether the delete option is set
if [[ "$2" == "DELETE" ]]; then
    task_id=$(globus rm --recursive $SOURCE_EP |& awk '{print $6}' | sed -e "s/\"//g")
    globus task wait $task_id
fi

# Done successfully
source deactivate globus-cli
globusconnect -stop
exit 0
}}}

The user doesn't have to edit this file, but in '''submitJob.sh''' the user should set the keep-or-delete option for each transfer.

'''Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'''

The first parameter is '''[DOWNLOAD | UPLOAD]'''.
 * When '''DOWNLOAD''' is set, the script downloads data from Box to Cypress.
 * When '''UPLOAD''' is set, the script uploads data from Cypress to Box.

The second parameter is '''[KEEP | DELETE]'''.
 * When '''KEEP''' is set, the source data is kept.
 * When '''DELETE''' is set, the script deletes the source directory after a successful transfer. If '''DOWNLOAD''' is set, it deletes the files in Box; if '''UPLOAD''' is set, it deletes the files in Cypress.

=== Computing Script ===

'''computing.sh'''
{{{
#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# cd to the working directory
cd ${CYPRESS_WORK_DIR}
pwd

# module load ... and compute something
touch Results.txt
sleep 5

# Done
exit 0
}}}

This is the main computing script, which the user has to edit. Unlike the transfer jobs, it does not have to run on the CentOS 7 nodes.

== How to submit a job ==

'''submitJob.sh''', '''transferData.sh''', and '''computing.sh''' must be in the same directory on a login node. Then run:
{{{
sh ./submitJob.sh
}}}

This submits three jobs. The first job downloads data from Box. The second job depends on the first and performs the computing task. The third job depends on the second and uploads the results to Box.
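The `touch Results.txt; sleep 5` lines in '''computing.sh''' are only a placeholder. A hedged sketch of what a real replacement might look like, using the CPU count SLURM allocates (the analysis command itself is hypothetical and must be replaced by the user's own program):

```shell
# Inside computing.sh, replace the placeholder with the real computation.
# SLURM_CPUS_PER_TASK is set by SLURM inside a job; default to 1 so the
# script also works when tested outside a job allocation.
NCPUS="${SLURM_CPUS_PER_TASK:-1}"
echo "running analysis with $NCPUS threads" > Results.txt
# e.g. my_analysis --threads "$NCPUS" input.dat >> Results.txt   (hypothetical)
cat Results.txt
```

Whatever the computation writes into '''CYPRESS_WORK_DIR''' (here, `Results.txt`) is exactly what the third job uploads to '''BOX_RESULT_DIR''', so results should be written there rather than to the submission directory.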