= Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs =

Here we consider a pipeline that performs the following steps.

 1. Transfer data files from Box to the Cypress Lustre directory.
 2. Perform computation using the data.
 3. Transfer the results to Box.
 4. Delete the files in Cypress Lustre.

To do this, the user must first '''log in to Globus with the command-line tool once''', see [https://wiki.hpc.tulane.edu/trac/wiki/cypress/Globus#FileTransferwithGlobusCommandlineTools here].

== Scripts ==

There are three scripts.

=== Job Submission Script ===

'''submitJob.sh'''
{{{
#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"

# Set paths
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"

# Submit a job to transfer data from Box to Cypress
JOB1=`sbatch --job-name=${JOB_NAME}_DL ./transferData.sh DOWNLOAD KEEP | awk '{print $4}'`; echo $JOB1 "Submitted"

# Submit a job to process the data on Cypress
JOB2=`sbatch --job-name=${JOB_NAME} --dependency=afterok:$JOB1 ./computing.sh | awk '{print $4}'`; echo $JOB2 "Submitted"

# Submit a job to transfer the results from Cypress to Box
JOB3=`sbatch --job-name=${JOB_NAME}_UL --dependency=afterok:$JOB2 ./transferData.sh UPLOAD DELETE | awk '{print $4}'`; echo $JOB3 "Submitted"
}}}

The user must set the following entries.
 * '''JOB_NAME''' is the job name.
 * '''BOX_DATA_DIR''' is the Box directory where the source data is stored.
 * '''CYPRESS_WORK_DIR''' is the Cypress directory where the downloaded data is stored.
 * '''BOX_RESULT_DIR''' is the Box directory to which the results are uploaded.

For each transfer, the user must also choose whether to keep or delete the source directory (see below).

Note that the directory on Cypress must be made writable to Globus via '''~/.globusonline/lta/config-paths'''.
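The dependency chain in '''submitJob.sh''' relies on capturing each job ID from the output of '''sbatch''', which prints a line of the form ''Submitted batch job <id>''. A minimal sketch of that parsing, using a simulated sbatch message (the job ID shown is hypothetical):

```shell
# sbatch prints "Submitted batch job <id>"; awk's fourth
# whitespace-separated field is therefore the numeric job ID.
msg="Submitted batch job 123456"            # simulated sbatch output
job_id=$(echo "$msg" | awk '{print $4}')
echo "$job_id"                              # the bare job ID
```

This is why each submission line pipes sbatch through `awk '{print $4}'`: the bare ID is what `--dependency=afterok:` expects for the next job in the chain.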
=== Data Transfer Script ===

'''transferData.sh'''
{{{
#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# Check options
if [ $# -ne 2 ]; then
    echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
    exit 1
fi

# Check paths
if [[ -z "${BOX_DATA_DIR}" ]]; then
    echo "ERROR! BOX_DATA_DIR isn't set."
    exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
    echo "ERROR! CYPRESS_WORK_DIR isn't set."
    exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
    echo "ERROR! BOX_RESULT_DIR isn't set."
    exit 1
fi

# Start Globus Connect
module load globusconnectpersonal/3.2.5
globusconnect -start &

# Set up the CLI environment
source activate globus-cli

# Obtain the local endpoint UUID
MY_UUID=$(globus endpoint local-id)
uuid_code=$?
if [ $uuid_code -ne 0 ]; then
    echo "ERROR! Globus Connect isn't activated."
    globusconnect -stop
    exit 1
fi

# Build the source and destination paths
if [[ "$1" == "DOWNLOAD" ]]; then
    SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
    DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
    SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
    DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi

# Check that the user is logged in to Globus
output=$(globus whoami >/dev/null 2>&1)
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR! Not logged in to Globus."
    globusconnect -stop
    exit 1
fi

# Start the transfer; the last line of the output carries the task ID.
# Note: $? after a command substitution with a pipeline reports awk's
# status, not the transfer's, so check that a task ID was captured.
task_id=$(globus transfer "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME" | tail -1 | awk '{print $3}')
if [ -z "$task_id" ]; then
    echo "ERROR! The transfer could not be started."
    globusconnect -stop
    exit 1
fi

# Wait until the task is done
output=$(globus task wait $task_id)
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR! The transfer of data failed."
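The task ID that the script passes to `globus task wait` is scraped from the transfer command's output, whose final line has the form ''Task ID: <uuid>''. A minimal sketch of that extraction, using simulated `globus transfer` output (the UUID shown is hypothetical):

```shell
# Simulated `globus transfer` output: the final line carries the task ID.
out="Message: The transfer has been accepted and a task has been created
Task ID: 2f1b3c4d-aaaa-bbbb-cccc-1234567890ab"

# tail -1 keeps only the last line; awk's third field is the UUID itself.
task_id=$(echo "$out" | tail -1 | awk '{print $3}')
echo "$task_id"
```

If the transfer command fails and prints nothing usable, the extracted ID is empty, which is why the script treats an empty `$task_id` as a startup failure.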
    globus task cancel $task_id
    globusconnect -stop
    exit 1
fi

# Check whether the delete option is set
if [[ "$2" == "DELETE" ]]; then
    task_id=$(globus rm --recursive $SOURCE_EP |& awk '{print $6}' | sed -e "s/\"//g")
    globus task wait $task_id
fi

# Done successfully
source deactivate globus-cli
globusconnect -stop
exit 0
}}}

The user doesn't have to edit this file, but in '''submitJob.sh''' the user should set the keep-or-delete option for each transfer.

'''Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'''

The first parameter is '''[DOWNLOAD | UPLOAD]'''.
 * When '''DOWNLOAD''' is set, the script downloads data from Box to Cypress.
 * When '''UPLOAD''' is set, the script uploads data from Cypress to Box.

The second parameter is '''[KEEP | DELETE]'''.
 * When '''KEEP''' is set, the source data is kept.
 * When '''DELETE''' is set, the script deletes the source directory after a successful transfer. If '''DOWNLOAD''' is set, it deletes the files in Box; if '''UPLOAD''' is set, it deletes the files in Cypress.

=== Computing Script ===

'''computing.sh'''
{{{
#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# cd to the working directory
cd ${CYPRESS_WORK_DIR}
pwd

# module load ... and compute something
touch Results.txt
sleep 5

# Done
exit 0
}}}

This is the main computing script, which the user has to edit. Unlike the transfer jobs, it does not have to run on the CentOS 7 nodes.

== How to submit a job ==

'''submitJob.sh''', '''transferData.sh''', and '''computing.sh''' must be in the same directory on a login node. Then run:
{{{
sh ./submitJob.sh
}}}

This submits three jobs. The first job downloads data from Box. The second job depends on the first and performs the computing task. The third job depends on the second and uploads the results to Box.
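The `touch Results.txt; sleep 5` lines in '''computing.sh''' are only a placeholder. A hedged sketch of what a real replacement might look like, using the CPU count SLURM allocates (the analysis command itself is hypothetical and must be replaced by the user's own program):

```shell
# Inside computing.sh, replace the placeholder with the real computation.
# SLURM_CPUS_PER_TASK is set by SLURM inside a job; default to 1 so the
# script also works when tested outside a job allocation.
NCPUS="${SLURM_CPUS_PER_TASK:-1}"
echo "running analysis with $NCPUS threads" > Results.txt
# e.g. my_analysis --threads "$NCPUS" input.dat >> Results.txt   (hypothetical)
cat Results.txt
```

Whatever the computation writes into '''CYPRESS_WORK_DIR''' (here, `Results.txt`) is exactly what the third job uploads to '''BOX_RESULT_DIR''', so results should be written there rather than to the submission directory.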