
Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs

Here we consider a pipeline that performs the following steps.

  1. Transfer data files from Box to the Cypress Lustre directory.
  2. Perform computation using the data.
  3. Transfer the results to Box.
  4. Delete files in Cypress Lustre.

To do this, the user must first log in to Globus with the command-line tool once; see here.
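As a reminder, the one-time login looks roughly like the following. The module and environment names match those used in transferData.sh below; the exact commands come from the Globus CLI and may differ slightly between versions.

```shell
module load globusconnectpersonal/3.2.5
source activate globus-cli
globus login --no-local-server   # prints a URL; open it and paste the code back
globus whoami                    # confirms that the login worked
```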

Scripts

There are three scripts.

Job Submission Script

submitJob.sh

#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"

# Set path
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"
#
# Submit a job to transfer data from Box to Cypress
JOB1=$(sbatch --job-name=${JOB_NAME}_DL ./transferData.sh DOWNLOAD KEEP | awk '{print $4}')
echo $JOB1 "Submitted"

# Submit a job to process data on Cypress
JOB2=$(sbatch --job-name=${JOB_NAME} --dependency=afterok:$JOB1 ./computing.sh | awk '{print $4}')
echo $JOB2 "Submitted"

# Submit a job to transfer data from Cypress to Box
JOB3=$(sbatch --job-name=${JOB_NAME}_UL --dependency=afterok:$JOB2 ./transferData.sh UPLOAD DELETE | awk '{print $4}')
echo $JOB3 "Submitted"

The user must set the following variables.

  • JOB_NAME is the base name of the jobs; the transfer jobs get the suffixes _DL and _UL.
  • BOX_DATA_DIR is the directory in Box where the source data is stored.
  • CYPRESS_WORK_DIR is the Lustre directory where the downloaded data is stored and where the computation runs.
  • BOX_RESULT_DIR is the directory in Box where the results are uploaded.

For each transfer, the user must set whether to keep or delete the source directory (see below).
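The `awk '{print $4}'` in submitJob.sh relies on the standard output format of sbatch. A minimal sketch of what it extracts (the job ID 123456 is just an example):

```shell
# sbatch prints "Submitted batch job <id>" on success;
# the job ID is the fourth whitespace-separated field.
out="Submitted batch job 123456"         # example sbatch output
job_id=$(echo "$out" | awk '{print $4}')
echo "$job_id"                           # -> 123456
```

This ID is then passed to --dependency=afterok: for the next job in the chain.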

Note that the directory on Cypress must be made writable in ~/.globusonline/lta/config-paths.
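A minimal example of such an entry in ~/.globusonline/lta/config-paths is shown below. The format (path, sharing flag, writable flag) follows Globus Connect Personal's documented convention; verify against your installed version.

```
/lustre/project/group/userid/test/,0,1
```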

Data Transfer Script

transferData.sh

#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# Check options
if [ $# -ne 2 ]; then
    echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
    exit 1
fi

# Check path
if [[ -z "${BOX_DATA_DIR}" ]]; then
    echo "ERROR!  BOX_DATA_DIR isn't set."
    exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
    echo "ERROR!  CYPRESS_WORK_DIR isn't set."
    exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
    echo "ERROR!  BOX_RESULT_DIR isn't set."
    exit 1
fi

# Start Globus Connect
module load globusconnectpersonal/3.2.5
globusconnect -start &

# Set up CLI environment
source activate globus-cli

# Obtain local UUID
MY_UUID=$(globus endpoint local-id)
uuid_code=$?
if [ $uuid_code -ne 0 ]; then
    echo "ERROR!  Globus Connect isn't activated."
    globusconnect -stop
    exit 1
fi

# Make the source and destination path
if [[ "$1" == "DOWNLOAD" ]]; then
    SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
    DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
    SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
    DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi

# Check that the user is logged in to Globus
output=$(globus whoami >/dev/null 2>&1)
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR!  Not logged in to Globus"
    globusconnect -stop
    exit 1
fi

# Start the transfer (--recursive is needed because the paths are directories)
task_id=$(globus transfer --recursive "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME" | tail -1 | awk '{print $3}')
if [ -z "$task_id" ]; then
    echo "ERROR!  The data transfer could not be started."
    globusconnect -stop
    exit 1
fi

# Wait until the task is done
globus task wait $task_id
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR!  The data transfer failed."
    globus task cancel $task_id
    globusconnect -stop
    exit 1
fi

# Check if the delete option is set
if [[ "$2" == "DELETE" ]]; then
    task_id=$(globus rm --recursive $SOURCE_EP |& awk '{print $6}' | sed -e "s/\"//g")
    globus task wait $task_id
fi

# done successfully
source deactivate globus-cli
globusconnect -stop
exit 0

The user doesn't have to edit this file, but must choose the KEEP or DELETE option for each transfer in submitJob.sh. The script also assumes that the environment variable TULANE_BOX holds the UUID of the Box endpoint.

Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]

The first parameter is [DOWNLOAD | UPLOAD]:

  • When DOWNLOAD is set, it downloads data from Box to Cypress.
  • When UPLOAD is set, it uploads data from Cypress to Box.

The second parameter is [KEEP | DELETE]:

  • When KEEP is set, the source data is kept.
  • When DELETE is set, the source directory is deleted after the transfer. With DOWNLOAD, the files in Box are deleted; with UPLOAD, the files in Cypress are deleted.
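Inside transferData.sh, the transfer task ID is scraped from the CLI's human-readable output. A sketch of the parsing step (the message text and UUID below are made-up examples):

```shell
# "globus transfer" ends its output with a line "Task ID: <uuid>";
# tail -1 keeps that last line, and awk '{print $3}' keeps the UUID,
# which is the third whitespace-separated field.
out=$'Message: The transfer has been accepted\nTask ID: 313ba666-0000-0000-0000-000000000000'
task_id=$(echo "$out" | tail -1 | awk '{print $3}')
echo "$task_id"   # -> 313ba666-0000-0000-0000-000000000000
```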

Computing Script

computing.sh

#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# cd to working directory
cd ${CYPRESS_WORK_DIR}
pwd

# module load ... computing something
touch Results.txt
sleep 5

#done
exit 0

This is the main computing script, which the user has to edit for their own workload. Unlike the transfer jobs, it does not have to run on the CentOS 7 (centos7) partition.

How to submit a job

On a login node, place submitJob.sh, transferData.sh, and computing.sh in the same directory, then run:

sh ./submitJob.sh

This submits three jobs. The first job downloads data from Box. The second job depends on the first job and performs the computing task. The third job depends on the second job and uploads the results to Box.
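To watch the chain while it runs, squeue can show each job's state and remaining dependency. The output format string below is just an example; %E is squeue's dependency field.

```shell
squeue -u $USER -o "%.10i %.20j %.10T %.20E"
```

Until its predecessor finishes, a dependent job sits in the PENDING state with reason (Dependency).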

Last modified on 04/04/25 07:59:48