
Version 1 (modified by fuji, 2 days ago)

Example Pipeline for Data Transfer Using Globus and Computation in Batch Jobs

This page describes a pipeline of dependent batch jobs that performs the following steps.

  1. Transfer data files from Box to the Cypress Lustre directory.
  2. Perform computation using the data.
  3. Transfer the results to Box.
  4. Delete files in Cypress Lustre.

Scripts

Job Submission Script

submitJob.sh

#!/bin/bash
#
# Pipeline for Data Transfer Using Globus and Computation
#
# Job name
JOB_NAME="COMPUTING1"

# Set path
export BOX_DATA_DIR="/Test/"
export CYPRESS_WORK_DIR="/lustre/project/group/userid/test/"
export BOX_RESULT_DIR="/Test_result/"
#
# Submit a job to transfer data from Box to Cypress
JOB1=$(sbatch --job-name="${JOB_NAME}_DL" ./transferData.sh DOWNLOAD KEEP | awk '{print $4}')
echo "$JOB1 Submitted"

# Submit a job to process data on Cypress (starts only if the download job succeeds)
JOB2=$(sbatch --job-name="${JOB_NAME}" --dependency=afterok:"$JOB1" ./computing.sh | awk '{print $4}')
echo "$JOB2 Submitted"

# Submit a job to transfer the results from Cypress to Box (starts only if the computing job succeeds),
# then delete the files on Cypress
JOB3=$(sbatch --job-name="${JOB_NAME}_UL" --dependency=afterok:"$JOB2" ./transferData.sh UPLOAD DELETE | awk '{print $4}')
echo "$JOB3 Submitted"

JOB_NAME is the base job name. The download and upload jobs append _DL and _UL to it.

BOX_DATA_DIR is the directory in Box where the source data is stored.

CYPRESS_WORK_DIR is the directory where the downloaded data is stored.

BOX_RESULT_DIR is the directory where results are uploaded in Box.
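
The job chain in submitJob.sh depends on reading the job ID out of the message that sbatch prints. The following is a minimal sketch of that step, assuming the message has the form "Submitted batch job <id>" (the same assumption the awk '{print $4}' in the script makes). parse_job_id is a hypothetical helper name, not part of the original scripts, and the demonstration parses a sample message rather than submitting a job.

```shell
#!/bin/sh
# Sketch only: parse_job_id is a hypothetical helper, not part of submitJob.sh.
# Assumption: sbatch prints "Submitted batch job <id>" on success.
parse_job_id() {
    # The job ID is the fourth whitespace-separated field of the message.
    _id=$(printf '%s\n' "$1" | awk '{print $4}')
    case "$_id" in
        # Empty or non-numeric: treat as a failed submission.
        ''|*[!0-9]*) return 1 ;;
        *) printf '%s\n' "$_id" ;;
    esac
}

# Demonstration with a sample message (nothing is submitted here).
parse_job_id "Submitted batch job 12345"   # prints 12345
```

In submitJob.sh this could be used as JOB1=$(parse_job_id "$(sbatch ...)") || exit 1, so that a failed submission stops the chain instead of passing an empty ID to --dependency. sbatch also has a --parsable option that prints only the job ID (followed by a semicolon and the cluster name when one is present), which would make the awk step unnecessary.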

Data Transfer Script

transferData.sh

#!/bin/bash
#SBATCH --partition=centos7
#SBATCH --qos=long
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

# Check options
if [ $# -ne 2 ]; then
    echo 'Usage: transferData.sh [DOWNLOAD | UPLOAD] [KEEP | DELETE]'
    exit 1
fi

# Check path
if [[ -z "${BOX_DATA_DIR}" ]]; then
    echo "ERROR!  BOX_DATA_DIR isn't set."
    exit 1
fi
if [[ -z "${CYPRESS_WORK_DIR}" ]]; then
    echo "ERROR!  CYPRESS_WORK_DIR isn't set."
    exit 1
fi
if [[ -z "${BOX_RESULT_DIR}" ]]; then
    echo "ERROR!  BOX_RESULT_DIR isn't set."
    exit 1
fi

# Start Globus Connect
module load globusconnectpersonal/3.2.5
globusconnect -start &

# Set up CLI environment
source activate globus-cli

# Obtain local UUID
MY_UUID=$(globus endpoint local-id)
uuid_code=$?
if [ $uuid_code -ne 0 ]; then
    echo "ERROR!  Globus Connect isn't activated."
    globusconnect -stop
    exit 1
fi

# Make the source and destination path
if [[ "$1" == "DOWNLOAD" ]]; then
    SOURCE_EP=$TULANE_BOX:$BOX_DATA_DIR
    DEST_EP=$MY_UUID:$CYPRESS_WORK_DIR
else
    # Upload the results to the Box result directory
    SOURCE_EP=$MY_UUID:$CYPRESS_WORK_DIR
    DEST_EP=$TULANE_BOX:$BOX_RESULT_DIR
fi

# Check that we are logged in to Globus
globus whoami >/dev/null 2>&1
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR!  Not logged in to Globus"
    globusconnect -stop
    exit 1
fi

# Start the transfer. The exit status of the pipeline is that of awk, not of
# "globus transfer", so check that a task ID was actually returned.
task_id=$(globus transfer "$SOURCE_EP" "$DEST_EP" --label "$SLURM_JOB_NAME" | tail -1 | awk '{print $3}')
if [ -z "$task_id" ]; then
    echo "ERROR!  The data transfer could not be started."
    globusconnect -stop
    exit 1
fi

# Wait until the task is done.
output=$(globus task wait "$task_id")
output_code=$?
if [ $output_code -ne 0 ]; then
    echo "ERROR!  The data transfer failed."
    globus task cancel "$task_id"
    globusconnect -stop
    exit 1
fi

# If the delete option is set, delete the source files now that the transfer is complete
if [[ "$2" == "DELETE" ]]; then
    task_id=$(globus rm --recursive "$SOURCE_EP" |& awk '{print $6}' | sed -e "s/\"//g")
    globus task wait "$task_id"
fi

# done successfully
source deactivate globus-cli
globusconnect -stop
exit 0
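
transferData.sh picks the task ID out of the Globus CLI output by field position (third field of the last line for the transfer, sixth field for the delete). A position-independent alternative is to search the output for a UUID-shaped token. The sketch below assumes the task ID is the first such token in the output. extract_task_id is a hypothetical helper name, and the two sample messages are assumed formats inferred from the field positions used in the script, not captured CLI output.

```shell
#!/bin/sh
# Sketch only: extract_task_id is a hypothetical helper, not part of transferData.sh.
# It reads command output on stdin and prints the first UUID-shaped token.
extract_task_id() {
    _task=$(grep -Eo '[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}' | head -n 1)
    # Return non-zero when no task ID was found, so callers can detect failure.
    [ -n "$_task" ] && printf '%s\n' "$_task"
}

# Demonstration with sample messages (assumed formats, made-up ID).
printf 'Message: The transfer has been accepted\nTask ID: 01234567-89ab-cdef-0123-456789abcdef\n' | extract_task_id
printf 'Delete task submitted under ID "01234567-89ab-cdef-0123-456789abcdef"\n' | extract_task_id
```

Both demonstration lines print the same ID. In the script, the helper could replace the tail, awk, and sed steps, for example task_id=$(globus transfer ... | extract_task_id) || exit 1, and the same call would work for the globus rm output.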

Computing Script

computing.sh

#!/bin/bash
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20

# cd to working directory
cd "${CYPRESS_WORK_DIR}" || exit 1
pwd

# Load the required modules and run the computation here.
# The two commands below are placeholders that only create a result file.
touch RES
sleep 5

#done
exit 0
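
Because the upload job deletes the contents of CYPRESS_WORK_DIR, it may be worth confirming in computing.sh that the variable is set and points to an existing directory before writing results. A minimal sketch, where enter_work_dir is a hypothetical helper name and the error messages follow the style of transferData.sh:

```shell
#!/bin/sh
# Sketch only: enter_work_dir is a hypothetical helper, not part of computing.sh.
enter_work_dir() {
    if [ -z "$1" ]; then
        echo "ERROR!  CYPRESS_WORK_DIR isn't set." >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "ERROR!  $1 isn't a directory." >&2
        return 1
    fi
    # Change into the working directory; fail if cd itself fails.
    cd "$1" || return 1
}

# Intended use inside the job script:
#   enter_work_dir "${CYPRESS_WORK_DIR}" || exit 1
```

Exiting with a non-zero status here matters for the pipeline: the upload job is submitted with --dependency=afterok, so it starts only if the computing job finishes successfully.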

How to submit a job

sh ./submitJob.sh