
Version 12 (modified by fuji, 4 years ago)

Installing and Setting Up WGSA in a Local Directory on Cypress

These instructions are based on this page and adapted for Cypress.

Choose a folder dedicated to the pipeline, for example '/lustre/project/group/WGSA'.

Set an environment variable and create the workspaces as follows:

export WGSA_DIR=/lustre/project/group/WGSA
mkdir $WGSA_DIR
cd $WGSA_DIR
mkdir work
mkdir tmp
chmod 777 work
chmod 777 tmp
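
The workspace commands above can be wrapped in a small function so that they are safe to rerun. This is a minimal sketch and not part of the original instructions: the function name setup_wgsa_workspace is hypothetical, and mkdir -p is used so that directories that already exist do not cause an error.

```shell
# Hypothetical helper: create the WGSA workspace under the given directory.
# Safe to rerun because mkdir -p does not fail on existing directories.
setup_wgsa_workspace() {
    wgsa_dir=$1
    mkdir -p "$wgsa_dir/work" "$wgsa_dir/tmp" || return 1
    # Same world-writable permissions as in the steps above.
    chmod 777 "$wgsa_dir/work" "$wgsa_dir/tmp"
}
```

It would be called as, for example, setup_wgsa_workspace "$WGSA_DIR" after the export above.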

Create a directory for ANNOVAR:

mkdir $WGSA_DIR/annovar2019Oct24

Download the ANNOVAR main package from here. The package comes as annovar.latest.tar.gz; save it to $WGSA_DIR/annovar2019Oct24 and extract it:

cd $WGSA_DIR/annovar2019Oct24
tar -zxvf annovar.latest.tar.gz

Download the RefSeq, Ensembl, and UCSC (knownGene) gene models for ANNOVAR:

cd $WGSA_DIR/annovar2019Oct24/annovar
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar knownGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/     
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar knownGene humandb/    

Install SnpEff (required for annotating indels with SnpEff or annotating SNVs with SnpEff on-the-fly). Download the SnpEff v4.3t main package and save the zip file to $WGSA_DIR/snpeff:

mkdir $WGSA_DIR/snpeff
cd $WGSA_DIR/snpeff
wget http://sourceforge.net/projects/snpeff/files/snpEff_v4_3t_core.zip
unzip snpEff_v4_3t_core.zip

To use a newer version of the Java SDK, you have to log in to a compute node.

Start an interactive session:

idev -c 1 -t 4

It will take more than one hour. See here for more about 'idev'.

Once you are on a compute node, make sure your current directory is $WGSA_DIR/snpeff.

Download RefSeq and Ensembl gene models for SnpEff:

module load java-openjdk/1.8.0
cd snpEff
java -jar snpEff.jar download -v hg19
java -jar snpEff.jar download -v GRCh37.75
java -jar snpEff.jar download -v hg38
java -jar snpEff.jar download -v GRCh38.86
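
Before leaving the compute node, it can be worth confirming that each database actually arrived. The following is a minimal sketch, assuming SnpEff stores each database as data/<genome>/snpEffectPredictor.bin under the snpEff directory (its default layout; adjust the path if data.dir in snpEff.config points elsewhere). The helper name check_snpeff_dbs is hypothetical.

```shell
# Hypothetical helper: report which of the four SnpEff databases are present.
# Assumes the default layout <snpEff dir>/data/<genome>/snpEffectPredictor.bin.
check_snpeff_dbs() {
    snpeff_dir=$1
    missing=0
    for genome in hg19 GRCh37.75 hg38 GRCh38.86; do
        if [ -f "$snpeff_dir/data/$genome/snpEffectPredictor.bin" ]; then
            echo "ok      $genome"
        else
            echo "missing $genome"
            missing=1
        fi
    done
    # Non-zero exit status if any database is missing.
    return "$missing"
}
```

From $WGSA_DIR/snpeff it would be run as check_snpeff_dbs snpEff; any genome reported as missing can be downloaded again with the corresponding java command.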

Exit from the compute node:

exit

Install htslib, which is required for the VEP API.

mkdir $WGSA_DIR/htslib
cd $WGSA_DIR/htslib
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -vxjf htslib-1.9.tar.bz2
cd htslib-1.9
make prefix=$WGSA_DIR/htslib install

Set up the environment variables:

export PATH=$WGSA_DIR/htslib/bin:$PATH
export CPATH=$WGSA_DIR/htslib/include:$CPATH
export LD_LIBRARY_PATH=$WGSA_DIR/htslib/lib:$LD_LIBRARY_PATH
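
These exports only last for the current shell session, and when CPATH or LD_LIBRARY_PATH is not already set, the form above leaves a trailing colon, which GCC and the dynamic loader treat as the current directory. A minimal sketch of a reusable alternative follows; the function name wgsa_htslib_env is hypothetical, and it takes the WGSA installation directory as its only argument.

```shell
# Hypothetical helper: prepend the local htslib install to the search paths.
# ${VAR:+:$VAR} expands to ":$VAR" only when VAR is set and non-empty,
# so no trailing colon is left behind.
wgsa_htslib_env() {
    htslib_dir=$1/htslib
    export PATH="$htslib_dir/bin${PATH:+:$PATH}"
    export CPATH="$htslib_dir/include${CPATH:+:$CPATH}"
    export LD_LIBRARY_PATH="$htslib_dir/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

Calling wgsa_htslib_env "$WGSA_DIR" in each new session (or from a file that you source) restores the same environment before running the VEP installer.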

Install VEP (required for annotating indels with VEP or annotating SNVs with VEP on-the-fly)

Download VEP 94 main package and save it to $WGSA_DIR/vep:

mkdir $WGSA_DIR/vep
cd $WGSA_DIR/vep
wget https://github.com/Ensembl/ensembl-vep/archive/release/94.zip
unzip 94.zip

Install the VEP API to $WGSA_DIR/vep and download the RefSeq and Ensembl gene models to $WGSA_DIR/.vep:

cd $WGSA_DIR/vep/ensembl-vep-release-94/
mkdir $WGSA_DIR/.vep
export DEST_DIR=$WGSA_DIR
export PERL5LIB=$WGSA_DIR
perl INSTALL.pl -c $WGSA_DIR/.vep --ASSEMBLY GRCh37

Go through the steps of the installation process, following the guidance at http://useast.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html. When asked for the cache files, choose “242 : homo_sapiens_merged_vep_94_GRCh37.tar.gz”. When asked for FASTA files, choose “27 : homo_sapiens”. When asked for the plugins, choose "7:LOF". Downloading the FASTA file is required for the current version of WGSA.

*This takes a very long time…

perl INSTALL.pl -c $WGSA_DIR/.vep --ASSEMBLY GRCh38

When asked for the cache files, choose "243 : homo_sapiens_merged_vep_94_GRCh38.tar.gz". When asked for FASTA files, choose “54: homo_sapiens”. When asked for the plugins, choose "n", as LOF has already been installed.

*This takes a very long time…

Change the permissions for these directories:

chmod 777 $WGSA_DIR/.vep/Plugins
chmod 777 $WGSA_DIR/.vep/homo_sapiens/94_GRCh37
chmod 777 $WGSA_DIR/.vep/homo_sapiens/94_GRCh38

Install the LOFTEE LOF plugin for the VEP API:

cd $WGSA_DIR/.vep/Plugins
wget https://github.com/konradjk/loftee/archive/v0.1.1-beta.zip
unzip -j v0.1.1-beta.zip
rm v0.1.1-beta.zip

Download the pipeline programs and other resources

cd $WGSA_DIR
wget http://web.corral.tacc.utexas.edu/WGSAdownload/WGSA085.class
mkdir $WGSA_DIR/resources
cd $WGSA_DIR/resources
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/javaclass/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/hg19/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/hg38/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*" 
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/precomputed/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/SpliceAI/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GRASP/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/human_ancestor_GRCh37_e71/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/Neandertal/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GWAS_catalog/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GenoCanyon/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/clinvar/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GeneHancer/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
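
The twelve wget commands above differ only in the resource name, so they can be generated by a loop. This is a minimal sketch under that observation: the function name wgsa_download_resources and the DRY_RUN variable are hypothetical, while the URLs and wget options are the ones listed above.

```shell
# Hypothetical helper: fetch every WGSA resource folder listed above.
# By default (DRY_RUN=1) it only prints the commands; note that the printed
# form drops the quotes around the --reject pattern.
wgsa_download_resources() {
    base=http://web.corral.tacc.utexas.edu/WGSAdownload/resources
    for name in javaclass hg19 hg38 precomputed SpliceAI GRASP \
                human_ancestor_GRCh37_e71 Neandertal GWAS_catalog \
                GenoCanyon clinvar GeneHancer; do
        set -- wget "$base/$name/" --recursive --continue --timestamping \
            --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$*"
        else
            # Stop at the first failed download so it can be retried.
            "$@" || return 1
        fi
    done
}
```

To download for real, run DRY_RUN=0 wgsa_download_resources from $WGSA_DIR/resources. Because of --continue and --timestamping, rerunning after an interruption should resume rather than start over.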

Guidance for using external resources (COSMIC, SPIDEX, CADD indel, dbNSFP) can be found here.

Procedure to run on Cypress

1. Prepare input files

Two input files are needed. One is a variant file and the other is a configuration/setting file. The standard variant file is a plain text format file with TAB-delimited columns (tsv format).
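
Because the variant file must be TAB-delimited, a quick consistency check before uploading can catch files saved with spaces or with ragged rows. The following is a minimal sketch: check_tsv is a hypothetical helper, and it only verifies that every line has the same number of TAB-separated fields as the first line (and at least two of them). It does not check WGSA's actual column requirements.

```shell
# Hypothetical helper: verify that a file is consistently TAB-delimited.
# Exits non-zero and lists the offending lines if it is not.
check_tsv() {
    awk -F '\t' '
        BEGIN   { bad = 0 }
        NR == 1 { n = NF
                  if (n < 2) { print "line 1: no TAB separator found"; bad = 1 } }
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad }
    ' "$1"
}
```

For example, check_tsv clinvar_subset.txt prints nothing and returns success when the file is well formed.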

An example variant file, 'clinvar_subset.txt', can be downloaded here.

A setting/configuration file is a plain text file in which the user provides the name of the input file, the name of the output file, the directories of the various resources, and the annotation options. Example template files can be found here.

To run the pipeline on a local machine, the directory settings (lines 3 to 9) must be modified to reflect the absolute paths to the corresponding directories on that machine.

input file name:                    clinvar_subset.txt                #name of the input file
output file name:                   clinvar_subset.txt.annotated             #name of the output file
resources dir:                      /lustre/project/hpcstaff/fuji/WGSA/resources/                                  #the location of the resources folder
annovar dir:                        /lustre/project/hpcstaff/fuji/WGSA/annovar2019Oct24/annovar/                    #the location of the ANNOVAR annotate_variation.pl
snpeff dir:                         /lustre/project/hpcstaff/fuji/WGSA/snpeff/snpEff/                              #the location of the snpEff snpEff.jar
vep dir:                            /lustre/project/hpcstaff/fuji/WGSA/vep/ensembl-vep-release-94/   #the location of the VEP variant_effect_predictor.pl
.vep dir:                           /lustre/project/hpcstaff/fuji/WGSA/.vep/                                       #the location of the .vep folder
tmp dir:                            /lustre/project/hpcstaff/fuji/WGSA/tmp/                                        #the location of the tmp folder, used for VEP on-the-fly annotation
work dir:                           /lustre/project/hpcstaff/fuji/WGSA/work/                                       #the location of the working folder, used for storing intermediate files
retain intermediate file:           b                            #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/Ensembl:                    b                            #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/RefSeq:                     b                            #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/UCSC:                       b                            #supported option: snp or s, indel or i, both or b, no or n

In the example above, $WGSA_DIR='/lustre/project/hpcstaff/fuji/WGSA'.
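
A wrong path in the directory settings is easy to miss until the job fails, so it can help to check them before submitting. The following is a minimal sketch, assuming the "key: value #comment" layout shown in the example above and paths without spaces. The helper name check_setting_dirs is hypothetical.

```shell
# Hypothetical helper: check that every "... dir:" entry in a WGSA setting
# file points at an existing directory. Assumes paths contain no spaces.
check_setting_dirs() {
    status=0
    # Keep lines whose key ends in " dir", strip the "#" comment, print the path.
    paths=$(sed -n 's/^[^:#]* dir:[[:space:]]*\([^#[:space:]]*\).*/\1/p' "$1")
    for p in $paths; do
        if [ -d "$p" ]; then
            echo "ok      $p"
        else
            echo "missing $p"
            status=1
        fi
    done
    return "$status"
}
```

Running check_setting_dirs on your setting file lists each configured directory and returns a non-zero status if any of them does not exist.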

2. Upload input files

Upload the two input files to Cypress. You can place them in any directory. Here, let's create a directory 'WGSA_TEST' under '/lustre/project/hpcstaff/fuji/':

mkdir /lustre/project/hpcstaff/fuji/WGSA_TEST
cd /lustre/project/hpcstaff/fuji/WGSA_TEST

See here for file transfer instructions.

3. Create the pipeline Slurm job script

An example Slurm job script:

#!/bin/bash
#SBATCH --job-name=WGSA       # Job Name
#SBATCH --output=WGSA.out     # File in which to store job output
#SBATCH --error=WGSA.err      # File in which to store job error messages
#SBATCH --qos=normal          # Quality of Service (like a queue in PBS)
#SBATCH --time=0-10:00:00     # Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1             # Node count required for the job
#SBATCH --ntasks-per-node=1   # Number of tasks to be launched per Node
#SBATCH --cpus-per-task=20    # Number of cores per task
#SBATCH --mem=128000          # Max RAM request 128GByte

# Module load
module load java-openjdk/1.8.0 

# Set the directory where WGSA is installed
export WGSA_DIR=/lustre/project/hpcstaff/fuji/WGSA

# Set 'setting/configuration file'
SETTING_FILE=test1000g-hg38-WGSA085.EC2.setting

# Setup
echo "Understand" | java -cp $WGSA_DIR WGSA085 $SETTING_FILE -m 128 -t 20 -v hg19

# Run job
sh ./${SETTING_FILE}.sh

Save it under a name, for example 'Slurmscript', in the same directory as the two input files.

4. Run the pipeline job script

sbatch Slurmscript

See here for more about SLURM. The job will take about 7 hours to finish.
