Installing and Setting Up WGSA in a Local Directory on Cypress
These instructions are based on this page and adapted for Cypress.
Choose a directory dedicated to the pipeline, for example '/lustre/project/group/WGSA'.
Set an environment variable and create the workspaces:
export WGSA_DIR=/lustre/project/group/WGSA
mkdir $WGSA_DIR
cd $WGSA_DIR
mkdir work
mkdir tmp
chmod 777 work
chmod 777 tmp
Create a directory for ANNOVAR:
mkdir $WGSA_DIR/annovar2019Oct24
Download the ANNOVAR main package from here. The package comes as annovar.latest.tar.gz. Save it to $WGSA_DIR/annovar2019Oct24 and extract it:
cd $WGSA_DIR/annovar2019Oct24
tar -zxvf annovar.latest.tar.gz
Download the RefSeq (refGene), Ensembl (ensGene), and UCSC (knownGene) gene models for ANNOVAR, for both hg19 and hg38:
cd $WGSA_DIR/annovar2019Oct24/annovar
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar knownGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar ensGene humandb/
perl annotate_variation.pl -buildver hg38 -downdb -webfrom annovar knownGene humandb/
Install SnpEff (required for annotating indels with SnpEff or annotating SNVs with SnpEff on-the-fly)
Download the SnpEff v4.3t main package and save the zip file to $WGSA_DIR/snpeff:
mkdir $WGSA_DIR/snpeff
cd $WGSA_DIR/snpeff
wget http://sourceforge.net/projects/snpeff/files/snpEff_v4_3t_core.zip
unzip snpEff_v4_3t_core.zip
To use a newer version of the Java SDK, you have to log in to a computing node.
Start an interactive session:
idev -c 1 -t 4
The download steps below will take more than one hour. See here for more about 'idev'.
Once you are on a computing node, make sure your current directory is $WGSA_DIR/snpeff.
Download RefSeq and Ensembl gene models for SnpEff:
module load java-openjdk/1.8.0
cd snpEff
java -jar snpEff.jar download -v hg19
java -jar snpEff.jar download -v GRCh37.75
java -jar snpEff.jar download -v hg38
java -jar snpEff.jar download -v GRCh38.86
Exit from the computing node:
exit
Install htslib, which is required for the VEP API:
mkdir $WGSA_DIR/htslib
cd $WGSA_DIR/htslib
wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
tar -vxjf htslib-1.9.tar.bz2
cd htslib-1.9
make prefix=$WGSA_DIR/htslib install
Set up the environment variables:
export PATH=$WGSA_DIR/htslib/bin:$PATH
export CPATH=$WGSA_DIR/htslib/include:$CPATH
export LD_LIBRARY_PATH=$WGSA_DIR/htslib/lib:$LD_LIBRARY_PATH
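These exports only last for the current shell, so they are gone after a new login or a new 'idev' session. A minimal sketch of one way to keep them, assuming a helper file named wgsa_env.sh (a hypothetical name, not part of WGSA or htslib) that can be sourced whenever needed:

```shell
#!/bin/bash
# write_wgsa_env DIR: write the exports from this step to DIR/wgsa_env.sh and
# print the path of that file. The file name wgsa_env.sh is an assumption made
# for this sketch; it is not something WGSA itself looks for.
write_wgsa_env() {
    local wgsa_dir="$1"
    local env_file="$wgsa_dir/wgsa_env.sh"
    # ${VAR:-} keeps the generated file usable even when CPATH or
    # LD_LIBRARY_PATH is not set yet in the shell that sources it.
    cat > "$env_file" <<EOF
export WGSA_DIR="$wgsa_dir"
export PATH=\$WGSA_DIR/htslib/bin:\$PATH
export CPATH=\$WGSA_DIR/htslib/include:\${CPATH:-}
export LD_LIBRARY_PATH=\$WGSA_DIR/htslib/lib:\${LD_LIBRARY_PATH:-}
EOF
    echo "$env_file"
}
```

After running 'write_wgsa_env /lustre/project/group/WGSA' once, a later session can restore the variables with 'source /lustre/project/group/WGSA/wgsa_env.sh'.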
Install VEP (required for annotating indels with VEP or annotating SNVs with VEP on-the-fly)
Download the VEP 94 main package and save it to $WGSA_DIR/vep:
mkdir $WGSA_DIR/vep
cd $WGSA_DIR/vep
wget https://github.com/Ensembl/ensembl-vep/archive/release/94.zip
unzip 94.zip
Install the VEP API to $WGSA_DIR/vep and download the RefSeq and Ensembl gene models to $WGSA_DIR/.vep:
cd $WGSA_DIR/vep/ensembl-vep-release-94/
mkdir $WGSA_DIR/.vep
export DEST_DIR=$WGSA_DIR
export PERL5LIB=$WGSA_DIR
perl INSTALL.pl -c $WGSA_DIR/.vep --ASSEMBLY GRCh37
Go through the steps of the installation process, following the guidance at http://useast.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html. When asked for the cache files, choose "242 : homo_sapiens_merged_vep_94_GRCh37.tar.gz". When asked for the FASTA files, choose "27 : homo_sapiens". When asked for the plugins, choose "7:LOF". Downloading the FASTA file is required for the current version of WGSA.
*This takes a very long time…
perl INSTALL.pl -c $WGSA_DIR/.vep --ASSEMBLY GRCh38
When asked for the cache files, choose "243 : homo_sapiens_merged_vep_94_GRCh38.tar.gz". When asked for the FASTA files, choose "54: homo_sapiens". When asked for the plugins, choose "n", as LOF has already been installed.
*This takes a very long time…
Change the permissions for these directories…
chmod 777 $WGSA_DIR/.vep/Plugins
chmod 777 $WGSA_DIR/.vep/homo_sapiens/94_GRCh37
chmod 777 $WGSA_DIR/.vep/homo_sapiens/94_GRCh38
Install LOFTEE LOF plugin for VEP API
cd $WGSA_DIR/.vep/Plugins
wget https://github.com/konradjk/loftee/archive/v0.1.1-beta.zip
unzip -j v0.1.1-beta.zip
rm v0.1.1-beta.zip
Download the pipeline programs and other resources
cd $WGSA_DIR
wget http://web.corral.tacc.utexas.edu/WGSAdownload/WGSA085.class
mkdir $WGSA_DIR/resources
cd $WGSA_DIR/resources
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/javaclass/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/hg19/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/hg38/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/precomputed/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/SpliceAI/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GRASP/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/human_ancestor_GRCh37_e71/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/Neandertal/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GWAS_catalog/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GenoCanyon/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/clinvar/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
wget http://web.corral.tacc.utexas.edu/WGSAdownload/resources/GeneHancer/ --recursive --continue --timestamping --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
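The resource downloads above differ only in the folder name, so they can also be written as a loop. The following is a sketch, not part of the WGSA distribution: the function name fetch_resources and its dry-run argument are made up for illustration, while the URL, folder names, and wget options are the ones used above.

```shell
#!/bin/bash
# Base URL and resource folders, copied from the wget commands in this step.
BASE_URL=http://web.corral.tacc.utexas.edu/WGSAdownload/resources
RESOURCES="javaclass hg19 hg38 precomputed SpliceAI GRASP human_ancestor_GRCh37_e71 Neandertal GWAS_catalog GenoCanyon clinvar GeneHancer"

# fetch_resources [echo]: run one recursive wget per resource folder in the
# current directory. Pass "echo" as the first argument for a dry run that only
# prints the commands. (Hypothetical helper for this sketch.)
fetch_resources() {
    local run="${1:-}"
    local name
    for name in $RESOURCES; do
        $run wget "$BASE_URL/$name/" --recursive --continue --timestamping \
            --no-host-directories --cut-dirs=2 --no-parent --reject="index.html*"
    done
}
```

From $WGSA_DIR/resources, 'fetch_resources echo' lists the twelve commands without downloading anything, and 'fetch_resources' runs them. Because of '--continue' and '--timestamping', rerunning after an interrupted download should pick up where it stopped.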
Guidance for using external resources (COSMIC, SPIDEX, CADD indel, dbNSFP) can be found here.
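Since the installation spans many steps, a quick look at the resulting layout can catch a missed step before a long job is submitted. The sketch below only checks that the paths created in the steps above exist; the function name check_wgsa_install is hypothetical, and the check says nothing about whether the downloaded databases are complete.

```shell
#!/bin/bash
# check_wgsa_install DIR: print every expected path that is missing under DIR
# and return non-zero if at least one is missing. The list mirrors the paths
# created by the installation steps on this page.
check_wgsa_install() {
    local wgsa_dir="$1"
    local missing=0
    local rel
    for rel in \
        work tmp resources .vep \
        WGSA085.class \
        annovar2019Oct24/annovar/annotate_variation.pl \
        snpeff/snpEff/snpEff.jar \
        htslib/bin \
        vep/ensembl-vep-release-94
    do
        if [ ! -e "$wgsa_dir/$rel" ]; then
            echo "missing: $wgsa_dir/$rel"
            missing=1
        fi
    done
    return $missing
}
```

Run it as 'check_wgsa_install $WGSA_DIR'. No output and a zero exit status mean every listed path was found.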
Procedure to run on Cypress
1. Prepare input files
Two input files are needed: a variant file and a configuration/setting file. The standard variant file is a plain-text file with TAB-delimited columns (tsv format).
An example variant file, 'clinvar_subset.txt', can be downloaded here.
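Because the variant file must be TAB-delimited, a stray space-separated or truncated row is an easy mistake to make when preparing it by hand. The sketch below is a generic consistency check, not a WGSA tool: it assumes only that every non-empty line should have the same number of TAB-delimited columns as the first one, and the function name check_tsv_columns is made up for illustration.

```shell
#!/bin/bash
# check_tsv_columns FILE: succeed only if every non-empty line of FILE has the
# same number of TAB-delimited columns as the first non-empty line. Lines that
# differ are reported with their line number.
check_tsv_columns() {
    awk -F'\t' '
        BEGIN { ncol = 0; bad = 0 }
        NF == 0 { next }                 # skip empty lines
        ncol == 0 { ncol = NF }          # first non-empty line sets the width
        NF != ncol {
            printf "line %d: %d columns, expected %d\n", NR, NF, ncol
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, 'check_tsv_columns clinvar_subset.txt' prints nothing and returns zero when the column count is consistent. It does not check that the columns hold what WGSA expects.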
A setting/configuration file is a plain-text file in which the user provides the name of the input file, the name of the output file, the directories of the various resources, and the annotation options. Example template files can be found here.
To run the pipeline on a local machine, the directory settings (lines 3 to 9) must be modified to reflect the absolute paths of the corresponding directories on that machine.
input file name: clinvar_subset.txt #name of the input file
output file name: clinvar_subset.txt.annotated #name of the output file
resources dir: /lustre/project/hpcstaff/fuji/WGSA/resources/ #the location of the resources folder
annovar dir: /lustre/project/hpcstaff/fuji/WGSA/annovar2019Oct24/annovar/ #the location of the ANNOVAR annotate_variation.pl
snpeff dir: /lustre/project/hpcstaff/fuji/WGSA/snpeff/snpEff/ #the location of the snpEff snpEff.jar
vep dir: /lustre/project/hpcstaff/fuji/WGSA/vep/ensembl-vep-release-94/ #the location of the VEP variant_effect_predictor.pl
.vep dir: /lustre/project/hpcstaff/fuji/WGSA/.vep/ #the location of the .vep folder
tmp dir: /lustre/project/hpcstaff/fuji/WGSA/tmp/ #the location of the tmp folder, used for VEP on-the-fly annotation
work dir: /lustre/project/hpcstaff/fuji/WGSA/work/ #the location of the working folder, used for storing intermediate files
retain intermediate file: b #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/Ensembl: b #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/RefSeq: b #supported option: snp or s, indel or i, both or b, no or n
ANNOVAR/UCSC: b #supported option: snp or s, indel or i, both or b, no or n
In the example above, $WGSA_DIR='/lustre/project/hpcstaff/fuji/WGSA'.
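A wrong path in one of the directory settings is only discovered once the job is running, so it can be worth checking them first. The sketch below assumes the layout shown in the example, that is, lines of the form '<name> dir: <path> #comment' with no spaces inside the path; the function name check_setting_dirs is hypothetical and not part of WGSA.

```shell
#!/bin/bash
# check_setting_dirs FILE: for each "<name> dir: <path>" line in a setting
# file, report the path if it is not an existing directory. Returns non-zero
# if any directory is missing. Assumes paths contain no whitespace.
check_setting_dirs() {
    local setting_file="$1"
    local status=0
    local name path
    # Split each line at the first colon: the key goes to "name", the rest of
    # the line (path plus comment) goes to "path".
    while IFS=: read -r name path || [ -n "$name" ]; do
        case "$name" in
            *" dir") ;;        # only the directory settings are checked
            *) continue ;;
        esac
        path="${path%%#*}"                               # drop the comment
        path="$(printf '%s' "$path" | tr -d '[:space:]')" # drop whitespace
        if [ ! -d "$path" ]; then
            echo "not a directory: $name -> $path"
            status=1
        fi
    done < "$setting_file"
    return $status
}
```

Calling 'check_setting_dirs' with the name of the setting file prints one line per missing directory and returns zero when all of them exist.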
2. Upload input files
Upload the two input files to Cypress. You can place them in any directory. Here, we create a directory 'WGSA_TEST' under '/lustre/project/hpcstaff/fuji/':
mkdir /lustre/project/hpcstaff/fuji/WGSA_TEST
cd /lustre/project/hpcstaff/fuji/WGSA_TEST
See here for file transfer instructions.
3. Create the pipeline slurm job script
An example Slurm job script:
#!/bin/bash
#SBATCH --job-name=WGSA         # Job Name
#SBATCH --output=WGSA.out       # File in which to store job output
#SBATCH --error=WGSA.err        # File in which to store job error messages
#SBATCH --qos=normal            # Quality of Service (like a queue in PBS)
#SBATCH --time=0-10:00:00       # Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1               # Node count required for the job
#SBATCH --ntasks-per-node=1     # Number of tasks to be launched per Node
#SBATCH --cpus-per-task=20      # Number of cores per task
#SBATCH --mem=128000            # Max RAM request 128GByte

# Module load
module load java-openjdk/1.8.0

# Set the directory where WGSA is installed
export WGSA_DIR=/lustre/project/hpcstaff/fuji/WGSA

# Set 'setting/configuration file'
SETTING_FILE=test1000g-hg38-WGSA085.EC2.setting

# Setup
echo "Understand" | java -cp $WGSA_DIR WGSA085 $SETTING_FILE -m 128 -t 20 -v hg19

# Run job
sh ./${SETTING_FILE}.sh
Save it under a name, for example 'Slurmscript', in the same directory as the two input files.
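In the job script, the values of '--mem=128000' and '--cpus-per-task=20' are repeated by hand as '-m 128 -t 20'. The sketch below derives the two options from the environment variables that Slurm sets inside a job (SLURM_MEM_PER_NODE in MB and SLURM_CPUS_PER_TASK), so that the numbers cannot drift apart. It assumes that '-m' is the memory in GB and '-t' the number of threads, which is inferred from the matching values in the example and not confirmed here; the function name wgsa_resource_args is hypothetical.

```shell
#!/bin/bash
# wgsa_resource_args: print "-m <GB> -t <threads>" based on the Slurm job
# environment. Outside a job, or when --mem / --cpus-per-task were not given,
# it falls back to the values used in the example script (128000 MB, 20 cores).
# Assumption: WGSA's -m is memory in GB and -t is the thread count.
wgsa_resource_args() {
    local mem_mb="${SLURM_MEM_PER_NODE:-128000}"
    local cpus="${SLURM_CPUS_PER_TASK:-20}"
    echo "-m $((mem_mb / 1000)) -t $cpus"
}
```

Inside the job script, the setup line could then read 'echo "Understand" | java -cp $WGSA_DIR WGSA085 $SETTING_FILE $(wgsa_resource_args) -v hg19'.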
4. Run the pipeline job script
sbatch Slurmscript
See here for more about SLURM. The job takes about 7 hours to finish.