An end-to-end script utilizing the Funannotate fungal annotation pipeline to functionally annotate fungal genomes. Code was optimized to run on the NIH Biowulf cluster to annotate a Candida auris isolate's genome as part of my summer work in an NHGRI lab.
See funannotate documentation: https://github.com/nextgenusfs/funannotate
If you want to install using Docker (not recommended due to lack of GeneMark): https://funannotate.readthedocs.io/en/latest/install.html
Issues with Conda on HPC?: nextgenusfs/funannotate#423
GeneMark-ES error? nextgenusfs/funannotate#242
We first need to create private python environment & install miniconda installer (Just Once if Miniconda not installed)
#HPC info: https://hpc.nih.gov/apps/python.html#envs
#!/usr/bin/bash
cd /data/$USER
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
mkdir -p /scratch/$USER/temp
TMPDIR=/scratch/$USER/temp bash Miniconda3-latest-Linux-x86_64.sh -p /data/$USER/conda -b
rm Miniconda3-latest-Linux-x86_64.sh
#get readyy to intall with conda
source /data/$USER/conda/etc/profile.d/conda.sh
conda activate base
conda update conda
which conda # It should be $HOME/anaconda3/condabin/conda
# it's: /data/$USER/conda/bin/conda
#### define home and make a fungap directory -- probably should make this direrctory somewhere in _$PATH_ like bin
HOME=/home/$USER
mkdir $HOME/FunGAP
cd $HOME/FunGAP
#Install Mamba package manager (faster!)
conda install mamba -n base -c conda-forge
# Create FunGAP environment and install dependencies using Mamba
conda create -y -n fungap
conda activate fungap
mamba install \
braker2=2.1.5 trinity=2.12.0 repeatmodeler=2.0.1 hisat2=2.2.1 pfam_scan=1.6 busco=5.1.2 \
-c bioconda -c conda-forge
The config/ directory from AUGUSTUS can be accessed with the variable AUGUSTUS_CONFIG_PATH
.
BRAKER2 requires this directory to be in a writable location, so if that is not the case, copy this directory to a writable location, e.g.:
cp -r ~/annotation
export AUGUSTUS_CONFIG_PATH=~/annotation
Due to license and distribution restrictions, GeneMark, GenomeThreader and ProtHint should be additionally installed for BRAKER2 to fully work.
These packages can be either installed as part of the BRAKER2 environment, or the PATH variable should be configured to point to them.
The GeneMark key should be located in /home/$USER/.gm_key and GENEMARK_PATH
should include the path to the GeneMark executables.
pip install biopython bcbio-gff markdown2 matplotlib
cpanm YAML Hash::Merge Logger::Simple Parallel::ForkManager MCE::Mutex Thread::Queue threads
conda deactivate
conda create -y -n maker
conda activate maker
mamba install maker=3.01.03 -c bioconda -c conda-forge
Download FunGAP using GitHub clone. Suppose we are installing FunGAP in your $HOME
directory, but you are free to change the location. $FUNGAP_DIR
is going to be your FunGAP installation directory.
cd $HOME # or wherever you want
git clone https://github.com/CompSynBioLab-KoreaUniv/FunGAP.git
export FUNGAP_DIR=$(realpath FunGAP/)
# You can put this export command in the your .bashrc file
# so that you don't need to type every time you run the FunGAP ***************TO DO
mkdir -p $FUNGAP_DIR/db/pfam
cd $FUNGAP_DIR/db/pfam
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
gunzip Pfam-A.hmm.gz Pfam-A.hmm.dat.gz
conda activate fungap
hmmpress Pfam-A.hmm # HMMER package (would be automatically installed in the above Anaconda step)
Download: http://exon.gatech.edu/GeneMark/license_download.cgi
Pick: GeneMark-ES/ET/EP ver 4.65_lic LINUX 64
Transfer the software AND BE SURE TO GET THE KEY!!! to biowulf: $FUNGAP_DIR/external
/
mkdir $FUNGAP_DIR/external/
mv gmes_linux_64.tar.gz gm_key_64.gz $FUNGAP_DIR/external/ # Move your downloaded files to this directory
cd $FUNGAP_DIR/external/
tar -zxvf gmes_linux_64.tar.gz
#gunzip gm_key_64.gz #not sure why this was commented out, but I needed to run gunzip to use the next cp
cp gm_key_64 ~/.gm_key
#perl gmes_petap.pl to check proper installation (should show help menu)
GeneMark forces to use /usr/bin/perl
instead of conda-installed perl. You can change this by running change_path_in_perl_scripts.pl script
.
cd $FUNGAP_DIR/external/gmes_linux_64/
perl change_path_in_perl_scripts.pl "/usr/bin/env perl"
test path change by assessing shebang (#!) line
less $FUNGAP_DIR/external/gmes_linux_64/gmes_petap.pl
cd $FUNGAP_DIR/external/gmes_linux_64/
perl ./gmes_petap.pl #note you have to include perrl before the script name
conda activate fungap
cd $(dirname $(which RepeatMasker))/../share/RepeatMasker
# ./configure command will download required databases
echo -e "\n2\n$(dirname $(which rmblastn))\n\n5\n" > tmp && ./configure < tmp
It should look like this
ls $(dirname $(which RepeatMasker))/../share/RepeatMasker/Libraries
# Artefacts.embl Dfam.hmm RepeatAnnotationData.pm RepeatMasker.lib.nin RepeatPeps.lib RepeatPeps.lib.psq
# CONS-Dfam_3.0 README.meta RepeatMasker.lib RepeatMasker.lib.nsq RepeatPeps.lib.phr RepeatPeps.readme
# Dfam.embl RMRBMeta.embl RepeatMasker.lib.nhr RepeatMaskerLib.embl RepeatPeps.lib.pin taxonomy.dat
This script allows users to set and test (by --help command) all the dependencies. If this script runs without any issue, you are ready to run FunGAP!
cd $FUNGAP_DIR
conda activate maker
export MAKER_DIR=$(dirname $(which maker))
echo $MAKER_DIR # /home/ubuntu/anaconda3/envs/maker/bin
conda activate fungap
python ./set_dependencies.py \
--pfam_db_path db/pfam/ \
--genemark_path external/gmes_linux_64/ \
--maker_path ${MAKER_DIR}
If running on biowulf, recommended settings.
screen
sinteractive --mem=96g --cpus-per-task 8 --time 36:00:00
conda activate base
conda install -n base mamba
Then use mamba as drop in replacment & install funannotate
mamba create -n funannotate funannotate
#start up conda ENV
conda activate funannotate
mamba install \
funannotate \
-c bioconda -c conda-forge
echo "export FUNANNOTATE_DB=/your/path" > ~/funannotate.sh
echo "unset FUNANNOTATE_DB" > ~/funannotate.sh
echo $PATH
Make a copy of the gemes_linux executable in a directory in your path: note may want to change funGap to access this new target_dierctory
cp -r ~FunGAP/external/gmes_linux_64 /home/$USER/bin
export PATH=/home/$USER/bin/gmes_linux_64:$PATH
export GENEMARK_PATH=/home/$USER/bin/gmes_linux_64 #'funannotate check' runs when both of these paths are set
Issues with GeneMark in the pipeline were the msot prevelant. Common things to check include.
- all dependencies are downloaded: see
/your/path/gmes_linux_64/README.GeneMark-ES-suite
- perl path is set:
perl change_path_in_perl_scripts.pl "/usr/bin/env perl"
- run
check_install.bash
and/orperl gmes_petap.pl
to see if GM is operating
funannotate check --show-versions #signalp & eggmapper can be loaded from biowulf
funannotate setup -d $HOME/funannotate_db -f -w -l
export FUNANNOTATE_DB=/home/$USER/funannotate_db #location of databse: /home/$USER/funannotate_db
Run a test annotation
funannotate test -t all --cpus 8 #took me ~3 hrs
#!/usr/bin/bash
set -e
sinteractive --mem=96g --cpus-per-task 8 --time 36:00:00
source /data/$USER/conda/etc/profile.d/conda.sh
conda activate base
conda activate funannotate
mkdir /data/$USER/Files4RF
# directory I stored assembly data is in /data/$USER/Files4RF
Retrieved from shared lab storage
cp -r /data/$USER/data/caur_genomes/Caur_3166.Tw_acuzr.Nano.fasta /data/$USER/Files4RF
cd /data/$USER/Files4RF
export PATH=/home/$USER/bin/gmes_linux_64:$PATH
export GENEMARK_PATH=/home/$USER/bin/gmes_linux_64
export TRINITYHOME=/data/$USER/conda/envs/funannotate/opt/trinity-2.8.5
#re-run to get funannotate species
funannotate setup -d $HOME/funannotate_db -f -w -l
export FUNANNOTATE_DB=/home/$USER/funannotate_db
cd /data/$USER/Files4RF
Clean repetitive contigs. This step may not be necessary for all .fa files
funannotate clean -i Caur_3166.Tw_acuzr.Nano.fasta -o Caur_3166.Tw_acuzr.Nano_clean.fasta
funannotate sort -i Caur_3166.Tw_acuzr.Nano_clean.fasta -o Caur_3166.Tw_acuzr.Nano_clean_sort.fasta
funannotate mask -i Caur_3166.Tw_acuzr.Nano_clean_sort.fasta -o Caur_3166.Tw_acuzr.Nano_clean_sort_mask.fasta
Using in-house data
echo "Retrive & train RNA Seq data"
# If Novaseq files are stored in shared storage, pull & unzip files (example files pulled used below)
module load signalp
module load eggNOG-mapper
funannotate train -i Caur_3166.Tw_acuzr.Nano_clean_sort_mask.fasta -o fun \
--left Caur007.nohuman.R1.fastq Caur008.nohuman.R1.fastq Caur009.nohuman.R1.fastq \
--right Caur007.nohuman.R2.fastq Caur008.nohuman.R2.fastq Caur009.nohuman.R2.fastq \
--single Caur007.nohuman.S.fastq Caur008.nohuman.S.fastq Caur009.nohuman.S.fastq \
--jaccard_clip --species "Candida auris" --stranded RF\ #check strandedness info if using RNASeq libraries
--strain 3166_m3 --cpus 36
module unload funannotate signalp eggNOG-mapper python/3.7
Keeping these modules loaded somehow conflcits with 'funannotate predict'
echo "step 3: Predict"
funannotate predict -i Caur_3166.Tw_acuzr.Nano_clean_sort_mask.fasta -o fun \
--species "Candida auris" --strain 3166_m3 --cpus 36 --genemark_mode ES
#training parameters: fun/predict_results/candida_auris_3166_m3.parameters.json
#add species parameters to database
funannotate species -s candida_auris_3166_m3 -a fun/predict_results/candida_auris_3166_m3.parameters.json
echo "step 4: Update"
funannotate update -i fun --cpus 36
echo "step 5: Protein Programs"
#Interproscanc
module load interproscan
cp -r /usr/local/apps/interproscan /data/$USER/Files4RF
cd /data/$USER/Files4RF
funannotate iprscan -i fun --cpus 36 -m local --iprscan_path /data/$USER/Files4RF/interproscan/5.42-78.0/interproscan_app
Results are here fun/annotate_misc/iprscan.xml
This will hang a long time at 0%, but should finish running after a few hours
module load eggNOG-mapper
cp -r /usr/local/apps/eggNOG-mapper /data/$USER/Files4RF
#Signalp
module load signalp
funannotate remote -i fun -m antismash -e [email protected]
funannotate remote -i fun -m phobius -e [email protected]
#usually doesn't run for some reason - needs to be resolved
echo "step 6: Annotate"
funannotate annotate -i fun --cpus 36
GFF3 data to be visualized in R.