-
Notifications
You must be signed in to change notification settings - Fork 58
Gathering input files
There are dozens of input files needed for these pipelines. These include things like reference genomes, aligner indices, gene annotations, and known variants. This guide aims to be a comprehensive walkthrough of where to find those files and/or how to create them. Code snippets are included for both mouse and human, using GRCh38/GRCm38 and Ensembl version 95. (We welcome contributions of additional species)
If you just want to grab a pre-built blob containing all of these files, visit the links below (work in progress)
To create them yourself, follow the links below, each labelled with key pipelines for which they're needed
- - Alignment
- - Germline variant calling
- - Somatic variant calling (exome/WGS)
- - RNAseq (bulk)
- - Bisulfite
Before running any snippets, make a directory where you'd like this cache to be stored, and set the BASEDIR
variable to point to it:
BASEDIR=/path/to/annotation_data_grch38_ens95
human
Note that we use a slightly modified version of the GRCh38 reference which N-masks a single contig. That contig contained a duplication of the U2AF1 gene locus, resulting in multi-mapping and loss of variant calls in that important cancer gene. The coordinate system is preserved and alignments should be otherwise unaffected.
mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCh38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz
mouse
mkdir -p $BASEDIR/reference_genome
wget -O $BASEDIR/reference_genome/all_sequences.fa.gz https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/reference_GRCm38/all_sequences.fa.gz
gunzip $BASEDIR/reference_genome/all_sequences.fa.gz
human or mouse
From docker image `mgibio/samtools-cwl:1.0.0`
/opt/samtools/bin/samtools faidx $BASEDIR/reference_genome/all_sequences.fa
human or mouse
From docker image `mgibio/picard-cwl:2.18.1`
java -jar /opt/picard/picard.jar CreateSequenceDictionary R=$BASEDIR/reference_genome/all_sequences.fa O=$BASEDIR/reference_genome/all_sequences.dict
human or mouse
perl -ne 'print "$1\t$2\n" if $_ =~ /SN:([^\s]+).+LN:(\d+)/' $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/reference_genome/all_sequences.genome
human
Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):
mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES homo_sapiens --ASSEMBLY GRCh38
If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>
mouse
Select the most recent [docker image from your desired ensembl version](https://hub.docker.com/r/ensemblorg/ensembl-vep/tags). Our current release uses `ensemblorg/ensembl-vep:release_95.0` Inside that container, run the following (mount the appropriate volumes if needed):
mkdir $BASEDIR/vep_cache
/usr/bin/perl /opt/vep/src/ensembl-vep/INSTALL.pl --CACHEDIR $BASEDIR/vep_cache --AUTO cf --SPECIES mus_musculus --ASSEMBLY GRCm38
If for some reason the desired cache version and docker image version in use do not match, add --CACHE_VERSION <version>
human
mkdir -p $BASEDIR/rna_seq_annotation
cd $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
gunzip Homo_sapiens.GRCh38.95.gtf.gz
The reference uses "chr" prefixes (chr1, chr2, chr3), while the Ensembl gtf does not (1, 2, 3), so we need to convert the gtf:
cd $BASEDIR/rna_seq_annotation/
wget https://raw.githubusercontent.com/chrisamiller/convertEnsemblGTF/master/convertEnsemblGTF.pl
mv $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig
perl convertEnsemblGTF.pl $BASEDIR/reference_genome/all_sequences.dict $BASEDIR/vep_cache/homo_sapiens/95_GRCh38/chr_synonyms.txt
$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf && rm -f $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf.orig
mouse
mkdir -p $BASEDIR/rna_seq_annotation
wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz ftp://ftp.ensembl.org/pub/release-95/gtf/mus_musculus/Mus_musculus.GRCm38.95.gtf.gz
gunzip $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf.gz
human
Note that we're not going to convert these coordinates from [1,2,3] to [chr1,chr2,chr3], since Kallisto doesn't include coordinates in it's output. If pseudobams from kallisto are desired, that conversion would need to be done.
wget -O $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz http://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
mouse
wget -O $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz ftp://ftp.ensembl.org/pub/release-95/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
human or mouse
From docker image: `mgibio/alignment_helper-cwl:1.0.0`
mkdir -p $BASEDIR/aligner_indices/bwamem_0.7.15
cd $BASEDIR/aligner_indices/bwamem_0.7.15
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do
ln -s $BASEDIR/reference_genome/$i .
done
/usr/local/bin/bwa index all_sequences.fa
human
In the `trinityctat/starfusion` docker container:
mkdir -p $BASEDIR/aligner_indices/star_fusion
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf --pfam_db current --dfam_db human --output_dir $BASEDIR/aligner_indices/star_fusion"
mouse
In the `trinityctat/starfusion` docker container:
mkdir -p $BASEDIR/aligner_indices/star_fusion
cd $BASEDIR/aligner_indices/star_fusion
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl --CPU 10 --genome_fa $BASEDIR/reference_genome/all_sequences.fa --gtf $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf --pfam_db current --dfam_db mouse --output_dir $BASEDIR/aligner_indices/star_fusion"
human or mouse
From docker image: mgibio/rnaseq
mkdir -p $BASEDIR/aligner_indices/hisat_2.1.0
cd $BASEDIR/aligner_indices/hisat_2.1.0
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do
ln -s $BASEDIR/reference_genome/$i .
done
echo "This is a placeholder file to pass to CWL to refer to all the index files and be a parameter to HISAT2." >GRCm38
/opt/hisat2/hisat2-2.1.0/hisat2-build -p 8 all_sequences.fa GRCm38"
human or mouse
From docker image: `mgibio/biscuit:0.3.8`
mkdir -p $BASEDIR/aligner_indices/biscuit_0.3.8
cd $BASEDIR/aligner_indices/biscuit_0.3.8
for i in all_sequences.fa all_sequences.fa.fai all_sequences.dict all_sequences.genome;do
ln -s $BASEDIR/reference_genome/$i .
done
/usr/bin/biscuit index all_sequences.fa
human
dbSNP/indels from the Mills/1000G/etc datasets for use with GATK/Mutect from the `google/cloud-sdk` image:
mkdir $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz $BASEDIR/known_variants
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi $BASEDIR/known_variants
human
Variants from the [Database of Canonical Mutations](http://docm.info) used for hotspot checking in somatic pipelines
wget -O $BASEDIR/known_variants/docm.vcf https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/docm.GRCh38.vcf
human
This command downloads the V3 of the genome [Gnomad Database](https://gnomad.broadinstitute.org) which has the allele frequencies from 70k WGS samples aligned to GRCh38. This file is ~236gb.
wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz
wget -O $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz.tbi https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz.tbi
lite version
Can create a lite version containing the population allele frequencies from the full vcf using using bcftools. Tested using bcftools version 1.9.bcftools annotate -x ^INFO/AF,INFO/AF_afr,INFO/AF_amr,INFO/AF_asj,INFO/AF_eas,INFO/AF_fin,INFO/AF_nfe,INFO/AF_oth,INFO/AF_sas --threads $THREADS --output-type z -o $BASEDIR/known_variants/gnomad-lite.genomes.r3.0.sites.vcf.gz $BASEDIR/known_variants/gnomad.genomes.r3.0.sites.vcf.gz
human
wget -O $BASEDIR/known_variants/somalier_GRCh38.vcf.gz https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz
human
From docker image `mgibio/rnaseq`
kallisto index -i $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.cdna.all.fa.gz
mouse
From docker image `mgibio/rnaseq`
kallisto index -i $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.kallisto.idx $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.cdna.all.fa.gz
human
From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`
cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'
mouse
From docker image `quay.io/biocontainers/bioconductor-biomart:2.42.0--r36_0`
cd $BASEDIR/rna_seq_annotation/
Rscript -e 'library(biomaRt);mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl", host="http://jan2019.archive.ensembl.org/");t2g <- getBM(attributes=c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name"), mart=mart);t2g <- t2g[order(t2g$ensembl_gene_id), ]; write.table(t2g,"ensembl95.transcriptToGene.tsv",sep="\t",row.names=F,quote=F)'
human
In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`
gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.refFlat.txt
mouse
In docker container: `quay.io/biocontainers/ucsc-gtftogenepred:377--h35c10e6_2`
gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf /dev/stdout | awk 'BEGIN { OFS="\t"} {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.refFlat.txt
human
cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Homo_sapiens.GRCh38.95.ribo_intervals
mouse
cat $BASEDIR/reference_genome/all_sequences.dict >$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals
grep 'gene_biotype "rRNA"' $BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.gtf | awk '$3 == "exon"' | cut -f1,4,5,7,9 | perl -lane '/transcript_id "([^"]+)"/ or die "no transcript_id on $.";print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >>$BASEDIR/rna_seq_annotation/Mus_musculus.GRCm38.95.ribo_intervals
human or mouse
mkdir $BASEDIR/miscellaneous
cd $BASEDIR/miscellaneous
wget https://xfer.genome.wustl.edu/gxfer1/project/cancer-genomics/illumina_multiplex.fa
human
cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/hg38_QC_assets.zip
unzip hg38_QC_assets.zip
mouse
cd $BASEDIR/aligner_indices/biscuit_0.3.8
wget http://zwdzwd.io/BISCUITqc/mm10_QC_assets.zip
unzip mm10_QC_assets.zip