The TOPMed RNA-Seq pipeline was converted to CWL as a deliverable, so that a CWL version of the pipeline is available through a public Tool Registry Service. Specifically, this workflow is available through Dockstore.org.
This document describes team Helium's implementation of the TOPMed RNA-seq pipeline as described in commit b65c22b. The CWL workflow is registered publicly on Dockstore here. This CWL workflow has 4 components, described below.
A checker workflow registered on Dockstore is also available to verify operation of this pipeline. See information here.
The scripts and settings used for the TOPMed MESA RNA-seq pilot match commit 725a2bc, packaged here.
The intended audience is any scientist familiar with RNA-seq analysis who wishes to run it on the TOPMed public access data.
Run the pipeline locally with small test input files. Creating these sample input files is described here.
- Dockstore CLI, CWLTool, Git, Git LFS and Docker should be installed.
- Clone this GitHub repository:
git clone https://github.com/heliumdatacommons/cwl_workflows.git
- Decompress sample files.
./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
- Use this input file or edit the file paths based on your local machine paths.
- Run the workflow with CWLTool.
cwltool topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl \
topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
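Before running the steps above, a quick check that the listed tools are on the PATH can save a failed run. The sketch below assumes the usual executable names (`dockstore`, `cwltool`, `git`, `git-lfs`, `docker`); adjust them if your installation differs.

```shell
#!/usr/bin/env bash
# Report which of the required command-line tools are missing from PATH.
# The executable names are assumptions based on each tool's default install.

missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools dockstore cwltool git git-lfs docker)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so newlines collapse into spaces
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```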
A checker workflow for the TOPMed RNA-seq pipeline is published on Dockstore here. It is described in more detail in this README.md
The sample data sets intended to be used as input are available through this BioProject.
- Direct DataSets link.
Creating downsampled datasets for testing is described here.
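For a quick local test, one common way to shrink a FASTQ file is to keep only its first N reads. This is a generic sketch and not necessarily the procedure in the linked description; the function name and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Keep the first N reads of a gzipped FASTQ file (4 lines per read).

head_fastq() {
  local src="$1" dst="$2" reads="$3"
  gzip -dc < "$src" | head -n "$((reads * 4))" | gzip -c > "$dst"
}

# Usage (hypothetical file names):
#   head_fastq sample_1.fastq.gz sample_1.small.fastq.gz 10000
#   head_fastq sample_2.fastq.gz sample_2.small.fastq.gz 10000
```

Applying the same read count to both mate files keeps the pairs in sync, provided both files list the reads in the same order.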
OUTPUTS describes the files generated by the TOPMed RNA-Seq pipeline for each sample.
- Alignment: STAR 2.5.3a
- STAR CWL File
- Python script run by CWL file in Docker container: run_STAR.py
- INPUT: STAR Index and sample FASTQs. See example input file.
- See here to create STAR Index
- OUTPUT: Aligned RNA-seq reads in BAM format.
- Post-processing: Picard 2.9.0 MarkDuplicates
- Picard MarkDuplicates CWL File
- Python script run by CWL file in Docker container: run_MarkDuplicates.py
- INPUT: Aligned BAM file from STAR. See example input
- OUTPUT: Marked duplicates BAM file.
- Gene quantification and quality control: RNA-SeQC 1.1.9
- RNA-SeQC CWL File
- Python script run by CWL file in Docker container: run_rnaseqc.py
- INPUT: Genome FASTA, GTF file, Aligned BAM file from STAR. See example input
- OUTPUT:
- Gene-level expression quantifications based on a collapsed version of a reference transcript annotation, provided as read counts and TPM.
- Standard quality control metrics derived from the aligned reads.
- Transcript quantification: RSEM 1.3.0
- RSEM CWL File
- Python script run by CWL file in Docker container: run_RSEM.py
- INPUT: RSEM reference files, BAM with reads aligned to transcriptome from STAR. See example input
- See here to create RSEM reference directory.
- OUTPUT: Transcript-level expression quantifications, provided as TPM, expected read counts, and isoform percentages.
- Utilities: SAMtools 1.6 and HTSlib 1.6
Many other software packages offer functionality similar to this pipeline. For detailed information on RNA-seq analysis steps and other software options, please see A survey of best practices for RNA-seq data analysis.
The GTEx pipeline Docker container is currently republished on Docker Hub.
- Original: Dockerfile
- Local: Dockerfile
- Docker Hub Link
Obtaining docker image.
- Docker should be installed. See here if not.
- Pull the image from Docker Hub
docker pull heliumdatacommons/topmed-rnaseq:latest
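To confirm that the pulled image contains the tool versions listed under the pipeline components, each tool can be asked for its version inside the container. This is a minimal sketch: it assumes `samtools` and `STAR` are on the container's PATH, and the `DRY_RUN` switch and `in_container` helper are illustrative names, not part of the pipeline. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Print (or run) version checks for tools inside the pipeline image.
IMAGE="heliumdatacommons/topmed-rnaseq:latest"
DRY_RUN="${DRY_RUN:-1}"   # default: only print the docker commands

in_container() {
  # Run a command inside a throwaway container, or print it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then
    echo "docker run --rm $IMAGE $*"
  else
    docker run --rm "$IMAGE" "$@"
  fi
}

in_container samtools --version
in_container STAR --version
```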
The following steps assume:
- You have downloaded the following files:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
$ gunzip gencode.v26.annotation.gtf.gz
$ wget https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz
$ tar -xzf Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz
- You have obtained the Docker container described here
Create the index file using samtools faidx. Here, ~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.
docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
samtools faidx /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta
Create the dictionary file using Picard CreateSequenceDictionary. Here, ~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.
docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
java -jar /opt/picard-tools/picard.jar CreateSequenceDictionary \
R=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
O=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict
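Before building the STAR index or the RSEM reference, it can help to confirm that the FASTA, its `.fai` and `.dict` files, and the GTF all sit in one directory. The sketch below assumes the file names produced by the downloads and commands in this document; the function name and the `INPUT_DIR` variable are illustrative.

```shell
#!/usr/bin/env bash
# Check that the files needed for reference building sit in one directory.
# File names follow the downloads and commands shown in this document.

check_reference_dir() {
  local dir="$1"
  local fasta="Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC"
  local status=0
  local f
  for f in "$fasta.fasta" "$fasta.fasta.fai" "$fasta.dict" "gencode.v26.annotation.gtf"; do
    # -s is true only for files that exist and are not empty
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f"
      status=1
    fi
  done
  return "$status"
}

if check_reference_dir "${INPUT_DIR:-$HOME/input_files}"; then
  echo "reference files look complete"
fi
```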
- Create the .fai and .dict files for the Genome FASTA (both described above).
- The GTF file, Genome FASTA file, .fai and .dict should all be in the same directory. Use this directory as a volume mount when running Docker. We used input_files below.
- Run the following command:
docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
STAR --runMode genomeGenerate \
--genomeDir /input_files/star_index \
--genomeFastaFiles /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
--sjdbGTFfile /input_files/gencode.v26.annotation.gtf \
--sjdbOverhang 100 --runThreadN 10
- Upon completion, your STAR Index will be in the ~/input_files/star_index directory.
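The STAR manual suggests setting `--sjdbOverhang` to the read length minus 1, so the value 100 used above suits 101-base reads. For other data sets, the longest read in a gzipped FASTQ can be measured with standard tools. This is a minimal sketch; the function name is illustrative and the demo input is a two-read file generated on the spot.

```shell
#!/usr/bin/env bash
# Derive a --sjdbOverhang value (max read length - 1) from a gzipped FASTQ.

sjdb_overhang() {
  # Sequence lines are every 4th line, starting at line 2 of each record.
  gzip -dc < "$1" | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
                         END { print max - 1 }'
}

# Demo with a tiny two-read FASTQ (reads of 8 and 10 bases).
demo="$(mktemp)"
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' | gzip -c > "$demo"
echo "--sjdbOverhang $(sjdb_overhang "$demo")"   # prints: --sjdbOverhang 9
rm -f "$demo"
```

STAR releases of this era may also expect the `--genomeDir` directory to exist before `genomeGenerate` runs, so creating `~/input_files/star_index` with `mkdir -p` beforehand may be needed.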
- Create the .fai and .dict files for the Genome FASTA (both described above).
- The GTF file, Genome FASTA file, .fai and .dict should all be in the same directory. Use this directory as a volume mount when running Docker.
- Create the RSEM reference using rsem-prepare-reference:
docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq:latest \
rsem-prepare-reference --num-threads 4 \
--gtf /input_files/gencode.v26.annotation.gtf \
/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
/input_files/rsem_reference
- Upon completion, the RSEM reference directory will be in the ~/input_files/rsem_reference directory.
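`rsem-prepare-reference` treats its last argument as a prefix for the files it writes, so listing everything that starts with that prefix is a quick way to see what was produced. A minimal sketch, assuming the host path used above; the function name and the `RSEM_PREFIX` variable are illustrative.

```shell
#!/usr/bin/env bash
# List files or directories written under an RSEM reference prefix.

list_reference_files() {
  local prefix="$1"
  local f found=1
  for f in "$prefix"*; do
    # Without a match the glob stays literal, so test for existence.
    [ -e "$f" ] || continue
    found=0
    ls -ld "$f"
  done
  return "$found"
}

list_reference_files "${RSEM_PREFIX:-$HOME/input_files/rsem_reference}" \
  || echo "no files found for that prefix"
```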