-
Notifications
You must be signed in to change notification settings - Fork 40
RNA seq
This document explains how to run the piPipes RNA-seq pipeline and how to interpret the output.
This pipeline provides analysis on genes and transposons abundance at the transcriptome level using paired-end RNA-seq reads generated by Next Generation Sequencing.
The RNA-seq pipeline contains two modes: single-sample mode and dual-sample mode.
Note: this pipeline can also be utilized for general gene quantification.
# require internet access
piPipes install -g mm9
# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
# --split-3 split the left and right read from paired-end RNA-seq data
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
# the two libraries are two replicates from the same genotype
piPipes rna
# or
piPipes rna -h
# -l: left read of RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
piPipes rna \
-l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
-r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
# Or using single-end mode _1 is in the same direction, so option -L is provided
piPipes rna \
-i Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
-L \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
piPipes rna \
-i Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
piPipes_debug rna \
-l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
-r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
# -l: left read of RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
# -L: ligation based RNA-seq construction method.
# \1 has the same direction as the original RNA.
# this option is by default off, meaning that dUTP based method is by default.
# -c: number of CPUs to use; using more CPU will significantly increase the speed
# -B: rounds of batch algorithm to use in eXpress. Use default unless it takes too long
# Read more in:
# Roberts A and Pachter L (2012). Streaming fragment assignment for real-time
# analysis of sequencing experiments. Nature Methods.
piPipes_debug rna \
-l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
-r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-c 24 \
-B 15 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
The output directory should contain the following directories and files:
rRNA_mapping/
input_read_files/
cufflinks_output/
bigWig/
htseq_count/
genome_mapping/
Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
pdfs/
direct_transcriptome_mapping/
Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
contains the basic statistical information on the library
$ cat Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
total_input_reads: 32594566
rRNA_reads: 722247
genomie_mapper_reads: 26732381
genomie_unique_mapper_reads: 23749430
genomie_multiple_mapper_reads: 2982951
genomie_unmappable_reads: 5139938
rRNA_mapping/
contains log file of rRNA mapping (Bowtie2)
$ cat Zamore.RSQ.amyb_het.testis.14dpp.rRNA.log
32594566 reads; of these:
32594566 (100.00%) were paired; of these:
31872319 (97.78%) aligned concordantly 0 times
722247 (2.22%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
2.22% overall alignment rate
input_read_files/
contains input Fastq file for genome mapping by STAR
# x_rRNA means rRNA mappable sequences have been removed
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.1.fq
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.2.fq
genome_mapping/
contains data for genome mapping
# outputs of STAR. Please see its document for detailed explanation
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.SJ.out.tab
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.progress.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.final.out
# unmappable sequence in Fastq format, might be used to map sequences that
# are not in the assembled genome, like transgene
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate1
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate2
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.STAR.log
# genome alignment file, sorted by COORDINATES
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam.bai
# genome aligment file in normalized.bedpe format and bed12 format
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12
# normalized bedpe format is a modified BEDPE format (please see BEDTools document for BEDPE)
# briefly, each line represents an alignment for a pair.
# each field, from left to right, represents:
# chromosome of mate1
# start of mate1 (0-based, included)
# end of mate1 (0-based, not included)
# chromosome of mate2
# start of mate2 (0-based, included)
# end of mate2 (0-based, not included)
# name of this pair, different from small RNA pipeline, reads in RNA-seq pipeline are not compiled
# normalized read. If this pair can be mapped to N loci in the genome, this number if 1/N
# Note that we called STAR using --outFilterMultimapScoreRange 1, thus only the best mappers
# are reported
# strand of mate1
# strand of mate2
$ head Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
chr15 89135450 89135627 chr15 89135620 89135802 FCC1GJJACXX:6:1101:15553:94247 1 + -
chr5 64702364 64702464 chr5 64702308 64702408 FCC1GJJACXX:6:1101:16706:94222 1 - +
chr5 54045627 54045727 chr5 54045421 54045521 FCC1GJJACXX:6:1101:15621:94056 1 - +
chr9 21440366 21440708 chr9 21440233 21440333 FCC1GJJACXX:6:1101:17366:94052 1 - +
chr3 97943583 97945464 chr3 97943552 97945433 FCC1GJJACXX:6:1101:16755:94111 1 - +
chrM 5706 5806 chrM 5596 5696 FCC1GJJACXX:6:1101:17540:94078 1 - +
chr15 12195313 12195413 chr15 12193559 12195305 FCC1GJJACXX:6:1101:16406:94075 1 - +
chrM 1230 1330 chrM 1176 1276 FCC1GJJACXX:6:1101:16781:94117 1 - +
chr12 70260428 70260528 chr12 70260297 70260397 FCC1GJJACXX:6:1101:17995:94098 0.5 - +
chr12 70462221 70462321 chr12 70462352 70462452 FCC1GJJACXX:6:1101:17995:94098 0.5 + -
# normalized bedpe file with only unique mappers
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe
# unique mappers in bed12 format. \1 and \2 occupy two lines
# the strand has been reset to follow the original RNA:
# for dUTP method, \1 has been reversed; for ligation method, \2 has been reversed
$ head -2 Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12
chr10 3000542 3000642 FCC1GJJACXX:6:1202:3559:62122/2 1 + 3000542 3000642 255,0,0 1 100 0
chr10 3000597 3000697 FCC1GJJACXX:6:1202:3559:62122/1 1 + 3000597 3000697 255,0,0 1 100 0
cufflinks_output/
contains the output from Cufflinks
No transposon sequence is included in this GTF file, so the output can be used for general purpose
# piPipes ran cufflinks using gene only GTF without transposon annotation
# as well as --compatible-hits-norm option.
# the depth calculated by cufflinks is used in the entire pipeline to represent depth
# in most of piRNA mutants, transposon reads have huge amount of increase
# using gene compatible reads is less biased than using total reads
# see cufflinks document for more information
skipped.gtf
transcripts.gtf
isoforms.fpkm_tracking
# this document contains FPKM values for annotated gene
genes.fpkm_tracking
Zamore.RSQ.amyb_het.testis.14dpp.cufflinks.log
bigWig/
contains bigWig files for UCSC genome browser.
The signal contains unique mappers only and has been normalized to the library depth, which is calculated by cufflinks using annotation compatible reads.
# Watson and Crick strands are separated
# bedGraph files are intermediates file. they might be removed in the future version
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bigWig
htseq_count/
contains outputs from HTSeq
# HTSeq is an alternative abundance calculation tool of Cufflinks and eXpress
# Different from Cufflinks and eXpress, which use EM-algorithm to assign multi-mappers to different
# transcript isoforms, HTSeq only count unique mappers.
# We uses a annotation comprised of Genes, piRNA cluster and repBase (repeatMasker)
# strict and union are two ways HTSeq supports, please see HTSeq document for more information
# S and AS represent sense and antisense;
# We realized that in dUTP method, ~10% of reads lost their strand information
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.AS.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.AS.out
direct_transcriptome_mapping/
contains output from eXpress
Different from the "genome mapping + Cufflinks/HTSeq" strategy, this method directly aligns input reads to transcriptome and estimates transcript abundance from there.
Same as Cufflinks, eXpress also uses "EM algorithm" for multiple-mappers assignment.
# direct alignment by Bowtie2
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.log
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.bam
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.sorted.unique.bed
# transposon sizes file, required to make bigWig
transposon.sizes
# bigWig file for each transcripts (gene, piRNA cluster and repBase transposon)
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bigWig
# ParaFly file, ignore
bigWigSummary.para
bigWigSummary.para.completed
# summary file used to draw plots representing signal across different
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.bigWig.summary
# output of eXpress calculation. No normalization is used in results.xprs.
# Please see its document for detailed explanation.
# Note that the data will be organized and used in the dual-sample mode
# we suggest to look at data there
results.xprs
params.xprs
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.eXpress.log
# normalized results, using the gene compatible reads calculated by Cufflinks
results.xprs.normalized
Example 2. Run dual-sample mode to compare RNA-seq libraries from two samples, each with two biological replicates
# download all the data: two replicates for each samples mybl1+/- and mybl1-/-
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep1 SRR765647
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep2 SRR765648
# run the four libraries separately
piPipes rna \
-l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
-r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
piPipes rna \
-l Zamore.RSQ.amyb_het.testis.14dpp.rep2_1.fastq.gz \
-r Zamore.RSQ.amyb_het.testis.14dpp.rep2_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
1> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stdout \
2> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stderr
piPipes rna \
-l Zamore.RSQ.amyb_mut.testis.14dpp.rep1_1.fastq.gz \
-r Zamore.RSQ.amyb_mut.testis.14dpp.rep1_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out \
1> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stdout \
2> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stderr
piPipes rna \
-l Zamore.RSQ.amyb_mut.testis.14dpp.rep2_1.fastq.gz \
-r Zamore.RSQ.amyb_mut.testis.14dpp.rep2_2.fastq.gz \
-g mm9 \
-o Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
1> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stdout \
2> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stderr
# -a: comma delimited directories with RNA-seq single library mode, for sample A
# -b: comma delimited directories with RNA-seq single library mode, for sample B
# -g: mouse mm9 genome
# -o: output directory
# -A: name for sample A, used in output
# -B: name for sample B, used in output
piPipes rna2 \
-a Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
-b Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
-g mm9 \
-o Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp \
-A amybHet14 \
-B amybMut14 \
1> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stdout \
2> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stderr
The dual-sample mode is very simple:
For genes quantification, it uses Cuffdiff
to call differentially expressed genes/transcripts from genome-mapping approach and cummeRbund
to generate figures.
For transposon quantification, it generates scatter-plot from eXpress
output.
# output
cuffdiff_output/
pdfs/
amybHet14.results.xprs
amybMut14.results.xprs
cuffdiff_output/
contains output of Cuffdiff
.
# Cuffdiff and cummeRbund output
var_model.info
amyb.cuffdiff.log
isoform_exp.diff
tss_group_exp.diff
gene_exp.diff
cds_exp.diff
splicing.diff
promoters.diff
cds.diff
isoforms.fpkm_tracking
tss_groups.fpkm_tracking
cds.fpkm_tracking
# fpkm values for genes
genes.fpkm_tracking
# fpkm values for isoforms
isoforms.count_tracking
tss_groups.count_tracking
cds.count_tracking
genes.count_tracking
isoforms.read_group_tracking
tss_groups.read_group_tracking
cds.read_group_tracking
genes.read_group_tracking
read_groups.info
run.info
bias_params.info
cuffData.db
pdfs/
contains pdfs
# the following files are generated by Cuffdiff + cummeRbund, for genes
# they should be self-explanatory
# please see cummeRbund document for detailed information
amyb.genes.csDensity.pdf
amyb.genes.csBoxplot.pdf
amyb_amybHet14_vs_amybMut14.genes.csScatter.pdf
amyb_amybHet14_vs_amybMut14.genes.csVolcano.pdf
amyb.genes.csHeatMap_top50.pdf
amyb.genes.csExpressionBarplot_top50.pdf
amyb.isoforms.csDensity.pdf
amyb.cummeRbund.log
amyb.isoforms.csBoxplot.pdf
# the following files are generated from eXpress output; for gene + piRNA cluster + transposon
# summary file
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance.csv
# this scatter-plot has each dot representing a gene isoforms, a non-coding RNA, piRNA cluster and
# transposon usually we notice that gene and non-coding RNA align pretty well on the diagonal
# but transposon reads shifted pretty badly to one side
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance1.pdf
# this is a simplified version using contour lines to represent gene and non-coding RNA transcripts
# because we found that there are too many dots in the previous pdf and very hard to manipulate/view
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance2.pdf
Please see example figures from the Github Wiki page or our manuscript.
##Flowchart and example figures from our manuscript