-
Notifications
You must be signed in to change notification settings - Fork 0
Differential Expression for RNA Seq: Starting from Fastqs
This task can only be completed if the following pre-requisites are available/installed on the user's computer:
- R version 3.1.0 or greater
- RStudio Desktop 0.98.1028 or greater
- RStudio Desktop 0.98.1028 for Windows
- RStudio Desktop 0.98.1028 for Debian 6+/Ubuntu 10.04+ 32-bit
- RStudio Desktop 0.98.1028 for Debian 6+/Ubuntu 10.04+ 64-bit
- RStudio 0.98.1028 - Fedora 13+/openSUSE 11.4+ 32-bit
- RStudio 0.98.1028 - Fedora 13+/openSUSE 11.4+ 64-bit
- RStudio 0.98.1028 - Mac OS X 10.6+ 64-bit
- Bioconductor packages (
Rqc
,Rsubread
,DESeq2
andgenefilter
) and an R/CRAN package (RColorBrewer
)- To install the packages, open RStudio and run the following commands:
source('http://www.bioconductor.org/biocLite.R') biocLite(c('ShortRead', 'Rqc', 'Rsubread', 'DESeq2', 'genefilter', 'RColorBrewer'))
Create a directory called fastq
and place these FastQ files there.
Create a directory called 'reference' and place these files (FASTA + GTF) there.
Load R
and from within it, list all the files from the fastq
directory.
library(ShortRead)
f1regex = "\\_R1\\.fastq\\.gz"
f2regex = "\\_R2\\.fastq\\.gz"
fastqs1 = list.files('fastq', pattern='R1', full=TRUE)
fastqs2 = list.files('fastq', pattern='R2', full=TRUE)
samples = gsub(f1regex, "", basename(fastqs1))
library(Rqc)
out = 'qcresults'
dir.create(out)
rqc('fastq', pattern=f1regex, openBrowser=FALSE, outdir=out, file='rqc_r1')
rqc('fastq', pattern=f2regex, openBrowser=FALSE, outdir=out, file='rqc_r2')
All aligners, before running the actual mapping, require a genome index to be created. Therefore, our first task is to build the index associated to the RN5 genome that we downloaded to the reference
directory.
library(Rsubread)
idx = 'reference/rn5index'
ref = list.files('reference', pattern='fa$', full=TRUE)
buildindex(basename=idx, reference=ref)
Now, having the genome index, we can easily map the (paired-end) fastq files to the reference genome.
dir.create('bams')
n = length(fastqs1)
for (i in 1:n){
outfile = file.path('bams', paste0(samples[i], '.bam'))
sorted = file.path('bams', paste0(samples[i], '.sorted'))
align(idx, readfile1=fastqs1[i], readfile2=fastqs2[i], input_format='gzFASTQ', output_file=outfile, nthreads=4)
sortBam(outfile, sorted)
indexBam(paste0(sorted, '.bam'))
}
Having aligned the reads, we choose to count the features of interest. The resulting table is what we need later to model the data and identify genes that are candidates for differential expression. The count table can be obtained in a number of ways. Here, we use the Rsubread::featureCounts
method, but the user can use HTSeq
(Python) and the easyRNASeq
Bioconductor package.
bamfiles = list.files('bams', pattern='sorted\\.bam$', full=TRUE)
gtf = list.files('reference', pattern='gtf', full=TRUE)
fcounts = featureCounts(bamfiles, annot.ext=gtf, isPairedEnd=TRUE, isGTFAnnotationFile=TRUE, nthreads=4, ignoreDup=TRUE)
actual = fcounts$counts
head(actual)
tail(actual)
The DESeq2
package implements tools for one to process RNA-Seq data. To load a package in R, we use the library
command. Several messages may appear and they describe the set of dependencies that are required by DESeq2
.
library(DESeq2)
Prior to handling the count table, we need to define a table containing the relevant phenotypic information that will be used during data modelling. Below, we use the command data.frame
to create such table. Each column can contain data from different types (e.g., numbers on one column and text on another). Here, we define two columns: group and lane. The available samples are divided in two groups (control
and treated
). The lane information is an important feature that allows for better variance control.
pd <- data.frame(group=rep(c('control', 'treated'), each=3),
lane=rep(c('lane1', 'lane2', 'lane3'), 2))
pd
Note that the phenotypic table can also be created using spreadsheet software (like Excel, Numbers, Calc). For simplicity, the user should save the contents of the file as a tab-delimited file. If this is done, then the read.table
command can be used to read the resulting file, as it was done earlier to read the count table.
We use the DESeqDataSetFromMatrix
command to load both our data and phenotypic information into a DESeq2-compatible object. With this command, we use a formula to define the variables that should be incorporated into the model for differential expression. On the expression below, we specify that both group
and lane
(that are stored in the pd
object) should be accounted when modelling the gene counts stored on the actual
object.
rawData <- DESeqDataSetFromMatrix(actual, pd, ~ group + lane)
The rawData
object defined above is compatible with DESeq2
and contains all the information relevant to the analysis: the counts (actual
), the phenotypic data (pd
) and the variables to be accounted for (~ group + lane
). Below, we use the DESeq
command to fit the negative binomial model to each of the genes. The workflow implemented in DESeq includes:
- Size-factor estimation
- Dispersion estimation
- Mean-dispersion association
- Final dispersion estimation
- Model fitting and testing
dds <- DESeq(rawData)
resultsNames(dds)
Below, we use the results
command to extract the results associated to the comparisons of interest. Below, with the contrast
argument, we specify that we are interested in comparing the treated
samples to the control
ones and that the variable group
is the one that contains such information. We sort the resulting table using the adjusted p-values (padj
) and this will allow the easier identification of genes that are candidates for differential expression.
res <- results(dds, contrast=c('group', 'treated', 'control'))
res <- res[order(res$padj),]
res
The object res
contains a number of information about each of the genes and the statistics associated to the experiment. Below, we define a threshold (arbitrary) to select candidates for differential expression. Specifically, we select genes whose adjusted p-value are below the 0.10 threshold. The subset
command scans the res
object for genes with p-values smaller then 0.10. The rownames
command extract the names of the genes that were just selected.
top10 <- subset(res, padj < 1 & !is.na(padj))
rns10 <- rownames(top10)
rns10
A strategy for data visualization is the MA-plot. On the Y-axis, it shows the log-ratio (fold-change in the log-scale) comparing the treated
group to the control
one. On the X-axis, it presents the average expression (across groups). In red, it shows the genes that were selected for having adjusted p-values under 0.10.
plotMA(res, alpha=0.10, main='Treated vs. Control')
Other visualization strategies can be seriously affected by differences in gene-specific variance. The DESeq2
package implements the regularized-log transformation that will equalize the data for visualization (only visualization). The argument blind
specifies that the algorithm should not use the phenotypic information for variance stabilization.
rld <- rlogTransformation(dds, blind=TRUE)
Below, we use Principal Component Analysis (PCA) to visualize the equalized (via regularized-log transformation) data. This strategy is helpful on identifying biases and undetected grouping patterns on the data. The intergroup
argument can be used to define how the data points are colored.
plotPCA(rld, intgroup='group')
plotPCA(rld, intgroup='lane')
plotPCA(rld, intgroup=c('group', 'lane'))
We use the RColorBrewer
package to define a set of 100 colors to create a gradient from red to blue (RdBu
) that will be used to present the expression values of the genes identified above as candidates for differential expression.
library(RColorBrewer)
hmcol <- rev(colorRampPalette(brewer.pal(9, "RdBu"))(100))
Using the variance-stabilized data, we extract the normalized counts for visualization. These counts refer only to the genes whose adjusted p-values did not exceed 0.10. Using the heatmap
command, we generate a heatmap with the normalized counts of these genes. Because they are normalized data, we do not need to scale the observations (scale='none'
). The Rowv
and Colv
arguments are used so the rows and columns are not re-ordered.
mat10 <- assay(rld, normalized=TRUE)
library(genefilter)
bigVar = order(rowVars(mat10), decreasing=TRUE)[1:10]
heatmap(mat10[bigVar,], col=hmcol, scale='none', Rowv=FALSE, Colv=FALSE)
In this section, we the the ReportingTools
package to simplify the creation of an HTML report.
library(ReportingTools)
myreport <- HTMLReport(shortName='DESeq2_Analysis', title='Sample Analysis', reportDirectory='exampleReport')
publish(res, myreport, DataSet=dds, factor=colData(dds)$group, pvalueCutoff=0.10, reportDir='exampleReport')
finish(myreport)