Skip to content

Differential Expression for RNA Seq: Starting from Fastqs

Benilton Carvalho edited this page Jul 28, 2015 · 2 revisions

Pre-requisites

This task can only be completed if the following pre-requisites are available/installed on the user's computer:

  1. R version 3.1.0 or greater
  2. RStudio Desktop 0.98.1028 or greater
  3. Bioconductor packages (Rqc, Rsubread, DESeq2 and genefilter) and an R/CRAN package (RColorBrewer)
    • To install the packages, open RStudio and run the following commands:
    source('http://www.bioconductor.org/biocLite.R')
    biocLite(c('ShortRead', 'Rqc', 'Rsubread', 'DESeq2', 'genefilter', 'RColorBrewer'))
    

Identify Samples

Create a directory called fastq and place these FastQ files there.

Get the Reference Genome

Create a directory called 'reference' and place these files (FASTA + GTF) there.

Identify FASTQ files

Load R and from within it, list all the files from the fastq directory.

library(ShortRead)
f1regex = "\\_R1\\.fastq\\.gz"
f2regex = "\\_R2\\.fastq\\.gz"
fastqs1 = list.files('fastq', pattern='R1', full=TRUE)
fastqs2 = list.files('fastq', pattern='R2', full=TRUE)
samples = gsub(f1regex, "", basename(fastqs1))

Initial Quality Control Using Rqc

library(Rqc)
out = 'qcresults'
dir.create(out)
rqc('fastq', pattern=f1regex, openBrowser=FALSE, outdir=out, file='rqc_r1')
rqc('fastq', pattern=f2regex, openBrowser=FALSE, outdir=out, file='rqc_r2')

Mapping Reads with Rsubread

All aligners, before running the actual mapping, require a genome index to be created. Therefore, our first task is to build the index associated to the RN5 genome that we downloaded to the reference directory.

library(Rsubread)
idx = 'reference/rn5index'
ref = list.files('reference', pattern='fa$', full=TRUE)
buildindex(basename=idx, reference=ref)

Now, having the genome index, we can easily map the (paired-end) fastq files to the reference genome.

dir.create('bams')
n = length(fastqs1)
for (i in 1:n){
  outfile = file.path('bams', paste0(samples[i], '.bam'))
  sorted = file.path('bams', paste0(samples[i], '.sorted'))
  align(idx, readfile1=fastqs1[i], readfile2=fastqs2[i], input_format='gzFASTQ', output_file=outfile, nthreads=4)
  sortBam(outfile, sorted)
  indexBam(paste0(sorted, '.bam'))
}

Feature Counting - Creating Count Table

Having aligned the reads, we choose to count the features of interest. The resulting table is what we need later to model the data and identify genes that are candidates for differential expression. The count table can be obtained in a number of ways. Here, we use the Rsubread::featureCounts method, but the user can use HTSeq (Python) and the easyRNASeq Bioconductor package.

bamfiles = list.files('bams', pattern='sorted\\.bam$', full=TRUE)
gtf = list.files('reference', pattern='gtf', full=TRUE)
fcounts = featureCounts(bamfiles, annot.ext=gtf, isPairedEnd=TRUE, isGTFAnnotationFile=TRUE, nthreads=4, ignoreDup=TRUE)
actual = fcounts$counts
head(actual)
tail(actual)

Preparing the Data for DESeq2

The DESeq2 package implements tools for one to process RNA-Seq data. To load a package in R, we use the library command. Several messages may appear and they describe the set of dependencies that are required by DESeq2.

library(DESeq2)

Prior to handling the count table, we need to define a table containing the relevant phenotypic information that will be used during data modelling. Below, we use the command data.frame to create such table. Each column can contain data from different types (e.g., numbers on one column and text on another). Here, we define two columns: group and lane. The available samples are divided in two groups (control and treated). The lane information is an important feature that allows for better variance control.

pd <- data.frame(group=rep(c('control', 'treated'), each=3),
                 lane=rep(c('lane1', 'lane2', 'lane3'), 2))
pd

Note that the phenotypic table can also be created using spreadsheet software (like Excel, Numbers, Calc). For simplicity, the user should save the contents of the file as a tab-delimited file. If this is done, then the read.table command can be used to read the resulting file, as it was done earlier to read the count table.

We use the DESeqDataSetFromMatrix command to load both our data and phenotypic information into a DESeq2-compatible object. With this command, we use a formula to define the variables that should be incorporated into the model for differential expression. On the expression below, we specify that both group and lane (that are stored in the pd object) should be accounted when modelling the gene counts stored on the actual object.

rawData <- DESeqDataSetFromMatrix(actual, pd, ~ group + lane)

Modelling Data with DESeq2

The rawData object defined above is compatible with DESeq2 and contains all the information relevant to the analysis: the counts (actual), the phenotypic data (pd) and the variables to be accounted for (~ group + lane). Below, we use the DESeq command to fit the negative binomial model to each of the genes. The workflow implemented in DESeq includes:

  • Size-factor estimation
  • Dispersion estimation
  • Mean-dispersion association
  • Final dispersion estimation
  • Model fitting and testing
dds <- DESeq(rawData)
resultsNames(dds)

Below, we use the results command to extract the results associated to the comparisons of interest. Below, with the contrast argument, we specify that we are interested in comparing the treated samples to the control ones and that the variable group is the one that contains such information. We sort the resulting table using the adjusted p-values (padj) and this will allow the easier identification of genes that are candidates for differential expression.

res <- results(dds, contrast=c('group', 'treated', 'control'))
res <- res[order(res$padj),]
res

The object res contains a number of information about each of the genes and the statistics associated to the experiment. Below, we define a threshold (arbitrary) to select candidates for differential expression. Specifically, we select genes whose adjusted p-value are below the 0.10 threshold. The subset command scans the res object for genes with p-values smaller then 0.10. The rownames command extract the names of the genes that were just selected.

top10 <- subset(res, padj < 1 & !is.na(padj))
rns10 <- rownames(top10)
rns10

Data Visualization

A strategy for data visualization is the MA-plot. On the Y-axis, it shows the log-ratio (fold-change in the log-scale) comparing the treated group to the control one. On the X-axis, it presents the average expression (across groups). In red, it shows the genes that were selected for having adjusted p-values under 0.10.

plotMA(res, alpha=0.10, main='Treated vs. Control')

Other visualization strategies can be seriously affected by differences in gene-specific variance. The DESeq2 package implements the regularized-log transformation that will equalize the data for visualization (only visualization). The argument blind specifies that the algorithm should not use the phenotypic information for variance stabilization.

rld <- rlogTransformation(dds, blind=TRUE)

Below, we use Principal Component Analysis (PCA) to visualize the equalized (via regularized-log transformation) data. This strategy is helpful on identifying biases and undetected grouping patterns on the data. The intergroup argument can be used to define how the data points are colored.

plotPCA(rld, intgroup='group')
plotPCA(rld, intgroup='lane')
plotPCA(rld, intgroup=c('group', 'lane'))

We use the RColorBrewer package to define a set of 100 colors to create a gradient from red to blue (RdBu) that will be used to present the expression values of the genes identified above as candidates for differential expression.

library(RColorBrewer)
hmcol <- rev(colorRampPalette(brewer.pal(9, "RdBu"))(100))

Using the variance-stabilized data, we extract the normalized counts for visualization. These counts refer only to the genes whose adjusted p-values did not exceed 0.10. Using the heatmap command, we generate a heatmap with the normalized counts of these genes. Because they are normalized data, we do not need to scale the observations (scale='none'). The Rowv and Colv arguments are used so the rows and columns are not re-ordered.

mat10 <- assay(rld, normalized=TRUE)
library(genefilter)
bigVar = order(rowVars(mat10), decreasing=TRUE)[1:10]
heatmap(mat10[bigVar,], col=hmcol, scale='none', Rowv=FALSE, Colv=FALSE)

Creating a Report

In this section, we the the ReportingTools package to simplify the creation of an HTML report.

library(ReportingTools)
myreport <- HTMLReport(shortName='DESeq2_Analysis', title='Sample Analysis', reportDirectory='exampleReport')
publish(res, myreport, DataSet=dds, factor=colData(dds)$group, pvalueCutoff=0.10, reportDir='exampleReport')
finish(myreport)