Merge pull request #68 from PF2-pasteur-fr/development

Development
PF2-pasteur-fr · Jul 10, 2019 · e2efcdf · e2efcdf
2 parents f8ba2d9 + 6150a86
commit e2efcdf
Show file tree

Hide file tree

Showing 9 changed files with 67 additions and 46 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,14 +1,15 @@
 Package: SARTools
 Type: Package
 Title: Statistical Analysis of RNA-Seq Tools
-Version: 1.6.8
-Date: 2019-05-29
+Version: 1.6.9
+Date: 2019-07-10
 Author: Marie-Agnes Dillies and Hugo Varet
 Maintainer: Hugo Varet <[email protected]>
 Depends: R (>= 3.3.0), DESeq2 (>= 1.12.0), edgeR (>= 3.12.0), xtable
 Imports: stats, utils, graphics, grDevices, knitr, rmarkdown (>= 1.4), SummarizedExperiment, S4Vectors, limma, genefilter (>= 1.44.0)
 Suggests: optparse
-VignetteBuilder: knitr
+biocViews: Software
+VignetteBuilder: knitr, rmarkdown
 Encoding: latin1
 Description: Provide R tools and an environment for the statistical analysis of RNA-Seq projects: load and clean data, produce figures, perform statistical analysis/testing with DESeq2 or edgeR, export results and create final report.
 License: GPL-2

diff --git a/NEWS b/NEWS
@@ -1,3 +1,9 @@
+CHANGES IN VERSION 1.6.9
+------------------------
+	o new vignette style
+	o print duplicated feature ids alongwith the error message in loadCountData()
+	o README update as Bioconductor dependencies are now automatically installed
+
 CHANGES IN VERSION 1.6.8
 ------------------------
 	o fixed a bug to build the vignette when installing the package from GitHub

diff --git a/R/loadCountData.R b/R/loadCountData.R
@@ -36,15 +36,21 @@ loadCountData <- function(target, rawDir="raw", skip=0, featuresToRemove=c("alig
   rawCounts <- read.table(file.path(rawDir, files[1]), sep="\t", quote="\"", header=header, skip=skip, stringsAsFactors=FALSE)
   rawCounts <- rawCounts[,c(idCol, countsCol)]
   colnames(rawCounts) <- c("Id", labels[1])
-  if (any(duplicated(rawCounts$Id))) stop("Duplicated feature names in ", files[1])
+  if (any(duplicated(rawCounts$Id))){
+    stop("Duplicated feature names in ", files[1], ": ", 
+         paste(unique(rawCounts$Id[duplicated(rawCounts$Id)]), collapse=", "))
+  }
   cat("Loading files:\n")
   cat(files[1],": ",length(rawCounts[,labels[1]])," rows and ",sum(rawCounts[,labels[1]]==0)," null count(s)\n",sep="")
 
   for (i in 2:length(files)){
     tmp <- read.table(file.path(rawDir, files[i]), sep="\t", quote="\"", header=header, skip=skip, stringsAsFactors=FALSE)
     tmp <- tmp[,c(idCol, countsCol)]
     colnames(tmp) <- c("Id", labels[i])
-    if (any(duplicated(tmp$Id))) stop("Duplicated feature names in ", files[i])
+    if (any(duplicated(tmp$Id))){
+      stop("Duplicated feature names in ", files[i], ": ", 
+           paste(unique(tmp$Id[duplicated(tmp$Id)]), collapse=", "))
+    }
     rawCounts <- merge(rawCounts, tmp, by="Id", all=TRUE)
     cat(files[i],": ",length(tmp[,labels[i]])," rows and ",sum(tmp[,labels[i]]==0)," null count(s)\n",sep="")
   }

diff --git a/README.md b/README.md
@@ -13,16 +13,15 @@ How to install SARTools?
 In addition to the SARTools package itself, the workflow requires the installation of several packages: DESeq2, edgeR, genefilter, xtable and knitr (all available online, see the dedicated webpages). SARTools needs R version 3.3.0 or higher, DESeq2 1.12.0 or higher and edgeR 3.12.0 or higher: old versions of DESeq2 or edgeR may be incompatible with SARTools.
 
 To install the SARTools package from GitHub, open a R session and:
-- install DESeq2, edgeR and genefilter with `if (!requireNamespace("BiocManager")){install.packages("BiocManager")}` and `BiocManager::install(c("DESeq2", "edgeR", "genefilter"))` (if not installed yet)
-- install devtools with `install.packages("devtools")` (if not installed yet)
+- Install devtools with `install.packages("devtools")` (if not installed yet)
 - Notes:
 
 	- Ubuntu users may have to install some libraries (libxml2-dev, libcurl4-openssl-dev and libssl-dev) to be able to install DESeq2 and devtools
 	- Some users may have to install the pandoc and pandoc-citeproc libraries to be able to generate the final HTML reports
 
-- for Windows users only, install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) or check that it is already installed (needed to build the package)
-- load the devtools R package with `library(devtools)`
-- run `install_github("PF2-pasteur-fr/SARTools", build_opts="--no-resave-data")`
+- For Windows users only, install [Rtools](https://cran.r-project.org/bin/windows/Rtools/) or check that it is already installed (needed to build the package)
+- Load the devtools R package with `library(devtools)`
+- Run `install_github("PF2-pasteur-fr/SARTools", build_opts="--no-resave-data")`
 
 ### Using Conda
 

diff --git a/template_script_DESeq2.r b/template_script_DESeq2.r
@@ -2,7 +2,7 @@
 ### R script to compare several conditions with the SARTools and DESeq2 packages
 ### Hugo Varet
 ### March 20th, 2018
-### designed to be executed with SARTools 1.6.7
+### designed to be executed with SARTools 1.6.9
 ################################################################################
 
 ################################################################################

diff --git a/template_script_DESeq2_CL.r b/template_script_DESeq2_CL.r
@@ -2,7 +2,7 @@
 ### R script to compare several conditions with the SARTools and DESeq2 packages
 ### Hugo Varet
 ### March 20th, 2018
-### designed to be executed with SARTools 1.6.7
+### designed to be executed with SARTools 1.6.9
 ### run "Rscript template_script_DESeq2_CL.r --help" to get some help
 ################################################################################
 

diff --git a/template_script_edgeR.r b/template_script_edgeR.r
@@ -2,7 +2,7 @@
 ### R script to compare several conditions with the SARTools and edgeR packages
 ### Hugo Varet
 ### March 20th, 2018
-### designed to be executed with SARTools 1.6.7
+### designed to be executed with SARTools 1.6.9
 ################################################################################
 
 ################################################################################

diff --git a/template_script_edgeR_CL.r b/template_script_edgeR_CL.r
@@ -2,7 +2,7 @@
 ### R script to compare several conditions with the SARTools and edgeR packages
 ### Hugo Varet
 ### May 16th, 2018
-### designed to be executed with SARTools 1.6.7
+### designed to be executed with SARTools 1.6.9
 ### run "Rscript template_script_edgeR_CL.r --help" to get some help
 ################################################################################
 

diff --git a/vignettes/SARTools.rmd b/vignettes/SARTools.rmd
@@ -1,16 +1,26 @@
-<!--
-%\VignetteEngine{knitr::knitr}
-\usepackage[utf8]{inputenc}
-%\VignetteIndexEntry{tutorial}
--->
-
-# SARTools vignette for the differential analysis of 2 or more conditions with DESeq2 or edgeR
-
-SARTools version: `r packageVersion("SARTools")`
-
-Authors: M.-A. Dillies and H. Varet ([email protected]) - Transcriptome and Epigenome Platform, Institut Pasteur, Paris
-
-Website: https://github.com/PF2-pasteur-fr/SARTools
+---
+title: "SARTools vignette for the differential analysis of 2 or more conditions with DESeq2 or edgeR"
+author: "M.-A. Dillies and H. Varet ([email protected]) - Transcriptome and Epigenome Platform, Institut Pasteur, Paris"
+date: "`r format(Sys.Date(), '%m/%d/%Y')`"
+abstract: >
+  [SARTools](https://github.com/PF2-pasteur-fr/SARTools) (version `r packageVersion("SARTools")`) is a R package dedicated to the differential analysis of RNA-seq data. It provides tools to generate descriptive and diagnostic graphs, to run the differential analysis with one of the well known DESeq2 or edgeR packages and to export the results into easily readable tab-delimited files. It also facilitates the generation of a HTML report which displays all the figures produced, explains the statistical methods and gives the results of the differential analysis. Note that SARTools does not intend to replace DESeq2 or edgeR: it simply provides an environment to go with them. For more details about the methodology behind DESeq2 or edgeR, the user should read their documentations and papers.
+
+output:
+  rmarkdown::html_document:
+    highlight: pygments
+    toc: true
+    fig_width: 5
+vignette: >
+  %\VignetteIndexEntry{tutorial}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+  %\usepackage[utf8]{inputenc}
+---
+
+```{r setup, echo=FALSE, results="hide"}
+knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png",
+                      message=FALSE, error=FALSE, warning=TRUE)
+```
 
 ## 1 Introduction
 
@@ -141,28 +151,28 @@ For a variety of reasons, it might happen that some sample names are erroneously
 
 The first tool to detect the inversion is the SERE statistic [11] since its goal is to measure the similarity between samples. The SERE values obtained are displayed on the lower triangle of the figure 1. We clearly observe that KO3 is more similar to WT1 (SERE=1.7) than to KO2 (3.4), which potentially reveals a problem within the samples under study. The same phenomenon happens with WT3 which is more similar to KO1 (1.6) than to WT1 (4.59).
 
-![figures/inversionpairwiseScatter.png](figures/inversionpairwiseScatter.png)
+![](figures/inversionpairwiseScatter.png)
 
 Figure 1: pairwise scatter plot and SERE statistics when inverting samples.
 
 The clustering can also help detect such an inversion of samples. Indeed, on the dendrogram, samples from the same biological condition are supposed to cluster together while samples from two different biological conditions should group only at the final step of the algorithm. Figure 2 (left) shows the dendrogram obtained: we can see that KO3 clusters immediately with WT1 and WT2 while WT3 clusters with KO1 and KO2. 
 
 The Principal Component Analysis on the right panel of figure 2 (or the Multi-Dimensional Scaling plot) is a tool which allows exploration of the structure of the data. Samples are displayed on a two dimensional graph which can help the user to assess the distances between samples. The PCA presented here leads to the same conclusion as the dendrogram. 
 
-![figures/inversionclusterPCA.png](figures/inversionclusterPCA.png)
+![](figures/inversionclusterPCA.png)
 
 Figure 2: clustering dendrogram (left) and PCA (right) when inverting samples.
 
 Finally, when testing for differential expression, if two samples have been inverted during the process, the histogram of the raw p-values can have an unexpected shape. Instead of having a uniform distribution, with a possible peak at 0 for the differentially expressed features, the distribution may be skewed toward the right (figure 3).
 
-![figures/inversionrawpHist.png](figures/inversionrawpHist.png)
+![](figures/inversionrawpHist.png)
 
 Figure 3: raw p-values histogram when inverting samples.
 
 ### 4.2 Batch effect
-A batch effect is a source of variation in the counts due to splitting the whole sample set into subgroups during the wet-lab part of the experiment. To illustrate this phenomenon, figure 4 shows the results of the clustering and of the PCA for an experiment with 12 samples: 6 WT and 6 KO labeled from 1 to 6 within each condition. 
+A batch effect is a source of variation in the counts due to splitting the whole sample set into subgroups during the wet-lab part of the experiment. To illustrate this phenomenon, figure 4 shows the results of the clustering and of the PCA for an experiment with 12 samples: 6 WT and 6 KO labeled from 1 to 6 within each condition.
 
-![figures/batchclusterPCA.png](figures/batchclusterPCA.png)
+![](figures/batchclusterPCA.png)
 
 Figure 4: clustering dendrogram (left) and PCA (right) with a batch effect.
 
@@ -175,11 +185,11 @@ Warning: batch effects can be taken into account only if they do not confound wi
 ### 4.3 Number of reads and outliers
 A sample with a total number of reads or a number of null counts too much different from the others may reveal a problem during the experiment, the sequencing or the alignment. The user can check this in the two first barplots of the HTML report (total number of reads and percentage of null counts). Moreover, such a sample will probably be outlier on the PCA/MDS plot, i.e. it will fall far from the other samples. It will often be preferable to remove it from the statistical analysis. For example, the figures 5 and 6 illustrate this phenomenon and suggest the removal of sample WT3 from the analysis.
 
-![figures/outlierbarplots.png](figures/outlierbarplots.png)
+![](figures/outlierbarplots.png)
 
 Figure 5: WT3 has a small total number of reads (left) and a high percentage of null counts (right).
 
-![figures/outlierPCA.png](figures/outlierPCA.png)
+![](figures/outlierPCA.png)
 
 Figure 6: WT3 falls far from the other samples on the first factorial plane of the PCA.
 
@@ -189,7 +199,7 @@ It may happen that some features (ribosomal RNA for example) take up a large num
 ### 4.5 Normalization parameter (with DESeq2)
 In order to normalize the counts, DESeq2 computes size factors. There are two options to compute them: `"median"` (default) or `"shorth"`. The default parameter often works well but the HTML report contains a figure which allows an assessment of the quality of the estimation of the size factors: there is one histogram per sample with a vertical red line corresponding to the value of the size factor. If the estimation of the size factor is correct, the red line must fall on the mode of the histogram for each sample. If this is not the case, the user should use the `"shorth"` parameter. Results with `"median"` and `"shorth"` for the same sample are given on figure 7 for an experiment where it was preferable to use `"shorth"`.
 
-![figures/diagSF.png](figures/diagSF.png)
+![](figures/diagSF.png)
 
 Figure 7: size factor diagnostic for one sample with `"median"` (left) and `"shorth"` (right).
 
@@ -206,25 +216,24 @@ The user can try the R scripts `template_script_DESeq2.r` and `template_script_e
 
 ## Bibliography
 
-[1] Anders S, Huber W. **Differential expression analysis for sequence count data**. *Genome Biology*. 2010; doi:10.1186/gb-2010-11-10-r106.
+[1] Anders S, Huber W. **Differential expression analysis for sequence count data**. *Genome Biology*. 2010.
 
-[2] Love M, Huber W, Anders S. **Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2**. *Genome Biology*. 2014; doi:10.1186/s13059-014-0550-8.
+[2] Love M, Huber W, Anders S. **Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2**. *Genome Biology*. 2014.
 
-[3] Robinson M, McCarthy DJ, Smyth GK. **edgeR: a Bioconductor package for differential expression analysis of digital gene expression data**. *Bioinformatics*. 2009; doi:10.1093/bioinformatics/btp616.
+[3] Robinson M, McCarthy DJ, Smyth GK. **edgeR: a Bioconductor package for differential expression analysis of digital gene expression data**. *Bioinformatics*. 2009.
 
-[4] Anders S, Pyl TP, Huber W. **HTSeq - A Python framework to work with high-throughput sequencing data**. *Bioinformatics*. 2014; doi:10.1093/bioinformatics/btu638.
+[4] Anders S, Pyl TP, Huber W. **HTSeq - A Python framework to work with high-throughput sequencing data**. *Bioinformatics*. 2014.
 
-[5] Liao Y, Smyth GK and Shi W. **featureCounts: an efficient general purpose program for assigning sequence reads to genomic features**. *Bioinformatics*, 2014; doi:10.1093/bioinformatics/btt656.
+[5] Liao Y, Smyth GK and Shi W. **featureCounts: an efficient general purpose program for assigning sequence reads to genomic features**. *Bioinformatics*, 2014.
 
-[6] Ritchie ME, Phipson B, Wu D, et al. **limma powers differential expression analyses for RNA-sequencing and microarray studies**. *Nucleic Acids Research*. 2015; doi:10.1093/nar/gkv007.
+[6] Ritchie ME, Phipson B, Wu D, et al. **limma powers differential expression analyses for RNA-sequencing and microarray studies**. *Nucleic Acids Research*. 2015.
 
-[7] Cook RD. **Detection of Influential Observation in Linear Regression**. *Technometrics*. 1977; DOI:10.1080/00401706.2000.10485981.
+[7] Cook RD. **Detection of Influential Observation in Linear Regression**. *Technometrics*. 1977.
 
-[8] Bourgon R, Gentleman R and Huber W. **Independent filtering increases detection power for high-throughput experiments**. *PNAS*. 2010; doi:10.1073/pnas.0914005107.
+[8] Bourgon R, Gentleman R and Huber W. **Independent filtering increases detection power for high-throughput experiments**. *PNAS*. 2010.
 
-[9] Benjamini Y and Hochberg Y. **Controlling the false discovery rate: a practical and powerful approach to multiple testing**. *Journal of the Royal Statistical Society B*. 1995; doi:10.2307/2346101.
+[9] Benjamini Y and Hochberg Y. **Controlling the false discovery rate: a practical and powerful approach to multiple testing**. *Journal of the Royal Statistical Society B*. 1995.
 
 [10] Benjamini Y and Yekutieli D. **The control of the false discovery rate in multiple testing under dependency**. *Annals of Statistics*. 2001.
 
-[11] Schulze SK, Kanwar R, Golzenleuchter M, et al. **SERE: Single-parameter quality control and sample comparison for RNA-Seq**. *BMC Genomics*. 2012; doi:10.1186/1471-2164-13-524.
-
+[11] Schulze SK, Kanwar R, Golzenleuchter M, et al. **SERE: Single-parameter quality control and sample comparison for RNA-Seq**. *BMC Genomics*. 2012.