From 3c3a757acdd42159178e0a0cba66fbec7f70e634 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Herv=C3=A9=20Pag=C3=A8s?= <hpages@fredhutch.org>
Date: Mon, 17 Jun 2019 00:27:59 -0700
Subject: [PATCH] some additions (also remove trailing blanks)

---
 DESCRIPTION                              |   1 +
 vignettes/Bioc_201_herveQian_LazyRep.Rmd | 155 +++++++++++++----------
 2 files changed, 86 insertions(+), 70 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index c6b7f30..675349f 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,5 +1,6 @@
 Package: BiocWorkshops2019
 Type: Package
+Encoding: UTF-8
 Title: Bioconductor 2019 Conference Workshops
 Version: 3.9.1
 Description: Lazy representation of very large genomic Data resources in R/Bioconductor
diff --git a/vignettes/Bioc_201_herveQian_LazyRep.Rmd b/vignettes/Bioc_201_herveQian_LazyRep.Rmd
index 28ca96b..cf723d3 100644
--- a/vignettes/Bioc_201_herveQian_LazyRep.Rmd
+++ b/vignettes/Bioc_201_herveQian_LazyRep.Rmd
@@ -41,9 +41,9 @@ options(showTailLines=3)
 ## Workshop Description
 
 In this workshop, we will learn about _Bioconductor_ data containers that
-use lazy data representations and their application in real data analysis.
-We will use some real data examples generated from DNA/RNA-seq, to
-demonstrate the representation and comprehension of large scale
+use lazy data representations and about their application in real data
+analysis. We will use some real data examples generated from DNA/RNA-seq,
+to demonstrate the representation and comprehension of large scale
 (out-of-memory) genomic datasets. The workshop will be mainly instructor-led
 live demo with completely working examples. Instructions and notes will be
 included.
@@ -62,23 +62,23 @@ code chunks.
 
 ## Time outline
 
-Activity | Time 
----------|------
-DelayedArray and HDF5Array objects | 40m
-Specialized DelayedArray backends | 10m
-VariantExperiment | 20m
-DelayedDataFrame | 10m 
-SQLDataFrame | 10m
-Future directions | 5-10m
-	
+| Activity | Time |
+|----------|------|
+| DelayedArray and HDF5Array objects | 40m |
+| Specialized DelayedArray backends | 10m |
+| VariantExperiment | 20m |
+| DelayedDataFrame | 10m |
+| SQLDataFrame | 10m |
+| Future directions | 5-10m |
+
 ## Learning objectives
-- Create `DelayedArray` with different backends (**to be modified by Hervé**)
-- Use `VariantExperiment` to represent your DNA-seq data. 
+- Introduction to `DelayedArray` objects and related concepts (seed, delayed operations, realization, block-processed operations, in-memory vs on-disk backends, etc...)
+- Use `VariantExperiment` to represent your DNA-seq data.
 - Statistical analysis of variant data (VCF) using `VariantExperiment`.
-- Create `SQLDataFrame` from in-memory data, text format file, and database tables. 
+- Create `SQLDataFrame` from in-memory data, text format file, and database tables.
 - Use `SQLDataFrame` to represent and manipulate godb database tables.
-		
-##  _R/Bioconductor_ packages 
+
+##  _R/Bioconductor_ packages
 
 to-be-used in this workshop:
 
@@ -92,12 +92,12 @@ library(SeqArray)
 library(GDSArray)
 library(VCFArray)
 library(DelayedDataFrame)
-library(SQLDataFrame) 
+library(SQLDataFrame)
 library(VariantAnnotation)
 library(org.Hs.eg.db)
 library(TxDb.Hsapiens.UCSC.hg38.knownGene)
 library(DBI)
-#> library(VariantExperiment) 
+#> library(VariantExperiment)
 ```
 
 
@@ -140,7 +140,7 @@ an HDF5 dataset).
 
 Here is an example of a DelayedArray object where the seed is an HDF5 dataset:
 
-```{r}
+```{r, HDF5Array}
 library(HDF5Array)
 A2 <- as(a, "HDF5Array")   # write 'a' to an HDF5 file
 A2
@@ -194,7 +194,7 @@ identical(seed(A2), seed(A3))
 
 Note that `A3`'s memory footprint is still very small because all the data
 is still on disk:
-```
+```{r}
 object.size(A3)
 
 showtree(A3)
@@ -227,18 +227,23 @@ we perform on the object.
 
 Summary of _delayed_ operations supported on DelayedArray objects:
 
-Operation | Comment
-----------|------
-`rbind`, `cbind` | none
-All the members of the `Ops` group | e.g. `+`, `-`, `==`, `<`, etc...
-All the members of the `Math` and `Math2` groups | e.g. `log`, `floor`, etc...
-`sweep` | none
-`!` | none
-`is.na`, `is.finite`, `is.infinite`, `is.nan` | none
-`lengths` | only meaningful when object is of type `list`
-`nchar`, `tolower`, `toupper`, `grepl`, `sub`, `gsub` | none
-`pmax2` and `pmin2` | replacements for `base::pmax` and `base::pmin`
-`dnorm`, `dbinom`, `dpois`, `dlogis` | ... and related functions
+| Operations | Note |
+|------------|------|
+| `x[i_1, i_2, ..., i_n]` | _Multi-dimensional_ single bracket subsetting |
+| `t`, `aperm` | |
+| `rbind`, `cbind` |  |
+| `sweep` | |
+| All the members of the `Ops` group | e.g. `+`, `-`, `==`, `<`, etc... |
+| All the members of the `Math` and `Math2` groups | e.g. `log`, `floor`, etc... |
+| `is.na`, `is.finite`, `is.infinite`, `is.nan` | |
+| `pmax2` and `pmin2` | replacements for `base::pmax` and `base::pmin` |
+| `dnorm`, `pnorm`, `qnorm` | The Normal Distribution |
+| `dbinom`, `pbinom`, `qbinom` | The Binomial Distribution |
+| `dpois`, `ppois`, `qpois` | The Poisson Distribution |
+| `dlogis`, `plogis`, `qlogis` | The Logistic Distribution |
+| `nchar`, `tolower`, `toupper`, `grepl`, `sub`, `gsub` | |
+| `!` | |
+| `lengths` | only meaningful when object is of type `list` |
 
 #### Realization
 
@@ -248,16 +253,26 @@ TODO (by Hervé)
 
 When operations are not _delayed_ they are _block-processed_.
 
-TODO (by Hervé): expnd this
+TODO (by Hervé): expand this
 
 Summary of _block-processed_ operations supported on DelayedArray objects:
 
-Operation | Comment
-----------|--------
- |
+| Operations | Note |
+|------------|------|
+| `x[i]` | _Linear_ single bracket subsetting |
+| `anyNA`, `which` | |
+| `unique`, `table` | |
+| All the members of the `Summary` group | e.g. `max`, `sum`, `all`, etc... |
+| `mean` | |
+| `apply` | |
+| `row/colSums`, `row/colMeans` | Base row/col summarization |
+| `row/colMaxs`, `row/colProds`, etc... | Extended row/col summarization |
+| `rowsum`, `colsum` | |
+
+Note: the _Extended row/col summarization_ functions are defined in the
+matrixStats package and the methods for DelayedMatrix objects are implemented
+in the [DelayedMatrixStats package](https://bioconductor.org/packages/DelayedMatrixStats).
 
-TODO (by Hervé): Refer to _DelayedMatrixStats_ package for more matrix
-summarization operations on DelayedMatrix objects.
 
 ### In-memory vs on-disk objects
 
@@ -326,7 +341,7 @@ be found [here](https://www.biostat.washington.edu/sites/default/files/modules/G
 [SNPRelate]: https://bioconductor.org/packages/SNPRelate
 [SeqArray]: https://bioconductor.org/packages/SeqArray
 
-The GDS file inside R uses hierarchical structure: 
+The GDS file inside R uses hierarchical structure:
 
 ```{r, GDSopen}
 file <- seqExampleFileName("gds")
@@ -334,7 +349,7 @@ g1 <- gdsfmt::openfn.gds(file)
 g1
 ```
 
-Specific functions in manipulating the GDS file: 
+Specific functions in manipulating the GDS file:
 
 Package | Function | Description
 --------|----------|--------------
@@ -392,7 +407,7 @@ file <- SeqArray::seqExampleFileName("gds")
 GDSArray(file, "genotype/data")
 ```
 Note that a GDSMatrix will be returned by default for 2-dimensional
-GDSArray. 
+GDSArray.
 
 ```{r, GDSMatrix}
 GDSArray(file, "phase/data")
@@ -432,7 +447,7 @@ ga[1:3, 10:15, ]
 ga[c(TRUE, FALSE), , ]
 ```
 
-some numeric calculation: 
+some numeric calculation:
 ```{r, GDSArray numeric}
 dp <- GDSArray(file, "annotation/format/DP/data")
 log(dp)
@@ -457,7 +472,7 @@ infrastructure and methods.
 
 
 ### Extension of `DelayedArray` with VCF backend
-	
+
 
 #### Introduction
 
@@ -484,7 +499,7 @@ second dimension being 'samples'. This feature is consistent with the
 assay data saved in SummarizedExperiment, and makes the [VCFArray][]
 package easily interoperable with other established _Bioconductor_
 data infrastructure.
- 
+
 [VCFArray]: https://bioconductor.org/packages/VCFArray
 
 #### vcfFields()
@@ -508,7 +523,7 @@ vcfFields(fl)
 
 We can construct the VCFArray object from the same input as
 `vcfFields()` methods (the character string for VCF file path,
-'VcfFile' object or 'RangedVcfStack' object). 
+'VcfFile' object or 'RangedVcfStack' object).
 
 With a simplest example, we can construct a VCFArray object for the
 'GT' data entry in the provided VCF file with arguments of 'file' and
@@ -593,7 +608,7 @@ familiar _R/Bioconductor_ paradigms, and supports smooth
 interoperability with widely used bioinformatics tools that are
 available on _R/Bioconductor_.
 
-[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment 
+[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment
 [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
 [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment
 [SeqArray]: https://bioconductor.org/packages/SeqArray
@@ -674,7 +689,7 @@ are in DelayedDataFrame class (with column data in GDSArray format).
 ```{r, VE vcf accessors, eval=FALSE}
 assay(ve, 1)
 rowData(ve)
-``` 
+```
 
 For VCF input, Users could also have the opportunity to save the
 sample related annotation directly into the VariantExperiment object,
@@ -698,7 +713,7 @@ ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
 dim(ve2)
 ```
 For the above example, only 1000 variants are read into the
-VariantExperiment object, starting from the position of 101. 
+VariantExperiment object, starting from the position of 101.
 
 The support of VCFArray in the coercion method of
 `makeVariantExperimentFromVCF(useVCFArray = TRUE)` is under work
@@ -709,8 +724,8 @@ VCFArray objects.
 
 Since the VCFArray was written internally using [VariantAnnotation][],
 we are also working to enable the statistical funtions and methods on
-the VariantExperiment objects with VCF backends. 
-	
+the VariantExperiment objects with VCF backends.
+
 ### Basic methods
 
 Consistent with SummarizedExperiment, The VariantExperiment supports
@@ -780,7 +795,7 @@ ve2 <- loadVariantExperiment(dir=a)
 gdsfile(ve2)
 ```
 
-Now we are all set for any downstream analysis as needed. 
+Now we are all set for any downstream analysis as needed.
 
 ```{r, VE stats demo, eval=FALSE}
 head(hwe(ve1))
@@ -801,7 +816,7 @@ Here is a list of the statistical functions with brief description:
 statistical functions | Description
 --------------------- | ------------
 seqAlleleFreq         | Calculates the allele frequencies
-seqAlleleCount        | Calculates the allele counts 
+seqAlleleCount        | Calculates the allele counts
 seqMissing            | Calculates the missing rate for variant/sample
 seqNumAllele          | Calculates the number of alleles (for ref/alt allele)
 hwe                   | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants
@@ -814,7 +829,7 @@ countSingletons       | Count singleton variants for each sample
 heterozygosity        | Calculate heterozygosity rate by sample or by variants
 homozygosity          | Calculate homozygosity rate by sample or by variants
 meanBySample          | Calculate the mean value of a variable by sample over all variants
-isSNV                 | Flag a single nucleotide variant 
+isSNV                 | Flag a single nucleotide variant
 isVariant             | Locate which samples are variant for each site
 
 Here are some examples in calculating the sample missing rate, hwe,
@@ -839,8 +854,8 @@ countSingletons(ve)
 ```
 
 
-## DelayedDataFrame for delayed representation of metadata 
-	
+## DelayedDataFrame for delayed representation of metadata
+
 ### Introduction
 
 The development of [DelayedArray][] has made it possible for on-disk
@@ -860,7 +875,7 @@ _R/Bioconductor_ paradigm, and at the same time, all data (in the unit
 of columns) could be optionally saved on-disk (e.g., in
 [DelayedArray][] structure with any back-end). Common operations like
 constructing, subsetting, splitting, combining could be done using the
-familiarDataFrame metaphor. 
+familiarDataFrame metaphor.
 
 This feature of DelayedDataFrame could enable efficient on-disk
 reading and processing of the large-scale annotation files, and use
@@ -871,12 +886,12 @@ signicantly less memory than in-memory _R/Bioconductor_ alternatives.
 [DelayedArray]: https://bioconductor.org/packages/DelayedArray
 [GDSArray]: https://bioconductor.org/packages/GDSArray
 [HDF5Array]: https://bioconductor.org/packages/HDF5Array
-	
+
 ### DelayedDataFrame class
 
 DelayedDataFrame extends the DataFrame data structure, and has
 similar behavior in terms of construction, subsetting, splitting,
-combining, rownames setting, and etc.. 
+combining, rownames setting, and etc..
 
 The DelayedDataFrame columns supports the [DelayedArray][] (and
 direct extensions). Here we use 1-dimensional GDSArray objects as
@@ -952,7 +967,7 @@ lazyIndex(ddf)@index
 Whenever an operation is done (e.g., subsetting), the 'listData' slot
 inside the DelayedDataFrame stays the same, only the lazyIndex
 slot will be updated, so that the show method, further statistical
-calculation will be applied to the subsetting data set. 
+calculation will be applied to the subsetting data set.
 
 For example, here we subset the DelayedDataFrame object `ddf` to
 keep only the first 5 rows, and see how the lazyIndex works. As
@@ -969,11 +984,11 @@ lazyIndex(ddf1)
 nrow(ddf1)
 ```
 
-- coercion 
+- coercion
 
 Only when direct realization call is invoked, (e.g., `DataFrame()`, or
 `as.list()`, the lazyIndex will be realized and the object of new
-class returned. 
+class returned.
 
 For example, when DelayedDataFrame is coerced into a DataFrame object,
 the 'listData' slot will be updated according to the lazyIndex slot.
@@ -997,7 +1012,7 @@ lazyIndex(ddf2)
 lazyIndex(rbind(ddf1, ddf2))
 ```
 
-- cbind 
+- cbind
 
 While 'cbind'-ing DelayedDataFrame objects will keep all existing
 lazyIndex of input arguments and carry into the new DelayedDataFrame
@@ -1034,7 +1049,7 @@ RSQLite. But theoretically, we would implement SQLDataFrame so that
 users could choose to use different database backends, including
 MySQL, PostgreSQL, MariaDB, and Google's BigQuery, etc.
 
-[dbplyr]: https://cran.r-project.org/web/packages/dbplyr/ 
+[dbplyr]: https://cran.r-project.org/web/packages/dbplyr/
 [dplyr]: https://cran.r-project.org/web/packages/dplyr/
 
 ### SQLDataFrame class
@@ -1156,7 +1171,7 @@ head(geneids[["symbol"]])
 SQLDataFrame objects can be subsetted in a similar way of
 DataFrame following the usual _R_ conventions, with numeric,
 character or logical vectors; logical vectors are recycled to the
-appropriate length. 
+appropriate length.
 
 **NOTE**, use `drop=FALSE` explicitly for single column subsetting if
 you want to return a SQLDataFrame object, otherwise, the default
@@ -1169,7 +1184,7 @@ geneids[c(1, 5, 9), ]
 ```
 
 List style subsetting is also allowed to extract certain columns from
-the SQLDataFrame object which returns SQLDataFrame by default. 
+the SQLDataFrame object which returns SQLDataFrame by default.
 
 ```{r}
 geneids[2]  #> list style subsetting
@@ -1208,7 +1223,7 @@ from different sources.
 
 Here we use a simple example to show that how to use the gene ids to
 match the transcription ids using `*_join` functions, and return a
-table with matching positions for each transcripts. 
+table with matching positions for each transcripts.
 
 ```{r SQLDF_tx}
 dbname.tx <- dbfile(TxDb.Hsapiens.UCSC.hg38.knownGene)
@@ -1251,7 +1266,7 @@ from `union()` or `rbind()`. So when the `saveSQLDataFrame()` function
 was called, a database table will be written into a physical disk
 space and have the unique records.
 
-Accessor function is made avaible for this slot: 
+Accessor function is made avaible for this slot:
 ```{r}
 tblData(res)
 ```
@@ -1269,7 +1284,7 @@ dbname(res)
 dbtable(res)
 ```
 
-- dbnrows and dbconcatKey 
+- dbnrows and dbconcatKey
 
 The dbnrows slot saves the number of rows corresponding to the
 tblData, and dbconcatKey saves the realized (concatenated if multiple)
@@ -1281,7 +1296,7 @@ dbnrows(res)
 head(dbconcatKey(res))
 ```
 
-- indexes slot 
+- indexes slot
 
 The indexes slots is an unnamed list saving the row and column indexes
 respectively corresponding to the tblData, so that the SQLDataFrame