diff --git a/DESCRIPTION b/DESCRIPTION index c6b7f30..675349f 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,5 +1,6 @@ Package: BiocWorkshops2019 Type: Package +Encoding: UTF-8 Title: Bioconductor 2019 Conference Workshops Version: 3.9.1 Description: Lazy representation of very large genomic Data resources in R/Bioconductor diff --git a/vignettes/Bioc_201_herveQian_LazyRep.Rmd b/vignettes/Bioc_201_herveQian_LazyRep.Rmd index 28ca96b..cf723d3 100644 --- a/vignettes/Bioc_201_herveQian_LazyRep.Rmd +++ b/vignettes/Bioc_201_herveQian_LazyRep.Rmd @@ -41,9 +41,9 @@ options(showTailLines=3) ## Workshop Description In this workshop, we will learn about _Bioconductor_ data containers that -use lazy data representations and their application in real data analysis. -We will use some real data examples generated from DNA/RNA-seq, to -demonstrate the representation and comprehension of large scale +use lazy data representations and about their application in real data +analysis. We will use some real data examples generated from DNA/RNA-seq, +to demonstrate the representation and comprehension of large scale (out-of-memory) genomic datasets. The workshop will be mainly instructor-led live demo with completely working examples. Instructions and notes will be included. @@ -62,23 +62,23 @@ code chunks. ## Time outline -Activity | Time ----------|------ -DelayedArray and HDF5Array objects | 40m -Specialized DelayedArray backends | 10m -VariantExperiment | 20m -DelayedDataFrame | 10m -SQLDataFrame | 10m -Future directions | 5-10m - +| Activity | Time | +|----------|------| +| DelayedArray and HDF5Array objects | 40m | +| Specialized DelayedArray backends | 10m | +| VariantExperiment | 20m | +| DelayedDataFrame | 10m | +| SQLDataFrame | 10m | +| Future directions | 5-10m | + ## Learning objectives -- Create `DelayedArray` with different backends (**to be modified by Hervé**) -- Use `VariantExperiment` to represent your DNA-seq data. +- Introduction to `DelayedArray` objects and related concepts (seed, delayed operations, realization, block-processed operations, in-memory vs on-disk backends, etc...) +- Use `VariantExperiment` to represent your DNA-seq data. - Statistical analysis of variant data (VCF) using `VariantExperiment`. -- Create `SQLDataFrame` from in-memory data, text format file, and database tables. +- Create `SQLDataFrame` from in-memory data, text format file, and database tables. - Use `SQLDataFrame` to represent and manipulate godb database tables. - -## _R/Bioconductor_ packages + +## _R/Bioconductor_ packages to-be-used in this workshop: @@ -92,12 +92,12 @@ library(SeqArray) library(GDSArray) library(VCFArray) library(DelayedDataFrame) -library(SQLDataFrame) +library(SQLDataFrame) library(VariantAnnotation) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(DBI) -#> library(VariantExperiment) +#> library(VariantExperiment) ``` @@ -140,7 +140,7 @@ an HDF5 dataset). Here is an example of a DelayedArray object where the seed is an HDF5 dataset: -```{r} +```{r, HDF5Array} library(HDF5Array) A2 <- as(a, "HDF5Array") # write 'a' to an HDF5 file A2 @@ -194,7 +194,7 @@ identical(seed(A2), seed(A3)) Note that `A3`'s memory footprint is still very small because all the data is still on disk: -``` +```{r} object.size(A3) showtree(A3) @@ -227,18 +227,23 @@ we perform on the object. Summary of _delayed_ operations supported on DelayedArray objects: -Operation | Comment -----------|------ -`rbind`, `cbind` | none -All the members of the `Ops` group | e.g. `+`, `-`, `==`, `<`, etc... -All the members of the `Math` and `Math2` groups | e.g. `log`, `floor`, etc... -`sweep` | none -`!` | none -`is.na`, `is.finite`, `is.infinite`, `is.nan` | none -`lengths` | only meaningful when object is of type `list` -`nchar`, `tolower`, `toupper`, `grepl`, `sub`, `gsub` | none -`pmax2` and `pmin2` | replacements for `base::pmax` and `base::pmin` -`dnorm`, `dbinom`, `dpois`, `dlogis` | ... and related functions +| Operations | Note | +|------------|------| +| `x[i_1, i_2, ..., i_n]` | _Multi-dimensional_ single bracket subsetting | +| `t`, `aperm` | | +| `rbind`, `cbind` | | +| `sweep` | | +| All the members of the `Ops` group | e.g. `+`, `-`, `==`, `<`, etc... | +| All the members of the `Math` and `Math2` groups | e.g. `log`, `floor`, etc... | +| `is.na`, `is.finite`, `is.infinite`, `is.nan` | | +| `pmax2` and `pmin2` | replacements for `base::pmax` and `base::pmin` | +| `dnorm`, `pnorm`, `qnorm` | The Normal Distribution | +| `dbinom`, `pbinom`, `qbinom` | The Binomial Distribution | +| `dpois`, `ppois`, `qpois` | The Poisson Distribution | +| `dlogis`, `plogis`, `qlogis` | The Logistic Distribution | +| `nchar`, `tolower`, `toupper`, `grepl`, `sub`, `gsub` | | +| `!` | | +| `lengths` | only meaningful when object is of type `list` | #### Realization @@ -248,16 +253,26 @@ TODO (by Hervé) When operations are not _delayed_ they are _block-processed_. -TODO (by Hervé): expnd this +TODO (by Hervé): expand this Summary of _block-processed_ operations supported on DelayedArray objects: -Operation | Comment -----------|-------- - | +| Operations | Note | +|------------|------| +| `x[i]` | _Linear_ single bracket subsetting | +| `anyNA`, `which` | | +| `unique`, `table` | | +| All the members of the `Summary` group | e.g. `max`, `sum`, `all`, etc... | +| `mean` | | +| `apply` | | +| `row/colSums`, `row/colMeans` | Base row/col summarization | +| `row/colMaxs`, `row/colProds`, etc... | Extended row/col summarization | +| `rowsum`, `colsum` | | + +Note: the _Extended row/col summarization_ functions are defined in the +matrixStats package and the methods for DelayedMatrix objects are implemented +in the [DelayedMatrixStats package](https://bioconductor.org/packages/DelayedMatrixStats). -TODO (by Hervé): Refer to _DelayedMatrixStats_ package for more matrix -summarization operations on DelayedMatrix objects. ### In-memory vs on-disk objects @@ -326,7 +341,7 @@ be found [here](https://www.biostat.washington.edu/sites/default/files/modules/G [SNPRelate]: https://bioconductor.org/packages/SNPRelate [SeqArray]: https://bioconductor.org/packages/SeqArray -The GDS file inside R uses hierarchical structure: +The GDS file inside R uses hierarchical structure: ```{r, GDSopen} file <- seqExampleFileName("gds") @@ -334,7 +349,7 @@ g1 <- gdsfmt::openfn.gds(file) g1 ``` -Specific functions in manipulating the GDS file: +Specific functions in manipulating the GDS file: Package | Function | Description --------|----------|-------------- @@ -392,7 +407,7 @@ file <- SeqArray::seqExampleFileName("gds") GDSArray(file, "genotype/data") ``` Note that a GDSMatrix will be returned by default for 2-dimensional -GDSArray. +GDSArray. ```{r, GDSMatrix} GDSArray(file, "phase/data") @@ -432,7 +447,7 @@ ga[1:3, 10:15, ] ga[c(TRUE, FALSE), , ] ``` -some numeric calculation: +some numeric calculation: ```{r, GDSArray numeric} dp <- GDSArray(file, "annotation/format/DP/data") log(dp) @@ -457,7 +472,7 @@ infrastructure and methods. ### Extension of `DelayedArray` with VCF backend - + #### Introduction @@ -484,7 +499,7 @@ second dimension being 'samples'. This feature is consistent with the assay data saved in SummarizedExperiment, and makes the [VCFArray][] package easily interoperable with other established _Bioconductor_ data infrastructure. - + [VCFArray]: https://bioconductor.org/packages/VCFArray #### vcfFields() @@ -508,7 +523,7 @@ vcfFields(fl) We can construct the VCFArray object from the same input as `vcfFields()` methods (the character string for VCF file path, -'VcfFile' object or 'RangedVcfStack' object). +'VcfFile' object or 'RangedVcfStack' object). With a simplest example, we can construct a VCFArray object for the 'GT' data entry in the provided VCF file with arguments of 'file' and @@ -593,7 +608,7 @@ familiar _R/Bioconductor_ paradigms, and supports smooth interoperability with widely used bioinformatics tools that are available on _R/Bioconductor_. -[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment +[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment [SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment [SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment [SeqArray]: https://bioconductor.org/packages/SeqArray @@ -674,7 +689,7 @@ are in DelayedDataFrame class (with column data in GDSArray format). ```{r, VE vcf accessors, eval=FALSE} assay(ve, 1) rowData(ve) -``` +``` For VCF input, Users could also have the opportunity to save the sample related annotation directly into the VariantExperiment object, @@ -698,7 +713,7 @@ ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000) dim(ve2) ``` For the above example, only 1000 variants are read into the -VariantExperiment object, starting from the position of 101. +VariantExperiment object, starting from the position of 101. The support of VCFArray in the coercion method of `makeVariantExperimentFromVCF(useVCFArray = TRUE)` is under work @@ -709,8 +724,8 @@ VCFArray objects. Since the VCFArray was written internally using [VariantAnnotation][], we are also working to enable the statistical funtions and methods on -the VariantExperiment objects with VCF backends. - +the VariantExperiment objects with VCF backends. + ### Basic methods Consistent with SummarizedExperiment, The VariantExperiment supports @@ -780,7 +795,7 @@ ve2 <- loadVariantExperiment(dir=a) gdsfile(ve2) ``` -Now we are all set for any downstream analysis as needed. +Now we are all set for any downstream analysis as needed. ```{r, VE stats demo, eval=FALSE} head(hwe(ve1)) @@ -801,7 +816,7 @@ Here is a list of the statistical functions with brief description: statistical functions | Description --------------------- | ------------ seqAlleleFreq | Calculates the allele frequencies -seqAlleleCount | Calculates the allele counts +seqAlleleCount | Calculates the allele counts seqMissing | Calculates the missing rate for variant/sample seqNumAllele | Calculates the number of alleles (for ref/alt allele) hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants @@ -814,7 +829,7 @@ countSingletons | Count singleton variants for each sample heterozygosity | Calculate heterozygosity rate by sample or by variants homozygosity | Calculate homozygosity rate by sample or by variants meanBySample | Calculate the mean value of a variable by sample over all variants -isSNV | Flag a single nucleotide variant +isSNV | Flag a single nucleotide variant isVariant | Locate which samples are variant for each site Here are some examples in calculating the sample missing rate, hwe, @@ -839,8 +854,8 @@ countSingletons(ve) ``` -## DelayedDataFrame for delayed representation of metadata - +## DelayedDataFrame for delayed representation of metadata + ### Introduction The development of [DelayedArray][] has made it possible for on-disk @@ -860,7 +875,7 @@ _R/Bioconductor_ paradigm, and at the same time, all data (in the unit of columns) could be optionally saved on-disk (e.g., in [DelayedArray][] structure with any back-end). Common operations like constructing, subsetting, splitting, combining could be done using the -familiarDataFrame metaphor. +familiarDataFrame metaphor. This feature of DelayedDataFrame could enable efficient on-disk reading and processing of the large-scale annotation files, and use @@ -871,12 +886,12 @@ signicantly less memory than in-memory _R/Bioconductor_ alternatives. [DelayedArray]: https://bioconductor.org/packages/DelayedArray [GDSArray]: https://bioconductor.org/packages/GDSArray [HDF5Array]: https://bioconductor.org/packages/HDF5Array - + ### DelayedDataFrame class DelayedDataFrame extends the DataFrame data structure, and has similar behavior in terms of construction, subsetting, splitting, -combining, rownames setting, and etc.. +combining, rownames setting, and etc.. The DelayedDataFrame columns supports the [DelayedArray][] (and direct extensions). Here we use 1-dimensional GDSArray objects as @@ -952,7 +967,7 @@ lazyIndex(ddf)@index Whenever an operation is done (e.g., subsetting), the 'listData' slot inside the DelayedDataFrame stays the same, only the lazyIndex slot will be updated, so that the show method, further statistical -calculation will be applied to the subsetting data set. +calculation will be applied to the subsetting data set. For example, here we subset the DelayedDataFrame object `ddf` to keep only the first 5 rows, and see how the lazyIndex works. As @@ -969,11 +984,11 @@ lazyIndex(ddf1) nrow(ddf1) ``` -- coercion +- coercion Only when direct realization call is invoked, (e.g., `DataFrame()`, or `as.list()`, the lazyIndex will be realized and the object of new -class returned. +class returned. For example, when DelayedDataFrame is coerced into a DataFrame object, the 'listData' slot will be updated according to the lazyIndex slot. @@ -997,7 +1012,7 @@ lazyIndex(ddf2) lazyIndex(rbind(ddf1, ddf2)) ``` -- cbind +- cbind While 'cbind'-ing DelayedDataFrame objects will keep all existing lazyIndex of input arguments and carry into the new DelayedDataFrame @@ -1034,7 +1049,7 @@ RSQLite. But theoretically, we would implement SQLDataFrame so that users could choose to use different database backends, including MySQL, PostgreSQL, MariaDB, and Google's BigQuery, etc. -[dbplyr]: https://cran.r-project.org/web/packages/dbplyr/ +[dbplyr]: https://cran.r-project.org/web/packages/dbplyr/ [dplyr]: https://cran.r-project.org/web/packages/dplyr/ ### SQLDataFrame class @@ -1156,7 +1171,7 @@ head(geneids[["symbol"]]) SQLDataFrame objects can be subsetted in a similar way of DataFrame following the usual _R_ conventions, with numeric, character or logical vectors; logical vectors are recycled to the -appropriate length. +appropriate length. **NOTE**, use `drop=FALSE` explicitly for single column subsetting if you want to return a SQLDataFrame object, otherwise, the default @@ -1169,7 +1184,7 @@ geneids[c(1, 5, 9), ] ``` List style subsetting is also allowed to extract certain columns from -the SQLDataFrame object which returns SQLDataFrame by default. +the SQLDataFrame object which returns SQLDataFrame by default. ```{r} geneids[2] #> list style subsetting @@ -1208,7 +1223,7 @@ from different sources. Here we use a simple example to show that how to use the gene ids to match the transcription ids using `*_join` functions, and return a -table with matching positions for each transcripts. +table with matching positions for each transcripts. ```{r SQLDF_tx} dbname.tx <- dbfile(TxDb.Hsapiens.UCSC.hg38.knownGene) @@ -1251,7 +1266,7 @@ from `union()` or `rbind()`. So when the `saveSQLDataFrame()` function was called, a database table will be written into a physical disk space and have the unique records. -Accessor function is made avaible for this slot: +Accessor function is made avaible for this slot: ```{r} tblData(res) ``` @@ -1269,7 +1284,7 @@ dbname(res) dbtable(res) ``` -- dbnrows and dbconcatKey +- dbnrows and dbconcatKey The dbnrows slot saves the number of rows corresponding to the tblData, and dbconcatKey saves the realized (concatenated if multiple) @@ -1281,7 +1296,7 @@ dbnrows(res) head(dbconcatKey(res)) ``` -- indexes slot +- indexes slot The indexes slots is an unnamed list saving the row and column indexes respectively corresponding to the tblData, so that the SQLDataFrame