From 9cef2522923b869c6261db9d19bc5041577f58b1 Mon Sep 17 00:00:00 2001
From: Rob Patro
Date: Mon, 12 Nov 2018 10:31:11 -0500
Subject: [PATCH] Update 04.body.md

mostly messing up commas, but some small content edits.
---
 content/04.body.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/content/04.body.md b/content/04.body.md
index e56cb20..0a87464 100644
--- a/content/04.body.md
+++ b/content/04.body.md
@@ -2,7 +2,7 @@ The Human Cell Atlas (HCA) provides unprecedented characterization of molecular
 phenotypes across individuals,
 tissues and disease states --
 resolving differences to the level of
-individual cells. This dataset provides an extraordinary opportunity for scientific advancement enabled by new tools to rapidly query, characterize, and analyze these intrinsically
+individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically
 high-dimensional data. To facilitate this, our seed network proposes to compress HCA data
 into fewer dimensions that preserve the important attributes of the original high dimensional
 data and yield interpretable, searchable features. For transcriptomic data, compressing on the gene
@@ -33,7 +33,7 @@ specifically selected as a base enabling technnology because its error distribut
 naturally account for measurement-specific technical variation
 [@doi:10.1371/journal.pone.0078127] and its prior distributions for different feature
 quantifications or spatial information. For non-linear needs, neural networks with multiple
-layers, provide a complementary path to low-dimensional representations
+layers provide a complementary path to low-dimensional representations
 [@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data.
 
 We will make use of substantial progress that has already been made in both linear and non-linear techniques (e.g.,
@@ -50,10 +50,10 @@ Second, we will improve techniques for fast and accurate quantification. Existin
 for scRNA-seq data using tagged-end end protocols (e.g. 10x Chromium, drop-Seq, inDrop,
 etc.) do not account for reads mapping between multiple genes. This affects approximately
 15-25% of the reads generated in a typical experiment, reducing quantification accuracy, and
-leads to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address
+leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address
 this, we will build on our recently developed quantification method for tagged-end data
 that accounts for reads mapping to multiple genomic loci in a principled and consistent way
-[@doi:10.1101/335000] (PI Patro) and extend this into a production quality tool for
+[@doi:10.1101/335000] (PI Patro), and extend this into a production-quality tool for
 scRNA-Seq preprocessing. Our tool will support: 1. Exploration of alternative models for
 Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality
 control and filtering using the UMI-resolution graph. 3. Creation of a compressed and
@@ -68,7 +68,7 @@ tools produced will be fast, scalable, and memory-efficient by leveraging the av
 assets and expertises of the R/Bioconductor project (PIs Hicks & Love) as well as the
 broader HCA community.
 
-By using and extending our base enabling technologies we will provide three principle
+By using and extending our base enabling technologies, we will provide three principal
 tools and resources for the HCA. These include 1) software to enable fast and accurate
 search and annotation using low-dimensional representations of cellular features, 2) a
 versioned and annotated catalog of latent spaces corresponding to signatures of cell types,
@@ -80,7 +80,7 @@ HCA in general.
 
 *Rationale:* The HCA provides a reference atlas to human cell types, states, and the
 biological processes in which they engage. The utility of the reference therefore requires
-that one can easily compare references to each other or a new sample to the compendium of
+that one can easily compare references to each other, or a new sample to the compendium of
 reference samples. Low-dimensional representations, because they compress the space, provide
 the building blocks for search approaches that can be practically applied across very large
 datasets such as the HCA. *We propose to develop algorithms and software for efficient
@@ -106,7 +106,7 @@ It's a bit more specific than the rest of the paragraph -LAG]**
 
 Similarly to the approach by which comparisons to a reference genomes can identify specific
 differences in a genome of interest, we will use low-dimensional representations from latent
-spaces to define a reference transcriptome map (the HCA) and use this to quantify
+spaces to define a reference transcriptome map (the HCA), and use this to quantify
 differences in target transcriptome maps from new samples of interest. We will leverage
 common low-dimensional representations and cell-to-cell correlation structure both within
 and across transcriptome maps from Aim 2 to define this reference. Quantifying the
@@ -144,9 +144,9 @@ representations capture generalizable biological features. We have developed new
 learning methods to quantify the extent to which latent space representations from one
 set of training data are represented in another
 [@doi:10.1101/395004,@doi:10.1101/395947,@doi:10.1101/395947] (PIs Greene, Goff & Fertig).
-These provide a strong foundation to compare low-dimensional
-representations across different low-dimensional data representation technniques.
-[**<-- too much repitition here?**]
+These provide a strong foundation to compare different low-dimensional representations
+and techniques for learning and transferring knowledge between them [**<-- didn't understand
+what was here before too well, please make sure I didn't muck with the meaning too much.**]
 Generalizable representations should transfer across datasets of related biological
 contexts, while representations of noise will not. In addition, we have found that
 combining multiple representations can better capture biological processes across scales
@@ -202,7 +202,7 @@ course material on single cell gene expression profiling (PI Goff), 2) machine
 learning methods (PI Greene), 4) statistics for genomics (PIs Fertig and Hicks), 4) search
 and analysis in low-dimensional representations, and 5) tools developed by our group in
 response to this RFA.
-Second, the short course will train not only students, but instructors. Our one-on-one
+Second, the short course will train not only students, but also instructors. Our one-on-one
 approach to course projects will require a high instructor-to-student ratio. We will
 therefore recruit former participants of this class to return in subsequent years, first as
 teaching assistants, and later as module presenters. We have found that course alumni are