From 5d232e115fb09029974c01e826fb7aacb0e6707c Mon Sep 17 00:00:00 2001 From: Casey Greene Date: Mon, 12 Nov 2018 14:37:56 -0500 Subject: [PATCH] try to finesse the aim 1 shared space --- content/04.body.md | 308 ++++++++++++++++++++++----------------------- 1 file changed, 152 insertions(+), 156 deletions(-) diff --git a/content/04.body.md b/content/04.body.md index 0a87464..91052b6 100644 --- a/content/04.body.md +++ b/content/04.body.md @@ -1,220 +1,216 @@ ## Proposal Body (2000 words) -The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes -across individuals, tissues and disease states -- resolving differences to the level of -individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically -high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions +The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes +across individuals, tissues and disease states -- resolving differences to the level of +individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically +high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions that preserve the important attributes of the original high dimensional data and yield -interpretable, searchable features. For transcriptomic data, compressing on the gene -dimension is most attractive: it can be applied to single samples, and genes often provide +interpretable, searchable features. For transcriptomic data, compressing on the gene +dimension is most attractive: it can be applied to single samples, and genes often provide information about other co-regulated genes or cellular attributes. We hypothesize that building an ensemble of low dimensional representations across latent space methods will provide a -reduced dimensional space that captures biological sources of variability and is robust to measurement noise. Our seed network will -incorporate biologists and computer scientists from five leading academic institutions who will work together to create foundational technologies +reduced dimensional space that captures biological sources of variability and is robust to measurement noise. Our seed network will +incorporate biologists and computer scientists from five leading academic institutions who will work together to create foundational technologies and educational opportunities that promote effective interpretation of low dimensional representations of HCA data. We will continue our active collaborations with other members of the broader HCA network to integrate state of the art latent space tools, portals, and annotations to enable biological utilization of HCA data through latent spaces. ## Scientific Goals -We will create low-dimensional representations that provide search and catalog capabilities -for the HCA. Given both the scale of data, and the inherent complexity of biological -systems, we believe these approaches are crucial to the long term success of the HCA. Our -**__central hypothesis__** is that these approaches will enable faster algorithms while -reducing the influence of technical noise. We propose to advance **__base enabling +We will create low-dimensional representations that provide search and catalog capabilities +for the HCA. Given both the scale of data, and the inherent complexity of biological +systems, we believe these approaches are crucial to the long term success of the HCA. Our +**__central hypothesis__** is that these approaches will enable faster algorithms while +reducing the influence of technical noise. We propose to advance **__base enabling technologies__** for low-dimensional representations. -First, we will identify techniques that learn interpretable, biologically-aligned -representations. We will consider both linear and non-linear techniques as each may identify +First, we will identify techniques that learn interpretable, biologically-aligned +representations. We will consider both linear and non-linear techniques as each may identify distinct components of biological systems. For linear techniques, we rely on our Bayesian, non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004] -(PIs Fertig & Goff). This technique learns biologically relevant features across contexts -and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], -including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is -specifically selected as a base enabling technnology because its error distribution can -naturally account for measurement-specific technical variation -[@doi:10.1371/journal.pone.0078127] and its prior distributions for different feature -quantifications or spatial information. For non-linear needs, neural networks with multiple -layers provide a complementary path to low-dimensional representations -[@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We will -make use of substantial progress that has already been made in both linear and non-linear +(PIs Fertig & Goff). This technique learns biologically relevant features across contexts +and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], +including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is +specifically selected as a base enabling technnology because its error distribution can +naturally account for measurement-specific technical variation +[@doi:10.1371/journal.pone.0078127] and its prior distributions for different feature +quantifications or spatial information. For non-linear needs, neural networks with multiple +layers provide a complementary path to low-dimensional representations +[@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We will +make use of substantial progress that has already been made in both linear and non-linear techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891]). -and rigorously evaluate emerging methods into our search and catalog tools. We will extend -transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig) to -enable rapid integration, interpretation, and annotation of learned latent spaces. The +and rigorously evaluate emerging methods into our search and catalog tools. We will extend +transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig) to +enable rapid integration, interpretation, and annotation of learned latent spaces. The latent space team from the HCA collaborative networks RFA (including PIs Fertig, Goff, -Greene, and Patro) is establishing common definitions and requirements for latent spaces -for the HCA, as well as standardized output formats for low-dimensional representations from +Greene, and Patro) is establishing common definitions and requirements for latent spaces +for the HCA, as well as standardized output formats for low-dimensional representations from distinct classes of methods. -Second, we will improve techniques for fast and accurate quantification. Existing approaches -for scRNA-seq data using tagged-end end protocols (e.g. 10x Chromium, drop-Seq, inDrop, -etc.) do not account for reads mapping between multiple genes. This affects approximately -15-25% of the reads generated in a typical experiment, reducing quantification accuracy, and -leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address -this, we will build on our recently developed quantification method for tagged-end data that -accounts for reads mapping to multiple genomic loci in a principled and consistent way -[@doi:10.1101/335000] (PI Patro), and extend this into a production quality tool for -scRNA-Seq preprocessing. Our tool will support: 1. Exploration of alternative models for -Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality -control and filtering using the UMI-resolution graph. 3. Creation of a compressed and -indexible data structure for the UMI-resolution graph to enable direct access, query, and +Second, we will improve techniques for fast and accurate quantification. Existing approaches +for scRNA-seq data using tagged-end end protocols (e.g. 10x Chromium, drop-Seq, inDrop, +etc.) do not account for reads mapping between multiple genes. This affects approximately +15-25% of the reads generated in a typical experiment, reducing quantification accuracy, and +leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address +this, we will build on our recently developed quantification method for tagged-end data that +accounts for reads mapping to multiple genomic loci in a principled and consistent way +[@doi:10.1101/335000] (PI Patro), and extend this into a production quality tool for +scRNA-Seq preprocessing. Our tool will support: 1. Exploration of alternative models for +Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality +control and filtering using the UMI-resolution graph. 3. Creation of a compressed and +indexible data structure for the UMI-resolution graph to enable direct access, query, and fast search prior to secondary analysis. We will implement these base enabling technologies and methods for search, -analysis, and latent space transformations as freely available, open source software tools. -We will additionally develop platform-agnostic input and output data formats and standards -for latent space representations of the HCA data to maximize interoperability. The software -tools produced will be fast, scalable, and memory-efficient by leveraging the available -assets and expertises of the R/Bioconductor project (PIs Hicks & Love) as well as the +analysis, and latent space transformations as freely available, open source software tools. +We will additionally develop platform-agnostic input and output data formats and standards +for latent space representations of the HCA data to maximize interoperability. The software +tools produced will be fast, scalable, and memory-efficient by leveraging the available +assets and expertises of the R/Bioconductor project (PIs Hicks & Love) as well as the broader HCA community. -By using and extending our base enabling technologies, we will provide three principle -tools and resources for the HCA. These include 1) software to enable fast and accurate -search and annotation using low-dimensional representations of cellular features, 2) a -versioned and annotated catalog of latent spaces corresponding to signatures of cell types, -states, and biological attributes across the the HCA, and 3) short course and educational -materials that will increase the use and impact of low-dimensional representations and the +By using and extending our base enabling technologies, we will provide three principle +tools and resources for the HCA. These include 1) software to enable fast and accurate +search and annotation using low-dimensional representations of cellular features, 2) a +versioned and annotated catalog of latent spaces corresponding to signatures of cell types, +states, and biological attributes across the the HCA, and 3) short course and educational +materials that will increase the use and impact of low-dimensional representations and the HCA in general. ### Aim 1 -*Rationale:* The HCA provides a reference atlas to human cell types, states, and the -biological processes in which they engage. The utility of the reference therefore requires -that one can easily compare references to each other, or a new sample to the compendium of -reference samples. Low-dimensional representations, because they compress the space, provide -the building blocks for search approaches that can be practically applied across very large -datasets such as the HCA. *We propose to develop algorithms and software for efficient +*Rationale:* The HCA provides a reference atlas to human cell types, states, and the +biological processes in which they engage. The utility of the reference therefore requires +that one can easily compare references to each other, or a new sample to the compendium of +reference samples. Low-dimensional representations, because they compress the space, provide +the building blocks for search approaches that can be practically applied across very large +datasets such as the HCA. *We propose to develop algorithms and software for efficient search over the HCA using low-dimensional representations.* The primary approach to search in low-dimensional spaces is straightforward: one -must create an appropriate low-dimensional representation and identify distance functions -that enable biologically meaningful comparisons. Ideal low-dimensional representations are -predicted to be much faster to search, and potentially more biologically relevant, as noise -can be removed. In this aim, we will evaluate novel low-dimensional representations to -identify those with optimal qualities of compression, noise reduction, and retention of -biologically meangful features. Current scRNA-Seq approaches require investigators to -perform gene-level quantification on the entirety of a new sample. We aim to enable search -during sample preprocessing, prior to gene-level quantification. This will enable in-line -annotation of cell types and states and identification of novel features as samples are -being processed. We will implement and evaluate techniques to learn and transfer shared -low-dimensional representations between the UMI-resolution graph and quantified samples, so -that samples where either component is available can be used for search and annotation -**[CASEY ADD SHARED LATENT SPACE REF]**. These UMI-graphs will be embedded in the prior of -scCoGAPS and architecture of non-linear latent space techniques. **[Do we need this line? -It's a bit more specific than the rest of the paragraph -LAG]** -**[I think we need something to link in how this fits to the latent space methods -- maybe not so specific, but something that ties it back beyond preprocessing - EJF]** - -Similarly to the approach by which comparisons to a reference genomes can identify specific -differences in a genome of interest, we will use low-dimensional representations from latent -spaces to define a reference transcriptome map (the HCA), and use this to quantify -differences in target transcriptome maps from new samples of interest. We will leverage -common low-dimensional representations and cell-to-cell correlation structure both within -and across transcriptome maps from Aim 2 to define this reference. Quantifying the -differences between samples characterized at the single-cell level reveals population or -individual level differences. +must create an appropriate low-dimensional representation and identify distance functions +that enable biologically meaningful comparisons. Ideal low-dimensional representations are +predicted to be much faster to search, and potentially more biologically relevant, as noise +can be removed. In this aim, we will evaluate novel low-dimensional representations to +identify those with optimal qualities of compression, noise reduction, and retention of +biologically meaningful features. Current scRNA-Seq approaches require investigators to +perform gene-level quantification on the entirety of a new sample. We aim to enable search +during sample preprocessing, prior to gene-level quantification. This will enable in-line +annotation of cell types and states and identification of novel features as samples are +being processed. We will implement and evaluate techniques to learn and transfer shared +low-dimensional representations between read-based data (e.g., kmer representations) and quantified samples, so +that samples where either quantified or read data is available can be used for search and annotation +[@url:https://github.com/greenelab/shared-latent-space]. + +Similarly to the approach by which comparisons to a reference genomes can identify specific +differences in a genome of interest, we will use low-dimensional representations from latent +spaces to define a reference transcriptome map (the HCA), and use this to quantify +differences in target transcriptome maps from new samples of interest. We will leverage +common low-dimensional representations and cell-to-cell correlation structure both within +and across transcriptome maps from Aim 2 to define this reference. Quantifying the +differences between samples characterized at the single-cell level reveals population or +individual level differences. **[<-- I'm not sure what this sentence means. Please clarify. - LAG]** **[My take is that it means if we have an average from the catalogue we've built for a cell type or state, that deviations in particular samples could yield context-specific differences, not sure how to reword - EJF]** -Comparison of scRNA-seq maps from individuals with a particular phenotype -to the HCA reference, which is computationally infeasible from the large scale of HCA data, +Comparison of scRNA-seq maps from individuals with a particular phenotype +to the HCA reference, which is computationally infeasible from the large scale of HCA data, becomes tractable in these low dimensional spaces. We (PI Hicks) have extensive experience dealing with the distributions of cell expression within and between individuals [@pmid:26040460], which will be critical for defining an appropriate -metric to compare references in latent spaces. We plan to implement and evaluate -linear mixed models to account for the correlation structure within and between -transcriptome maps. This statistical method will be fast, memory-efficient and will +metric to compare references in latent spaces. We plan to implement and evaluate +linear mixed models to account for the correlation structure within and between +transcriptome maps. This statistical method will be fast, memory-efficient and will be scalable to billions of cells using low-dimensional representations. ### Aim 2 *Rationale:* Biological systems are comprised of diverse cell types and states with overlapping molecular phenotypes. Furthermore, biological processes are often reused with -modifications across cell types. Low-dimensional representations can identify these shared -features, independent of total distance between cells in gene expression space, across large -collections of data including the HCA. We will evaluate and select methods that define -latent spaces that reflect discrete biological processes or cellular features. These latent -spaces can be shared across different biological systems and can reveal context-specific -divergence such as pathogenic differences in disease. *We propose to establish a central -catalog of cell types, states, and biological processes derived from low-dimensional +modifications across cell types. Low-dimensional representations can identify these shared +features, independent of total distance between cells in gene expression space, across large +collections of data including the HCA. We will evaluate and select methods that define +latent spaces that reflect discrete biological processes or cellular features. These latent +spaces can be shared across different biological systems and can reveal context-specific +divergence such as pathogenic differences in disease. *We propose to establish a central +catalog of cell types, states, and biological processes derived from low-dimensional representations of the HCA.* -Establishing a catalog of cellular features using low-dimensional representations can -reduce noise and aid in biological interpretability. However, there are currently no -standardized, quantitative metrics to determine the extent to which low-dimensional -representations capture generalizable biological features. We have developed new transfer +Establishing a catalog of cellular features using low-dimensional representations can +reduce noise and aid in biological interpretability. However, there are currently no +standardized, quantitative metrics to determine the extent to which low-dimensional +representations capture generalizable biological features. We have developed new transfer learning methods to quantify the extent to which latent space representations from one -set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947,@doi:10.1101/395947] +set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947,@doi:10.1101/395947] (PIs Greene, Goff & Fertig). These provide a strong foundation to compare different low-dimensional representations and techniques for learning and transferring knowledge between them [**<-- didn't understand what was here before too well, please make sure I didn't muck with the meaning too much.**] -Generalizable representations should transfer across datasets of related biological -contexts, while representations of noise will not. In addition, we have found that combining -multiple representations can better capture biological processes across scales -[@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, -valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will -establish a catalog consisting of low-dimensional features learned across both +Generalizable representations should transfer across datasets of related biological +contexts, while representations of noise will not. In addition, we have found that combining +multiple representations can better capture biological processes across scales +[@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, +valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will +establish a catalog consisting of low-dimensional features learned across both linear and non-linear methods from our base enabling technologies and proposed extensions in Aim 1. -We will package and version low-dimensional representations and annotate these -representations based on their corresponding celluar features (e.g. cell type, tissue, -biological process) and deliver these as structured data objects in Bioconductor as well as -platform-agnostic data formats. Where applicable, we will leverage the computational tools -previously developed by Bioconductor for single-cell data access to the HCA, data -representation (`SingleCellExperiment`, `beachmat`, `LinearEmbeddingMatrix`, `DelayedArray`, -`HDF5Array` and `rhdf5`) and data assessment and amelioration of data quality (`scater`, -`scran`, `DropletUtils`). We are core package developers and -power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of -these materials via the *AnnotationHub* framework. To enable reproducible research -leveraging HCA, we will implement a content-based versioning system, -which identifies versions of the reference cell type catalog by the gene weights and transcript nucleotide +We will package and version low-dimensional representations and annotate these +representations based on their corresponding celluar features (e.g. cell type, tissue, +biological process) and deliver these as structured data objects in Bioconductor as well as +platform-agnostic data formats. Where applicable, we will leverage the computational tools +previously developed by Bioconductor for single-cell data access to the HCA, data +representation (`SingleCellExperiment`, `beachmat`, `LinearEmbeddingMatrix`, `DelayedArray`, +`HDF5Array` and `rhdf5`) and data assessment and amelioration of data quality (`scater`, +`scran`, `DropletUtils`). We are core package developers and +power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of +these materials via the *AnnotationHub* framework. To enable reproducible research +leveraging HCA, we will implement a content-based versioning system, +which identifies versions of the reference cell type catalog by the gene weights and transcript nucleotide sequences using a hash function. We (PIs Love and Patro) previously developed hash-based versioning and provenance -detection framework for bulk RNA-seq that supports reproducible -computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. +detection framework for bulk RNA-seq that supports reproducible +computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. Our versioning and dissemination of reference cell type catalogs -will help to avoid scenarios where researchers report on matches to a certain cell type in -HCA without precisely defining which definition of that cell type. We will develop -*F1000Research* workflows demonstrating how HCA-defined reference cell types and tools -developed in this RFA can be used within a typical genomic data analysis. This catalogue -will be used as the basis of defining the references for cell type and state, or +will help to avoid scenarios where researchers report on matches to a certain cell type in +HCA without precisely defining which definition of that cell type. We will develop +*F1000Research* workflows demonstrating how HCA-defined reference cell types and tools +developed in this RFA can be used within a typical genomic data analysis. This catalogue +will be used as the basis of defining the references for cell type and state, or individual-specific differences with the linear models proposed in Aim 1. ### Aim 3 -*Rationale:* Low-dimensional representations of scRNA-seq and HCA data make tasks faster and -provide interpretable summaries of complex high-dimensional cellular features. The HCA -data-associated methods and workflows will be valuable to many biomedical fields, but their -use will require an understanding of basic bioinformatics, scRNA-Seq, and how the tools -being developed work. Furthermore, researchers will need exposure to the conceptual basis of -low-dimensional interpretations of biological systems. This aim addresses these needs in +*Rationale:* Low-dimensional representations of scRNA-seq and HCA data make tasks faster and +provide interpretable summaries of complex high-dimensional cellular features. The HCA +data-associated methods and workflows will be valuable to many biomedical fields, but their +use will require an understanding of basic bioinformatics, scRNA-Seq, and how the tools +being developed work. Furthermore, researchers will need exposure to the conceptual basis of +low-dimensional interpretations of biological systems. This aim addresses these needs in three ways. -First, we will develop a bioinformatic training program for biologists at all levels, -including those with no experience in bioinformatics. Lecture materials will be extended -from existing materials from previous bioinformatic courses we (PI Hampton) have run at -Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and Geisel -School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists -in basic bioinformatics and always achieve approval ratings of over 90%. We believe part of -the success of these learning experiences has to do with our instructional paradigm, which -includes a very challenging course project coupled with one-on-one support from instructors. -We will develop a new curriculum specifically tailored to HCA that incorporates: 1) didactic -course material on single cell gene expression profiling (PI Goff), 2) -machine learning methods (PI Greene), 4) statistics for genomics (PIs Fertig and Hicks), 4) search and analysis in low-dimensional +First, we will develop a bioinformatic training program for biologists at all levels, +including those with no experience in bioinformatics. Lecture materials will be extended +from existing materials from previous bioinformatic courses we (PI Hampton) have run at +Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and Geisel +School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists +in basic bioinformatics and always achieve approval ratings of over 90%. We believe part of +the success of these learning experiences has to do with our instructional paradigm, which +includes a very challenging course project coupled with one-on-one support from instructors. +We will develop a new curriculum specifically tailored to HCA that incorporates: 1) didactic +course material on single cell gene expression profiling (PI Goff), 2) +machine learning methods (PI Greene), 4) statistics for genomics (PIs Fertig and Hicks), 4) search and analysis in low-dimensional representations, and 5) tools developed by our group in response to this RFA. -Second, the short course will train not only students, but also instructors. Our one-on-one -approach to course projects will require a high instructor-to-student ratio. We will -therefore recruit former participants of this class to return in subsequent years, first as -teaching assistants, and later as module presenters. We have found that course alumni are -eager to improve their teaching resumes, that they learn the material in a new way as they -begin to teach it, and that they are an invaluable resource in understanding how to improve -the course over time. Part of our strategy is to support this community, which includes many -people who will drive the next wave of innovation. All of our course materials will be -freely available, enabling course participants to bring what they learned home with them. A -capstone session will be included in which we will provide suggestions about how the -materials presented in the course can be incorporated into existing course curricula. Course -faculty will be available to assist with integration effort after the course. Finally, the short course will facilitate scientific collaborations +Second, the short course will train not only students, but also instructors. Our one-on-one +approach to course projects will require a high instructor-to-student ratio. We will +therefore recruit former participants of this class to return in subsequent years, first as +teaching assistants, and later as module presenters. We have found that course alumni are +eager to improve their teaching resumes, that they learn the material in a new way as they +begin to teach it, and that they are an invaluable resource in understanding how to improve +the course over time. Part of our strategy is to support this community, which includes many +people who will drive the next wave of innovation. All of our course materials will be +freely available, enabling course participants to bring what they learned home with them. A +capstone session will be included in which we will provide suggestions about how the +materials presented in the course can be incorporated into existing course curricula. Course +faculty will be available to assist with integration effort after the course. Finally, the short course will facilitate scientific collaborations by engaging participants in utilizing these tools for collaborative research efforts. **[I feel like we are missing a concluding summary of broader impacts to pull this together - could be a brief bulleted summary of tools required by app as Andrew suggested - EJF]** -