diff --git a/.gitignore b/.gitignore index a88db7d..1c4e580 100644 --- a/.gitignore +++ b/.gitignore @@ -23,7 +23,8 @@ __pycache__/ .Rproj.user # Don't make budgets public -budgets/ +budgets/* +submission/budgets/* # Operating system specific files @@ -39,3 +40,4 @@ budgets/ ## Windows Thumbs.db [Dd]esktop.ini + diff --git a/biosketches/BioSketchHampton_CZI_2018a.docx b/biosketches/BioSketchHampton_CZI_2018a.docx new file mode 100644 index 0000000..99c2684 Binary files /dev/null and b/biosketches/BioSketchHampton_CZI_2018a.docx differ diff --git a/biosketches/greene-biosketch-updated.docx b/biosketches/greene-biosketch-updated.docx new file mode 100644 index 0000000..f07948b Binary files /dev/null and b/biosketches/greene-biosketch-updated.docx differ diff --git a/biosketches/patro_biosketch.docx b/biosketches/patro_biosketch.docx new file mode 100644 index 0000000..c060e20 Binary files /dev/null and b/biosketches/patro_biosketch.docx differ diff --git a/content/01.abstract.md b/content/01.abstract.md index c6ebbc4..8a2b4df 100644 --- a/content/01.abstract.md +++ b/content/01.abstract.md @@ -1,3 +1,34 @@ ## Abstract {.page_break_before} -**Instructions**: Describe your collaborative project, highlighting key achievements of the project; limited to 250 words. +**Instructions**: Describe your collaborative project, highlighting +key achievements of the project; limited to 250 words. + +The HCA provides a reference atlas to human cell types, states, and +the biological processes in which they engage. The utility of the +reference therefore requires that one can easily compare references to +each other, or a new sample to the compendium of reference +samples. Low-dimensional representations, because they compress the +space, provide the building blocks for search approaches that can be +practically applied across very large datasets such as the HCA. 
+Our seed network proposes to compress HCA data +into fewer dimensions that preserve the important attributes of the +original high dimensional data and yield interpretable, searchable +features. +We hypothesize that building an ensemble of low +dimensional representations across latent space methods will provide a +reduced dimensional space that captures biological sources of +variability and is robust to measurement noise. +We will identify techniques that learn interpretable, +biologically-aligned representations, improve techniques for fast and +accurate quantification, and implement these base enabling +technologies and methods for search, analysis, and latent space +transformations as freely available, open source software tools. +By using and extending our base enabling technologies, we will provide +three principal tools and resources for the HCA: +1) software to enable fast and accurate search and annotation using +low-dimensional representations of cellular features, +2) a versioned and annotated catalog of latent spaces corresponding to +signatures of cell types, states, and biological attributes across the +HCA, and +3) short course and educational materials that will increase the use +and impact of low-dimensional representations and the HCA in general. diff --git a/content/02.fiverefs.md b/content/02.fiverefs.md index 0128ecb..02eef36 100644 --- a/content/02.fiverefs.md +++ b/content/02.fiverefs.md @@ -1,5 +1,6 @@ ## Five Key References * Hicks refs: [@doi:10.1093/biostatistics/kxx053] -* projectR & scCoGAPS: [@doi:10.1101/395004] +* ProjectR & scCoGAPS: [@doi:10.1101/395004] * Alevin: [@doi:10.1101/335000] +* Developing Mouse Retina: [@doi:10.1101/378950] diff --git a/content/03.pilist.md b/content/03.pilist.md index 8d62cdc..20fb917 100644 --- a/content/03.pilist.md +++ b/content/03.pilist.md @@ -34,7 +34,7 @@ * Tax ID: 23-1352685 (UPenn) * Email: greenescientist@gmail.com -5. Tom Hampton +5.
Thomas Hampton * Title: Senior Bioinformatics Analyst * Degrees: PhD @@ -64,6 +64,6 @@ 2. Stephanie C. Hicks is an Assistant Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. She is an expert in statistical methodology with a strong track record in processing and analyzing single-cell genomics data, including extensive experience developing fast, memory-efficient R/Bioconductor software to remove systematic and technical biases from scRNA-seq data [@doi:10.1093/biostatistics/kxx053]. Dr. Hicks will work together with Co-PIs to implement fast search algorithms in latent spaces (Aim 1) and to implement the methods developed into fast, scalable, and memory-efficient R/Bioconductor software packages (Aim 3). 3. Elana Fertig is an Associate Professor of Oncology and Applied Mathematics and Statistics at Johns Hopkins University. She developed of the Bayesian non-negative matrix factorization algorithm CoGAPS [@doi:10.1093/bioinformatics/btq503] for latent space analysis. In collaboration with co-PI Goff, she adapted this tool to scRNA-seq data and developed a new transfer learning framework to relate the low-dimensional features in scRNA-seq data across data modalities, biological conditions, and organisms [@doi:10.1101/395004]. Dr. Fertig will work with the co-PIs to incorporate the error models from Aim 1 into the latent space representations, dimensionality estimation, and biological assessment metrics in Aim 2. She is developing standardized language for latent space representation in collaboration with co-PIs Goff and Greene [@doi:10.1016/j.tig.2018.07.003] that will provide a strong foundation for standardization of these approaches across different unsupervised learning tools. 4. Casey Greene -5. Tom Hampton +5. Tom Hampton is Director of Bioinformatic Training for two program projects at the Geisel School of Medicine at Dartmouth. 
In that role, he has a long-standing collaboration with co-PI Casey Greene, with whom he has developed short courses taught at Mount Desert Island Biological Laboratory and at Dartmouth. Dr. Hampton’s bioinformatic research focuses on using data from multiple independent studies to identify concordant patterns of gene expression in response to stressors such as infection and environmental stress. 6. Michael Love is an Assistant Professor of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. He is a leading developer of statistical software for RNA-seq analysis in the Bioconductor Project, maintaining the widely used DESeq2 [@doi:10.1186/s13059-014-0550-8] and tximport [@doi:10.12688/f1000research.7563.1] packages. He is a close collaborator with Dr. Rob Patro on bias-aware estimation of transcript abundance from RNA-seq and estimation of uncertainty during transcript quantification [@doi:10.1038/nmeth.4197]. Dr. Love will work with co-PIs to disseminate versioned reference cell type catalogs through widely used frameworks for genomic data analysis including R/Bioconductor and Python. -7. Rob Patro +7. Rob Patro is an Assistant Professor of Computer Science at Stony Brook University. He leads the COMBINE-lab, which [develops and maintains numerous open-source genomics tools and methods](https://github.com/COMBINE-lab). He is the primary developer of the popular transcript quantification tools Sailfish [@doi:10.1038/nbt.2862] and Salmon [@doi:10.1038/nmeth.4197], having collaborated closely with Dr. Love on the latter. Dr. Love and he are actively collaborating on improved methods for transcript quantification and differential testing, and on reproducible analysis via [tximeta](https://github.com/mikelove/tximeta) [@doi:10.18129/B9.bioc.tximeta].
He has recently focused on developing improved methods for gene-level quantification from tagged-end single-cell RNA-seq data, as implemented in the tool alevin [@doi:10.1101/335000]. He will work with co-PIs to develop improved single-cell quantification tools that account for gene-ambiguous reads and provide quantification uncertainty estimates (base enabling technologies) --- which is important for accurate and robust creation of reduced-dimensionality representations. He will work with the co-PIs to develop efficient algorithms and data structures to enable efficient expression and sample search over low-dimensional representations of HCA data (Aim 1). diff --git a/content/04.body.md b/content/04.body.md index b3bd3f8..0a87464 100644 --- a/content/04.body.md +++ b/content/04.body.md @@ -1,218 +1,220 @@ - ## Proposal Body (2000 words) -The Human Cell Atlas (HCA) provides unprecedented characterization of the -molecular states of each cell across tissues, organisms, and individuals. -Computational techniques that provide the ability to rapidly query, -characterize, and analyze this atlas will accelerate the pace of discovery in -biomedicine. HCA data are high dimensional, but they can often be compressed -into fewer dimensions without a substantial loss of information while yielding -interpretable features. For transcriptomic data, compressing on the gene -dimension is most attractive: it can be applied to single samples, and genes -often provide information about other co-regulated genes. In the best case, the -reduced dimensional space captures biological sources of variability while -ignoring noise and each dimension aligns to interpretable biological processes. - -Our seed network aims to create low-dimensional representations that provide -search and catalog capabilities for the HCA. The benefit of these approaches -will become particularly pronounced as the number of cells and tissues becomes -large. 
Our **__central hypothesis__** is that these approaches will enable -faster algorithms while reducing the influence of technical noise. We propose to -advance **__base enabling technologies__** for low-dimensional representations. -We also propose three aims: 1) fast and accurate search for cell, samples, and -pathways; 2) a catalog of cell types and biological processes in low-dimensional -spaces; and 3) educational materials to increase the impact of low-dimensional -representations and the HCA in general. - -*The first goal of our base enabling technology work* is to identify techniques -that learn interpretable, biologically-aligned representations. We consider both -linear and non-linear techniques. For linear techniques, we rely on our Bayesian, -non-negative matrix factorization method scCoGAPS -[@doi:10.1101/378950,@doi:10.1101/395004] (PI Fertig). This technique learns -biologically relevant features across contexts and data modalities -[@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], -including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. -This technique is specifically selected as a base enabling technnology -because its error distribution can naturally account for measurement-specific -technical variation [@doi:10.1371/journal.pone.0078127] and its prior distributions -for different feature quantifications or spatial information. -For non-linear needs, neural networks with multiple layers, provide a -complementary path to low-dimensional representations [@doi:10.1101/385534] (PI -Greene) that model these diverse features of HCA data. We note that many groups are working in this area for both linear and -non-linear techniques (e.g., +The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes +across individuals, tissues and disease states -- resolving differences to the level of +individual cells. 
This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically +high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions +that preserve the important attributes of the original high dimensional data and yield +interpretable, searchable features. For transcriptomic data, compressing on the gene +dimension is most attractive: it can be applied to single samples, and genes often provide +information about other co-regulated genes or cellular attributes. We hypothesize that building an ensemble of low dimensional representations across latent space methods will provide a +reduced dimensional space that captures biological sources of variability and is robust to measurement noise. Our seed network will +incorporate biologists and computer scientists from five leading academic institutions who will work together to create foundational technologies +and educational opportunities that promote effective interpretation of low dimensional representations of HCA data. We will continue our active collaborations with other +members of the broader HCA network to integrate state of the art latent space tools, portals, and annotations to enable biological utilization of HCA data through latent spaces. + +## Scientific Goals + +We will create low-dimensional representations that provide search and catalog capabilities +for the HCA. Given both the scale of data, and the inherent complexity of biological +systems, we believe these approaches are crucial to the long term success of the HCA. Our +**__central hypothesis__** is that these approaches will enable faster algorithms while +reducing the influence of technical noise. We propose to advance **__base enabling +technologies__** for low-dimensional representations. + +First, we will identify techniques that learn interpretable, biologically-aligned +representations. 
We will consider both linear and non-linear techniques as each may identify +distinct components of biological systems. For linear techniques, we rely on our Bayesian, +non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004] +(PIs Fertig & Goff). This technique learns biologically relevant features across contexts +and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], +including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is +specifically selected as a base enabling technology because its error distribution can +naturally account for measurement-specific technical variation +[@doi:10.1371/journal.pone.0078127] and its prior distributions can accommodate different feature +quantifications or spatial information. For non-linear needs, neural networks with multiple +layers provide a complementary path to low-dimensional representations +[@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We will +make use of substantial progress that has already been made in both linear and non-linear +techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891]). -Because of the substantial number of groups developing neural network based -methods, we do not currently plan additional efforts on methods development beyond scCoGAPS. However, we -will continue to use and rigorously evaluate these methods. We will incorporate -the best performing methods into our search and catalog tools. The latent space -team from the HCA collaborative networks RFA (including PIs Fertig, Goff, -Greene, and Patro) is defining common output formats for low-dimensional -representations from distinct classes of methods.
- -The *second part of our work on base enabling technologies* is the improvement -of techniques for fast and accurate quantification. Existing approaches for -quantification from scRNA-seq data using tagged-end end protocols (e.g. 10x -Chromium, drop-Seq, inDrop, etc.) have no mechanism for accounting for reads -mapping between multiple genes in the resulting quantification estimates. This -affects approximately 15-25% of the reads in a typical experiment. It reduces -quantification accuracy, and leads to systematic biases in gene expression -estimates that correlate with the size of gene families and gene function -[@doi:10.1101/335000]. We recently developed a quantification method for -tagged-end data that accounts for reads mapping to multiple genomic loci in a -principled and consistent way [**CITE?**]. We will expand on this work by, -building these capabilities into a production quality tool for the processing of -scRNA-seq data. The tool will support: 1. Exploring alternative models for UMI -resolution. 2. Developing new approaches for quality control and filtering using -the UMI-resolution graph. 3. Creating a compressed and indexible data structure -for the UMI-resolution graph to enable direct access, query, and fast search. - -We will implement the base enabling technologies and methods for search, -analysis, and transformation into R/Bioconductor and Python frameworks. The -python and R software will use common input and output formats. The software -will be fast, scalable, and memory-efficient because will leverage the -computational tools previously developed by Bioconductor for single-cell data -access to the HCA, data representation (`SingleCellExperiment`, `beachmat`, -`DelayedArray`, `HDF5Array` and `rhdf5`) and data assessment and amelioration of -data quality (`scater`, `scran`, `DropletUtils`). +We will also rigorously evaluate emerging methods and incorporate the best performing ones into our search and catalog tools.
We will extend +transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig) to +enable rapid integration, interpretation, and annotation of learned latent spaces. The +latent space team from the HCA collaborative networks RFA (including PIs Fertig, Goff, +Greene, and Patro) is establishing common definitions and requirements for latent spaces +for the HCA, as well as standardized output formats for low-dimensional representations from +distinct classes of methods. + +Second, we will improve techniques for fast and accurate quantification. Existing approaches +for scRNA-seq data using tagged-end protocols (e.g. 10x Chromium, Drop-seq, inDrop, +etc.) do not account for reads mapping to multiple genes. This affects approximately +15-25% of the reads generated in a typical experiment, reducing quantification accuracy and +leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address +this, we will build on our recently developed quantification method for tagged-end data that +accounts for reads mapping to multiple genomic loci in a principled and consistent way +[@doi:10.1101/335000] (PI Patro), and extend this into a production quality tool for +scRNA-Seq preprocessing. Our tool will support: 1. Exploration of alternative models for +Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality +control and filtering using the UMI-resolution graph. 3. Creation of a compressed and +indexable data structure for the UMI-resolution graph to enable direct access, query, and +fast search prior to secondary analysis. + +We will implement these base enabling technologies and methods for search, +analysis, and latent space transformations as freely available, open source software tools. +We will additionally develop platform-agnostic input and output data formats and standards +for latent space representations of the HCA data to maximize interoperability.
The software +tools produced will be fast, scalable, and memory-efficient by leveraging the available +assets and expertise of the R/Bioconductor project (PIs Hicks & Love) as well as the +broader HCA community. + +By using and extending our base enabling technologies, we will provide three principal +tools and resources for the HCA. These include 1) software to enable fast and accurate +search and annotation using low-dimensional representations of cellular features, 2) a +versioned and annotated catalog of latent spaces corresponding to signatures of cell types, +states, and biological attributes across the HCA, and 3) short course and educational +materials that will increase the use and impact of low-dimensional representations and the +HCA in general. ### Aim 1 -*Rationale:* The HCA provides a reference atlas to human cells, cell types, and -the pathways that they express. Scientists will benefit most from the HCA when -they can quickly identify find cells and cell types and compare references to -find differences. Low-dimensional representations, because they compress the -space, provide the building blocks for search approaches that can be practically -applied across very large datasets such as the HCA. *We propose to develop -algorithms and software for efficient search over the HCA using low-dimensional -representations.* +*Rationale:* The HCA provides a reference atlas to human cell types, states, and the +biological processes in which they engage. The utility of the reference therefore requires +that one can easily compare references to each other, or a new sample to the compendium of +reference samples. Low-dimensional representations, because they compress the space, provide +the building blocks for search approaches that can be practically applied across very large +datasets such as the HCA.
*We propose to develop algorithms and software for efficient +search over the HCA using low-dimensional representations.* The primary approach to search in low-dimensional spaces is straightforward: one -must create an appropriate low-dimensional representation and identify a -distance function or functions that match what biologists seek. Using the -low-dimensional representation improves speed and can also reduce noise. We will -evaluate representations for their ability to support search and implement the -best performing approach. However, the most obvious approaches require -investigators to perform quantification on the entirety of a new sample and -select cells or cell types that they wish to search for. We also aim to enable -search even before investigators complete quantification. This will allow -software to identify similar tissues or identify cells that are unusual as data -are being collected. We will implement and evaluate techniques to learn shared -low-dimensional representations between the UMI-resolution graph and quantified -samples, so that samples where either component is available can be used for -search **[CASEY ADD SHARED LATENT SPACE REF]**. -These UMI-graphs will be embedded in the prior of scCoGAPS and architecture of non-linear latent space techniques. - -Reference genomes allow scientists to identify specific differences between the -reference and genomes of interest. We will use these representations to quantify -differences between a reference transcriptome map (the HCA) and target -transcriptome maps from samples of interest. -**Elana: I find this confusing -- are we referring to reference genome builds or references in low dimensional space? Need to clarify.** -We will leverage common -low-dimensional representations and cell-to-cell correlation structure both -within and across transcriptome maps. 
Quantifying the differences between -samples characterized at the single-cell level reveals population or individual -level differences. One could compare ten scRNA-seq maps from individuals with a -particular phenotype to the HCA reference. We (PI Hicks) have extensive +must create an appropriate low-dimensional representation and identify distance functions +that enable biologically meaningful comparisons. Ideal low-dimensional representations are +predicted to be much faster to search, and potentially more biologically relevant, as noise +can be removed. In this aim, we will evaluate novel low-dimensional representations to +identify those with optimal qualities of compression, noise reduction, and retention of +biologically meaningful features. Current scRNA-Seq approaches require investigators to +perform gene-level quantification on the entirety of a new sample. We aim to enable search +during sample preprocessing, prior to gene-level quantification. This will enable in-line +annotation of cell types and states and identification of novel features as samples are +being processed. We will implement and evaluate techniques to learn and transfer shared +low-dimensional representations between the UMI-resolution graph and quantified samples, so +that samples where either component is available can be used for search and annotation +**[CASEY ADD SHARED LATENT SPACE REF]**. These UMI-graphs will be embedded in the prior of +scCoGAPS and the architecture of non-linear latent space techniques. **[Do we need this line?
+It's a bit more specific than the rest of the paragraph -LAG]** +**[I think we need something to link in how this fits to the latent space methods -- maybe not so specific, but something that ties it back beyond preprocessing - EJF]** + +Just as comparison to a reference genome can identify specific +differences in a genome of interest, we will use low-dimensional representations from latent +spaces to define a reference transcriptome map (the HCA), and use this to quantify +differences in target transcriptome maps from new samples of interest. We will leverage +common low-dimensional representations and cell-to-cell correlation structure both within +and across transcriptome maps from Aim 2 to define this reference. Quantifying the +differences between samples characterized at the single-cell level reveals population or +individual level differences. +**[<-- I'm not sure what this sentence means. Please clarify. - LAG]** +**[My take is that it means if we have an average from the catalogue we've built for a cell type or state, that deviations in particular samples could yield context-specific differences, not sure how to reword - EJF]** +Comparison of scRNA-seq maps from individuals with a particular phenotype +to the HCA reference, which is computationally infeasible at the full scale of HCA data, +becomes tractable in these low dimensional spaces. We (PI Hicks) have extensive experience dealing with the distributions of cell expression within and between individuals [@pmid:26040460], which will be critical for defining an appropriate -metric. We plan to implement and evaluate linear mixed models to account for the -correlation structure within and between transcriptome maps. This statistical -method will be fast, memory-efficient and will scale to billions of cells -because we will use low-dimensional representations. -**Elana: these linear mixed models seem to go away from base enabling technologies.
I think this would read better edited to incorporate these -distributions in the architecture of nonlinear-methods or in the prior of scCoGAPS to reflect better integration. How are these reflecting latent spaces?** +metric to compare references in latent spaces. We plan to implement and evaluate +linear mixed models to account for the correlation structure within and between +transcriptome maps. This statistical method will be fast, memory-efficient and will +be scalable to billions of cells using low-dimensional representations. ### Aim 2 -*Rationale:* Biological systems are comprised of diverse cell types with -overlapping molecular phenotypes and biological processes are often reused with -modifications across cell types. Low-dimensional representations can reveal -these fundamental mechanisms across large collections of data including the HCA. -We are evaluating and selecting methods that define basis vectors that reflect -discrete biological processes or features. These basis vectors can be shared -across different biological systems and can reveal context-specific -perturbations such as pathogenic differences in disease. *We propose a central -catalog of cell types and biological processes derived from low-dimensional +*Rationale:* Biological systems are comprised of diverse cell types and states with +overlapping molecular phenotypes. Furthermore, biological processes are often reused with +modifications across cell types. Low-dimensional representations can identify these shared +features, independent of total distance between cells in gene expression space, across large +collections of data including the HCA. We will evaluate and select methods that define +latent spaces that reflect discrete biological processes or cellular features. These latent +spaces can be shared across different biological systems and can reveal context-specific +divergence such as pathogenic differences in disease. 
*We propose to establish a central +catalog of cell types, states, and biological processes derived from low-dimensional representations of the HCA.* -Basing a catalog of cell types and their corresponding processes off of multiple -low-dimensional representations can reduce noise and aid in biological -interpretability. However, there are currently no standardized, quantitative -metrics to determine the extent to which low-dimensional representations capture -generalizable biolobical features. We have developed new transfer learning -methods to quantify the extent to which latent space representations from one -set of training data are represented in another -[@doi:10.1101/395004,@doi:10.1101/395947]. These provide a strong foundation to -compare low-dimensional representations across different low dimensional data representation technniques. -Generalizable representations should -transfer across datasets of related biological contexts. -In addition, we have -found that combining multiple representations can better capture biological -processes across scales [@doi:10.1016/j.cels.2017.06.003], and that -representations across scales capture distinct, valid signatures -[@doi:10.1371/journal.pone.0078127]. -Therefore, we will form a catalogue from the set of low dimensional features learned -across linear and non-linear methods from our base enabling technologies and proposed +Establishing a catalog of cellular features using low-dimensional representations can +reduce noise and aid in biological interpretability. However, there are currently no +standardized, quantitative metrics to determine the extent to which low-dimensional +representations capture generalizable biological features. We have developed new transfer +learning methods to quantify the extent to which latent space representations from one +set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947] +(PIs Greene, Goff & Fertig).
+These provide a strong foundation to compare different low-dimensional representations +and techniques for learning and transferring knowledge between them [**<-- didn't understand +what was here before too well, please make sure I didn't muck with the meaning too much.**] +Generalizable representations should transfer across datasets of related biological +contexts, while representations of noise will not. In addition, we have found that combining +multiple representations can better capture biological processes across scales +[@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, +valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will +establish a catalog consisting of low-dimensional features learned across both +linear and non-linear methods from our base enabling technologies and proposed extensions in Aim 1. - -We will package and version reference cell types and their corresponding -low-dimensional representations and deliver these as structured data objects in -Bioconductor and Python. Such summaries and annotations have proven widely -successful for the ENCODE, Roadmap Epigenome Mapping, and GTEx projects. We are -core package developers and power users of Bioconductor (PIs Hicks and Love) and -will support on-the-fly downloading of these materials via the *AnnotationHub* -framework. To enable reproducible research leveraging HCA, we will implement a -content-based versioning system, which identifies versions of the reference cell -type catalog by the gene weights and transcript nucleotide sequences using a -hash function. We (PI Love) developed hash-based versioning and provenance -identification and detection framework for bulk RNA-seq that supports -reproducible computational analyses and has proven to be successful -[@doi:10.18129/B9.bioc.tximeta]. 
This will help to avoid scenarios where -researchers report on matches to a certain cell type in HCA without precisely -defining which definition of that cell type. We will develop *F1000Research* -workflows demonstrating how HCA-defined reference cell types and tools developed -in this RFA can be used within a typical genomic data analysis. +We will package and version low-dimensional representations and annotate these +representations based on their corresponding cellular features (e.g. cell type, tissue, +biological process) and deliver these as structured data objects in Bioconductor as well as +platform-agnostic data formats. Where applicable, we will leverage the computational tools +previously developed by Bioconductor for single-cell data access to the HCA, data +representation (`SingleCellExperiment`, `beachmat`, `LinearEmbeddingMatrix`, `DelayedArray`, +`HDF5Array` and `rhdf5`) and data assessment and amelioration of data quality (`scater`, +`scran`, `DropletUtils`). We are core package developers and +power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of +these materials via the *AnnotationHub* framework. To enable reproducible research +leveraging HCA, we will implement a content-based versioning system, +which identifies versions of the reference cell type catalog by the gene weights and transcript nucleotide +sequences using a hash function. We (PIs Love and Patro) previously developed a hash-based versioning and provenance +detection framework for bulk RNA-seq that supports reproducible +computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. +Our versioning and dissemination of reference cell type catalogs +will help to avoid scenarios where researchers report on matches to a certain cell type in +HCA without specifying precisely which definition of that cell type they used. 
We will develop +*F1000Research* workflows demonstrating how HCA-defined reference cell types and tools +developed in this RFA can be used within a typical genomic data analysis. This catalogue +will be used as the basis of defining the references for cell type and state, or +individual-specific differences with the linear models proposed in Aim 1. ### Aim 3 -*Rationale:* Low-dimensional representations for scRNA-seq and HCA data make -tasks faster and provide interpretable summaries of complex high-dimensional -data. The HCA data associated methods, will be valuable to many biomedical -fields, but their use will require experience with this new toolkit. A scalable -education effort that reaches students at and beyond undergraduate level will be -needed to prepare students and maximize impact. *We propose short-course -training for the HCA, single cell profiling, machine learning methods, -low-dimensional representations, and tools developed by our group in response to -this RFA.* - -Our educational program is based on a one-week short course that we (PI Hampton) -have run annually at Mount Desert Island Biological Lab over the last **X TOM -FILL IN** years. The course covers R, gene expression analysis, statistical -interpretation, and introduces machine learning (PI Greene). Attendees rate the -course well and report that they incorporate new knowledge into their research -and teaching. For this grant we will add topics centered on the HCA and increase -the frequency of the course. We will run the course at locations distributed -throughout the US and provide open course materials on GitHub to allow others to -replicate the course. 
New topics will include: - -- Comparison of Bulk and Single-cell Assays and Data -- The Human Cell Atlas Project -- scRNA-seq: Expression Quantification and Cell Type Discovery -- scRNA-seq: Low-dimensional Representations -- scRNA-seq: Search and Analysis in Low-dimensional Representations - -We aim to provide a force-multiplier for the HCA and low-dimensional methods as -course attendees transmit what they learn to tens of students each year at their -own institutions. We will run this course on a cost recovery model, but to -maximize the multiplier effect we budget at least *ten scholarships* per -offering to cover the room, board, and tuition of faculty who are primarily -engaged in undergraduate instruction. This will allow faculty who will -disseminate these materials in their own reaching to attend at very low cost. We -will develop a one-week module that can be added in to an undergraduate class on -single-cell profiling and the HCA, which we will distribute via GitHub. -Materials will include recorded videos (intended for a refresher for -instructors), slides, and exercises. We expect that this module will support -faculty who attend with an easy enhancement to any bioinformatics or -computational biology instruction that they are already providing at their -institution. +*Rationale:* Low-dimensional representations of scRNA-seq and HCA data make tasks faster and +provide interpretable summaries of complex high-dimensional cellular features. The HCA +data-associated methods and workflows will be valuable to many biomedical fields, but their +use will require an understanding of basic bioinformatics, scRNA-Seq, and how the tools +being developed work. Furthermore, researchers will need exposure to the conceptual basis of +low-dimensional interpretations of biological systems. This aim addresses these needs in +three ways. 
+ +First, we will develop a bioinformatic training program for biologists at all levels, +including those with no experience in bioinformatics. Lecture materials will be extended +from existing materials from previous bioinformatic courses we (PI Hampton) have run at +Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and Geisel +School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists +in basic bioinformatics and always achieve approval ratings of over 90%. We believe part of +the success of these learning experiences has to do with our instructional paradigm, which +includes a very challenging course project coupled with one-on-one support from instructors. +We will develop a new curriculum specifically tailored to HCA that incorporates: 1) didactic +course material on single cell gene expression profiling (PI Goff), 2) +machine learning methods (PI Greene), 3) statistics for genomics (PIs Fertig and Hicks), 4) search and analysis in low-dimensional +representations, and 5) tools developed by our group in response to this RFA. + +Second, the short course will train not only students, but also instructors. Our one-on-one +approach to course projects will require a high instructor-to-student ratio. We will +therefore recruit former participants of this class to return in subsequent years, first as +teaching assistants, and later as module presenters. We have found that course alumni are +eager to improve their teaching resumes, that they learn the material in a new way as they +begin to teach it, and that they are an invaluable resource in understanding how to improve +the course over time. Part of our strategy is to support this community, which includes many +people who will drive the next wave of innovation. All of our course materials will be +freely available, enabling course participants to bring what they learned home with them. 
A +capstone session will be included in which we will provide suggestions about how the +materials presented in the course can be incorporated into existing course curricula. Course +faculty will be available to assist with integration effort after the course. Finally, the short course will facilitate scientific collaborations +by engaging participants in utilizing these tools for collaborative research efforts. + +**[I feel like we are missing a concluding summary of broader impacts to pull this together - could be a brief bulleted summary of tools required by app as Andrew suggested - EJF]** + diff --git a/content/metadata.yaml b/content/metadata.yaml index 6f5a4b7..7b0d2f1 100644 --- a/content/metadata.yaml +++ b/content/metadata.yaml @@ -67,3 +67,12 @@ author_info: affiliations: - Department of Biostatistics, University of North Carolina at Chapel Hill - Department of Genetics, University of North Carolina at Chapel Hill + - + github: + name: Thomas H. Hampton + initials: THH + orcid: 0000-0003-0543-402X + twitter: + email: Thomas.H.Hampton@dartmouth.edu + affiliations: + - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth diff --git a/edits/body_text_f7af0a3.txt b/edits/body_text_f7af0a3.txt new file mode 100644 index 0000000..2733c06 --- /dev/null +++ b/edits/body_text_f7af0a3.txt @@ -0,0 +1,42 @@ +Proposal Body (2000 words) +The Human Cell Atlas (HCA) provides unprecedented characterization of the molecular phenotypes of each cell across tissues, organisms, and individuals. Computational techniques that provide the ability to rapidly query, characterize, and analyze this atlas will accelerate the pace of discovery in biomedicine. HCA data are high dimensional, but they can often be compressed into fewer dimensions without a substantial loss of information while yielding interpretable features. 
For transcriptomic data, compressing on the gene dimension is most attractive: it can be applied to single samples, and genes often provide information about other co-regulated genes or cellular attributes. In the best case, the reduced dimensional space captures biological sources of variability while ignoring noise and each dimension aligns to interpretable biological processes. + +Scientific Goals +Our seed network aims to create low-dimensional representations that provide search and catalog capabilities for the HCA. The benefit of these approaches will become particularly pronounced as the number of cells and tissues becomes large. Our central hypothesis is that these approaches will enable faster algorithms while reducing the influence of technical noise. We propose to advance base enabling technologies for low-dimensional representations. + +The first goal of our base enabling technology work is to identify techniques that learn interpretable, biologically-aligned representations. We consider both linear and non-linear techniques. For linear techniques, we rely on our Bayesian, non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004] (PIs Fertig & Goff). This technique learns biologically relevant features across contexts and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is specifically selected as a base enabling technology because its error distribution can naturally account for measurement-specific technical variation [@doi:10.1371/journal.pone.0078127] and its prior distributions can account for different feature quantifications or spatial information. 
For non-linear needs, neural networks with multiple layers provide a complementary path to low-dimensional representations [@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We note that many groups are working in this area for both linear and non-linear techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891]). Because of the substantial number of groups developing neural network based methods, we do not currently plan additional efforts on methods development beyond scCoGAPS. However, we will continue to use and rigorously evaluate these methods and incorporate the best performing methods into our search and catalog tools. We will extend transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig), to enable rapid integration, interpretation, and annotation of learned latent spaces. The latent space team from the HCA collaborative networks RFA (including PIs Fertig, Goff, Greene, and Patro) is establishing common definitions and requirements for latent spaces for the HCA, as well as standardized output formats for low-dimensional representations from distinct classes of methods. + +The second goal of our base enabling technology work is the improvement of techniques for fast and accurate quantification. Existing approaches for quantification from scRNA-seq data using tagged-end protocols (e.g. 10x Chromium, Drop-seq, inDrop) have no mechanism to account for reads mapping to multiple genes in the resulting quantification estimates. This affects approximately 15-25% of the reads generated in a typical experiment, reduces quantification accuracy, and leads to systematic biases in gene expression estimates that correlate with the size of gene families and gene function [@doi:10.1101/335000]. 
We recently developed a quantification method for tagged-end data that accounts for reads mapping to multiple genomic loci in a principled and consistent way [CITE?]. We will expand on this work by building these capabilities into a production-quality tool for the processing of scRNA-seq data. Developing this tool will involve: 1. Exploring alternative models for Unique Molecular Identifier (UMI) resolution. 2. Developing new approaches for quality control and filtering using the UMI-resolution graph. 3. Creating a compressed and indexable data structure for the UMI-resolution graph to enable direct access, query, and fast search prior to secondary analysis. + +We will implement the base enabling technologies and methods for search, analysis, and latent space transformations in R/Bioconductor. We will additionally develop platform-agnostic input and output formats for latent space representations of the HCA data to maximize interoperability. The software tools produced will be fast, scalable, and memory-efficient because we will leverage the computational tools previously developed by Bioconductor for single-cell data access to the HCA, data representation (SingleCellExperiment, beachmat, LinearEmbeddingMatrix, DelayedArray, HDF5Array and rhdf5) and data assessment and amelioration of data quality (scater, scran, DropletUtils). + +Tools and Resources +By using and extending our base enabling technologies we propose to develop three principal tools and resources for the HCA. These include 1) software to enable fast and accurate search and annotation using low-dimensional representations of cellular features, 2) a versioned and annotated catalog of latent spaces corresponding to signatures of cell types, states, and biological attributes across the HCA, and 3) educational materials to increase the use and impact of low-dimensional representations and the HCA in general. 
+ +Aim 1 +Rationale: The HCA provides a reference atlas to human cell types, states, and the biological processes they engage. Scientists will benefit most from the HCA when they can quickly identify cell types and states and compare references to find differences. Low-dimensional representations, because they compress the space, provide the building blocks for search approaches that can be practically applied across very large datasets such as the HCA. We propose to develop algorithms and software for efficient search over the HCA using low-dimensional representations. + +The primary approach to search in low-dimensional spaces is straightforward: one must create an appropriate low-dimensional representation and identify a distance function or functions that match what biologists seek. Using the low-dimensional representation improves speed and can also reduce noise. We will evaluate representations for their ability to support search and implement the best performing approaches. Current approaches require investigators to perform gene-level quantification on the entirety of a new sample. We aim to enable search during sample preprocessing, prior to gene-level quantification. This will enable in-line annotation of cell types and states and identification of novel features as samples are being processed. We will implement and evaluate techniques to learn and transfer shared low-dimensional representations between the UMI-resolution graph and quantified samples, so that samples where either component is available can be used for search and annotation [CASEY ADD SHARED LATENT SPACE REF]. These UMI-graphs will be embedded in the prior of scCoGAPS and architecture of non-linear latent space techniques. [Do we need this line? 
It's a bit more specific than the rest of the paragraph -LAG] + +Just as comparison to a reference genome can identify specific differences in a genome of interest, we will use low-dimensional representations from latent spaces to define a reference transcriptome map (the HCA) and use this to quantify differences in target transcriptome maps from new samples of interest. We will leverage common low-dimensional representations and cell-to-cell correlation structure both within and across transcriptome maps from Aim 2 to define this reference. Quantifying the differences between samples characterized at the single-cell level reveals population or individual level differences. [<-- I'm not sure what this sentence means. Please clarify. - LAG] Comparing scRNA-seq maps from individuals with a particular phenotype to the HCA reference, which is computationally infeasible at the full scale of HCA data, becomes tractable in these low-dimensional spaces. We (PI Hicks) have extensive experience dealing with the distributions of cell expression within and between individuals [@pmid:26040460], which will be critical for defining an appropriate metric to compare references in latent spaces. We plan to implement and evaluate linear mixed models to account for the correlation structure within and between transcriptome maps. This statistical method will be fast, memory-efficient, and scalable to billions of cells using low-dimensional representations. + +Aim 2 +Rationale: Biological systems comprise diverse cell types and states with overlapping molecular phenotypes. Furthermore, biological processes are often reused with modifications across cell types. Low-dimensional representations can identify these shared features, independent of total distance between cells in gene expression space, across large collections of data including the HCA. 
We will evaluate and select methods that define latent spaces that reflect discrete biological processes or cellular features. These latent spaces can be shared across different biological systems and can reveal context-specific divergence such as pathogenic differences in disease. We propose to establish a central catalog of cell types, states, and biological processes derived from low-dimensional representations of the HCA. + +Establishing a catalog of cellular features using low-dimensional representations can reduce noise and aid in biological interpretability. However, there are currently no standardized, quantitative metrics to determine the extent to which low-dimensional representations capture generalizable biological features. We have developed new transfer learning methods to quantify the extent to which latent space representations from one set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947] (PIs Goff & Fertig). These provide a strong foundation to compare low-dimensional representations learned by different techniques. Generalizable representations should transfer across datasets of related biological contexts, while representations of noise will not. In addition, we have found that combining multiple representations can better capture biological processes across scales [@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will establish a versioned catalog consisting of low-dimensional features learned across both linear and non-linear methods from our base enabling technologies and proposed extensions in Aim 1. + +We will package and version low-dimensional representations and annotate these representations based on their corresponding cellular features (e.g. 
cell type, tissue, biological process) and deliver these as structured data objects in Bioconductor as well as platform-agnostic data formats. Such summaries and annotations have proven widely successful for the ENCODE, Roadmap Epigenome Mapping, and GTEx projects. We are core package developers and power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of these materials via the AnnotationHub framework. To enable reproducible research leveraging HCA, we will implement a content-based versioning system, which identifies versions of the reference cell type catalog by the gene weights and transcript nucleotide sequences using a hash function. We (PI Love) developed a hash-based versioning and provenance identification and detection framework for bulk RNA-seq that supports reproducible computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. This will help to avoid scenarios where researchers report on matches to a certain cell type in HCA without specifying precisely which definition of that cell type they used. We will develop F1000Research workflows demonstrating how HCA-defined reference cell types and tools developed in this RFA can be used within a typical genomic data analysis. This catalog will be used as the basis for defining the references for cell type and state, or individual-specific differences with the linear models proposed in Aim 1. + +Aim 3 +Rationale: Low-dimensional representations for scRNA-seq and HCA data make tasks faster and provide interpretable summaries of complex high-dimensional cellular features. The HCA data-associated methods and workflows will be valuable to many biomedical researchers, but their use will require experience with this new toolkit. Furthermore, researchers will need exposure to the conceptual basis of low-dimensional interpretations of biological systems. 
To address these issues, we propose a scalable education effort that reaches students at and beyond the undergraduate level to enable faster adoption and interpretation of the HCA and to maximize its impact. We propose short-course training for the HCA, single cell profiling, machine learning methods, low-dimensional representations, and tools developed by our group in response to this RFA. + +Our educational program is based on a one-week short course that we (PI Hampton) have run annually at Mount Desert Island Biological Lab over the last X TOM FILL IN years. The course covers R, gene expression analysis, and statistical interpretation, and introduces machine learning (PI Greene). Attendees rate the course well and report that they incorporate new knowledge into their research and teaching. Additionally, we have previously developed didactic course material on single cell RNA-Seq analysis for the annual McKusick Short Course on Human and Mammalian Genetics (PI Goff) at Jackson Labs. For this grant we will extend these educational opportunities by developing topics and materials centered on the HCA and interpretation of low-dimensional latent spaces. We (PI Hampton) will run the course at locations distributed throughout the US and provide open course materials on GitHub to allow others to replicate the course. + +New topics will include: + +- Comparison of Bulk and Single-cell Assays and Data +- The Human Cell Atlas Project +- scRNA-seq: Expression Quantification and Cell Type Annotation +- scRNA-seq: Low-dimensional Representations +- scRNA-seq: Search and Analysis in Low-dimensional Representations +We aim to provide a force-multiplier for the HCA and low-dimensional methods as course attendees transmit what they learn to tens of students each year at their own institutions. 
We will run this course on a cost recovery model, but to maximize the multiplier effect we budget at least ten scholarships per offering to cover the room, board, and tuition of faculty who are primarily engaged in undergraduate instruction. This will allow faculty who will disseminate these materials in their own teaching to attend at very low cost. We will develop a one-week module that can be added to an undergraduate class on single-cell profiling and the HCA, which we will distribute via GitHub. Materials will include recorded videos (intended as a refresher for instructors), slides, and exercises. We expect that this module will support faculty who attend with an easy enhancement to any bioinformatics or computational biology instruction that they are already providing at their institution. \ No newline at end of file diff --git a/edits/body_text_thh.txt b/edits/body_text_thh.txt new file mode 100644 index 0000000..83281cf --- /dev/null +++ b/edits/body_text_thh.txt @@ -0,0 +1,39 @@ +Proposal Body (2000 words) +The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes across individuals, tissues and disease states -- resolving differences to the level of individual cells. Realizing the enormous opportunities for scientific advancement that the HCA provides will depend on our ability to search a transcendentally high-dimensional space where each of thousands of cells in a single sample generates a unique gene expression profile of tens of thousands of genes. Although challenging, it is possible to create low-dimensional representations of HCA data that preserve the important features of the original high-dimensional data, but allow accurate and rapid searching and classification. Our seed network will bring together biologists and computer scientists from five leading academic institutions who will work together to create foundational technologies and educational opportunities that promote effective use of Human Cell Atlas data. 
+ +Scientific Goals +We will create low-dimensional representations that provide search and catalog capabilities for the HCA. We believe these approaches are crucial to the long term success of the HCA [I think their current plan is a free text search of the archive based on either cell type or tissue, or conditions. In my opinion, that is much better than nothing. Other repositories like GEO sort of work, and that's their approach. We need t articulate why that is inadequate or at least sub-optimal -THH] Our central hypothesis is that these approaches will enable faster algorithms while reducing the influence of technical noise. We propose to advance base enabling technologies for low-dimensional representations. + +First, we will identify techniques that learn interpretable, biologically-aligned representations. We will consider both linear and non-linear techniques [because X]. For linear techniques, we rely on our Bayesian, non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004](PIs Fertig & Goff). This technique learns biologically relevant features across contexts and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is specifically selected as a base enabling technology because its error distribution can naturally account for measurement-specific technical variation [@doi:10.1371/journal.pone.0078127] and its prior distributions for different feature quantifications or spatial information. For non-linear needs, neural networks with multiple layers, provide a complementary path to low-dimensional representations [@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. 
We will make use of substantial progress that has already been made in both linear and non-linear techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891]) and rigorously evaluate emerging methods, incorporating the best-performing ones into our search and catalog tools. We will extend transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig), to enable rapid integration, interpretation, and annotation of learned latent spaces. The latent space team from the HCA collaborative networks RFA (including PIs Fertig, Goff, Greene, and Patro) is establishing common definitions and requirements for latent spaces for the HCA, as well as standardized output formats for low-dimensional representations from distinct classes of methods. + +Second, we will improve techniques for fast and accurate quantification. Existing approaches for quantification of scRNA-seq data using tagged-end protocols (e.g. 10x Chromium, Drop-seq, inDrop) do not account for reads mapping to multiple genes. This reduces the accuracy of approximately 15-25% of the reads generated in a typical experiment, leading to systematic biases in the assignment of gene families and gene function [@doi:10.1101/335000]. To address this problem, we will build on a recently developed quantification method for tagged-end data that accounts for reads mapping to multiple genomic loci in a principled and consistent way [CITE?] and extend it to scRNA-seq data. Developing our quantification tool will involve: 1. Exploring alternative models for Unique Molecular Identifier (UMI) resolution. 2. Developing new approaches for quality control and filtering using the UMI-resolution graph. 3. Creating a compressed and indexable data structure for the UMI-resolution graph to enable direct access, query, and fast search prior to secondary analysis. 
+ +We will implement these base enabling technologies and methods for search, analysis, and latent space transformations as freely available R/Bioconductor libraries. We will additionally develop platform-agnostic input and output formats for latent space representations of the HCA data to maximize interoperability. These R libraries will be fast, scalable, and memory-efficient, leveraging the computational tools previously developed by Bioconductor for single-cell data access to the HCA, data representation (SingleCellExperiment, beachmat, LinearEmbeddingMatrix, DelayedArray, HDF5Array and rhdf5) and data assessment and amelioration of data quality (scater, scran, DropletUtils). + + +Third, in addition to creating base enabling technologies and tools, we will provide other resources to the HCA community. These include 1) a versioned and annotated catalog of latent spaces corresponding to signatures of cell types, states, and biological attributes across the HCA, and 2) short courses and educational materials that will increase the use and impact of low-dimensional representations and the HCA in general. + +Aim 1 +Rationale: The HCA provides a reference atlas to human cell types, states, and the biological processes in which they engage. The utility of the reference therefore requires that one can easily compare references to each other or a new sample to the compendium of reference samples. The very high dimensionality inherent in single cell RNA-seq data makes such comparisons problematic. Low-dimensional representations, because they compress the space, provide the building blocks for search approaches that can be practically applied across very large datasets such as the HCA. We propose to develop algorithms and software for efficient search over the HCA using low-dimensional representations. 
+ +The primary approach to search in low-dimensional spaces is straightforward: one must create an appropriate low-dimensional representation and identify distance functions that reveal biologically meaningful differences. Ideal low-dimensional representations should be not only much faster to search but also more biologically relevant, because noise is one of the features that low-dimensional representations remove. In this aim, we will evaluate candidate representations to identify those with optimal qualities of compression, noise reduction and retention of biologically meaningful features. Current approaches require investigators to perform gene-level quantification on the entirety of a new sample. We aim to enable search during sample preprocessing, prior to gene-level quantification. This will enable in-line annotation of cell types and states and identification of novel features as samples are being processed. We will implement and evaluate techniques to learn and transfer shared low-dimensional representations between the UMI-resolution graph and quantified samples, so that samples where either component is available can be used for search and annotation [CASEY ADD SHARED LATENT SPACE REF]. These UMI-graphs will be embedded in the prior of scCoGAPS and architecture of non-linear latent space techniques. [Do we need this line? It's a bit more specific than the rest of the paragraph -LAG] + +Just as comparison to a reference genome can identify specific differences in a genome of interest, we will use low-dimensional representations from latent spaces to define a reference transcriptome map (the HCA) and use this to quantify differences in target transcriptome maps from new samples of interest. We will leverage common low-dimensional representations and cell-to-cell correlation structure both within and across transcriptome maps from Aim 2 to define this reference. 
Quantifying the differences between samples characterized at the single-cell level reveals population- or individual-level differences. [<-- I'm not sure what this sentence means. Please clarify. - LAG] Comparison of scRNA-seq maps from individuals with a particular phenotype to the HCA reference, which is computationally infeasible at the full scale of HCA data, becomes tractable in these low-dimensional spaces. We (PI Hicks) have extensive experience dealing with the distributions of cell expression within and between individuals [@pmid:26040460], which will be critical for defining an appropriate metric to compare references in latent spaces. We plan to implement and evaluate linear mixed models to account for the correlation structure within and between transcriptome maps. This statistical method will be fast, memory-efficient, and scalable to billions of cells using low-dimensional representations. + +Aim 2 +Rationale: Biological systems are composed of diverse cell types and states with overlapping molecular phenotypes. Furthermore, biological processes are often reused with modifications across cell types. Low-dimensional representations can identify these shared features, independent of total distance between cells in gene expression space, across large collections of data including the HCA. We will evaluate and select methods that define latent spaces that reflect discrete biological processes or cellular features. These latent spaces can be shared across different biological systems and can reveal context-specific divergence such as pathogenic differences in disease. We propose to establish a central catalog of cell types, states, and biological processes derived from low-dimensional representations of the HCA. + +Establishing a catalog of cellular features using low-dimensional representations can reduce noise and aid in biological interpretability.
However, there are currently no standardized, quantitative metrics to determine the extent to which low-dimensional representations capture generalizable biological features. We have developed new transfer learning methods to quantify the extent to which latent space representations from one set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947] (PIs Goff & Fertig). These provide a strong foundation to compare representations across different low-dimensional representation techniques. Generalizable representations should transfer across datasets of related biological contexts, while representations of noise will not. In addition, we have found that combining multiple representations can better capture biological processes across scales [@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will establish a versioned catalog consisting of low-dimensional features learned across both linear and non-linear methods from our base enabling technologies and proposed extensions in Aim 1. + +We will package and version low-dimensional representations, annotate these representations based on their corresponding cellular features (e.g. cell type, tissue, biological process), and deliver them as structured data objects in Bioconductor as well as in platform-agnostic data formats. Such summaries and annotations have proven widely successful for the ENCODE, Roadmap Epigenomics, and GTEx projects. We are core package developers and power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of these materials via the AnnotationHub framework.
To enable reproducible research leveraging the HCA, we will implement a content-based versioning system, which identifies versions of the reference cell type catalog by the gene weights and transcript nucleotide sequences using a hash function. We (PI Love) developed a hash-based framework for versioning and for provenance identification and detection in bulk RNA-seq that supports reproducible computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. This will help to avoid scenarios where researchers report on matches to a certain cell type in the HCA without precisely specifying which definition of that cell type was used. We will develop F1000Research workflows demonstrating how HCA-defined reference cell types and tools developed in response to this RFA can be used within a typical genomic data analysis. This catalog will serve as the basis for defining the references for cell type and state, or for individual-specific differences, with the linear mixed models proposed in Aim 1. + +Aim 3 +Rationale: Low-dimensional representations for scRNA-seq and HCA data make tasks faster and provide interpretable summaries of complex high-dimensional data. The HCA data and associated methods will be valuable to many biomedical fields, but their use will require an understanding of basic bioinformatics, scRNA-seq, and how the search algorithms we develop work. This aim addresses these needs in three ways. + +First, we will develop and market a bioinformatic training program for biologists at all levels, including those with no experience in bioinformatics. Lecture materials will make use of materials from previous bioinformatic courses we (PI Hampton) have run at Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and the Geisel School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists in basic bioinformatics and always achieve approval ratings of over 90%.
We believe part of the success of these learning experiences has to do with our instructional paradigm, which includes a very challenging course project coupled with one-on-one support from instructors. The course offering that will be created in response to this RFA will be the most ambitious in our history, because single cell RNA-seq is more challenging than ordinary gene expression analysis. To achieve our training goals, we will develop a new curriculum specifically tailored to the HCA, covering single cell profiling (based on previously developed didactic course material on single cell RNA-seq analysis for the annual McKusick Short Course on Human and Mammalian Genetics at Jackson Labs; PI Goff), machine learning methods, low-dimensional representations, and tools developed by our group in response to this RFA (PI Greene). + +Second, the short course will train not only students, but also instructors. Our one-on-one approach to course projects will require a high instructor-to-student ratio. We will therefore recruit former participants of this class to return in subsequent years, first as teaching assistants, and later as module presenters. We have found that course alumni are eager to improve their teaching resumes, that they learn the material in a new way as they begin to teach it, and that they are an invaluable resource in understanding how to improve the course over time. Part of our strategy is to support this community, which includes many people who will drive the next wave of innovation. All of our course materials, such as slides, code, and video lectures, will be freely available, enabling course participants to bring what they learned home with them. A capstone session will be included in which we will provide suggestions about how the materials presented in the course can be incorporated into existing course curricula. Course faculty will be available to assist with integration efforts after the course.
+ +Finally, the short course will facilitate scientific collaborations as follows. In addition to engaging in group presentations as part of the class project that will involve scRNA-seq data retrieved from HCA, participants will also provide a 10-minute chalk talk explaining their research interests. This allows course faculty to identify course participants whose needs they can help address, both during and after the course. The course design will include faculty office hours where participants can get help with any bioinformatic problems they are encountering with their ongoing research. + +We will provide this course free to 25 students twice a year, with the understanding that its effective reach will be much greater than the 150 students who enroll. Our students will begin to teach others, and our materials will be incorporated into many local courses. diff --git a/letters/2018-CZI-Goff-Greene.doc b/letters/2018-CZI-Goff-Greene.doc new file mode 100644 index 0000000..fe14c32 Binary files /dev/null and b/letters/2018-CZI-Goff-Greene.doc differ diff --git a/letters/Jax Short Course LoS 2018.docx b/letters/Jax Short Course LoS 2018.docx new file mode 100644 index 0000000..bdacddf Binary files /dev/null and b/letters/Jax Short Course LoS 2018.docx differ diff --git a/submission/LoS/Greene_CZI_2018_LoS.pdf b/submission/LoS/Greene_CZI_2018_LoS.pdf new file mode 100644 index 0000000..8bb18f5 Binary files /dev/null and b/submission/LoS/Greene_CZI_2018_LoS.pdf differ diff --git a/submission/LoS/Hampton Letter of Support.pdf b/submission/LoS/Hampton Letter of Support.pdf new file mode 100644 index 0000000..c462663 Binary files /dev/null and b/submission/LoS/Hampton Letter of Support.pdf differ diff --git a/submission/LoS/Hampton_Dartmouth_letter_of_commitment_11.06.18.pdf b/submission/LoS/Hampton_Dartmouth_letter_of_commitment_11.06.18.pdf new file mode 100644 index 0000000..3e81fd3 Binary files /dev/null and 
b/submission/LoS/Hampton_Dartmouth_letter_of_commitment_11.06.18.pdf differ diff --git a/submission/LoS/Shaw_Hampton_CZI_2018_LoS.pdf b/submission/LoS/Shaw_Hampton_CZI_2018_LoS.pdf new file mode 100644 index 0000000..164f891 Binary files /dev/null and b/submission/LoS/Shaw_Hampton_CZI_2018_LoS.pdf differ diff --git a/submission/biosketches/Fertig_CZI_2018_Biosketch.pdf b/submission/biosketches/Fertig_CZI_2018_Biosketch.pdf new file mode 100644 index 0000000..9e06129 Binary files /dev/null and b/submission/biosketches/Fertig_CZI_2018_Biosketch.pdf differ diff --git a/submission/biosketches/Hampton_CZI_2018_BioSketch.pdf b/submission/biosketches/Hampton_CZI_2018_BioSketch.pdf new file mode 100644 index 0000000..a973647 Binary files /dev/null and b/submission/biosketches/Hampton_CZI_2018_BioSketch.pdf differ diff --git a/submission/biosketches/Hicks_CZI_2018_biosketch.pdf b/submission/biosketches/Hicks_CZI_2018_biosketch.pdf new file mode 100644 index 0000000..695e2c1 Binary files /dev/null and b/submission/biosketches/Hicks_CZI_2018_biosketch.pdf differ diff --git a/submission/biosketches/greene_czi_2018_biosketch.pdf b/submission/biosketches/greene_czi_2018_biosketch.pdf new file mode 100644 index 0000000..2e03c20 Binary files /dev/null and b/submission/biosketches/greene_czi_2018_biosketch.pdf differ diff --git a/submission/biosketches/love_CZI_2018_biosketch.pdf b/submission/biosketches/love_CZI_2018_biosketch.pdf new file mode 100644 index 0000000..063c63f Binary files /dev/null and b/submission/biosketches/love_CZI_2018_biosketch.pdf differ diff --git a/submission/biosketches/patro_czi_2018_biosketch.pdf b/submission/biosketches/patro_czi_2018_biosketch.pdf new file mode 100644 index 0000000..edb0a03 Binary files /dev/null and b/submission/biosketches/patro_czi_2018_biosketch.pdf differ