The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes across individuals, tissues and disease states -- resolving differences to the level of individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions that preserve the important attributes of the original high dimensional data and yield interpretable, searchable features. For transcriptomic data, compressing on the gene dimension is most attractive: it can be applied to single samples, and genes often provide information about other co-regulated genes or cellular attributes. We hypothesize that using latent space methods to identify low dimensional representations of HCA data will accurately capture biological sources of variability and will be robust to measurement noise. Our seed network incorporates biologists, computer scientists, statisticians, and data scientists from five leading academic institutions who will work collaboratively together to create foundational technologies and educational opportunities that promote effective interpretation of low dimensional representations of HCA data. We will continue our active collaborations with other members of the broader HCA network to integrate state of the art latent space tools, portals, and annotations to enable biological utilization of HCA data through latent spaces.
We will create low-dimensional representations that provide search and catalog capabilities for the HCA. Given both the scale of data, and the inherent complexity of biological systems, we believe these approaches are crucial to the long term success of the HCA. Our central hypothesis is that these approaches will enable faster algorithms while reducing the influence of technical noise. We propose to advance base enabling technologies for low-dimensional representations.
First, we will identify techniques that learn interpretable, biologically-aligned representations. We will consider both linear and non-linear techniques as each may identify distinct components of biological systems. For linear techniques, we rely on our Bayesian, non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004] (PIs Fertig & Goff). This technique learns biologically relevant features across contexts and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950], including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is specifically selected as a base enabling technology because its error distribution can naturally account for measurement-specific technical variation [@doi:10.1371/journal.pone.0078127] and its prior distributions for different feature quantifications or spatial information. For non-linear needs, neural networks with multiple layers provide a complementary path to low-dimensional representations [@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We will make use of substantial progress that has already been made in both linear and non-linear techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891]). and rigorously evaluate emerging methods into our search and catalog tools. We will extend transfer learning methods, including ProjectR [@doi:10.1101/395004] (PIs Goff & Fertig) to enable rapid integration, interpretation, and annotation of learned latent spaces. The latent space team from the HCA collaborative networks RFA (including PIs Fertig, Goff, Greene, and Patro) is establishing common definitions and requirements for latent spaces for the HCA, as well as standardized output formats for low-dimensional representations from distinct classes of methods.
Second, we will improve techniques for fast and accurate quantification. Existing approaches for scRNA-seq data using tagged-end end protocols (e.g. 10x Chromium, drop-Seq, inDrop, etc.) do not account for reads mapping between multiple genes. This affects approximately 15-25% of the reads generated in a typical experiment, reducing quantification accuracy, and leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address this, we will build on our recently developed quantification method for tagged-end data that accounts for reads mapping to multiple genomic loci in a principled and consistent way [@doi:10.1101/335000] (PI Patro), and extend this into a production quality tool for scRNA-seq preprocessing. Our tool will support: 1. Exploration of alternative models for Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality control and filtering using the UMI-resolution graph. 3. Creation of a compressed and indexible data structure for the UMI-resolution graph to enable direct access, query, and fast search prior to secondary analysis.
We will implement these base enabling technologies and methods for search, analysis, and latent space transformations as freely available, open source software tools. We will additionally develop platform-agnostic input and output data formats and standards for latent space representations of the HCA data to maximize interoperability. The software tools produced will be fast, scalable, and memory-efficient by leveraging the available assets and expertise of the R/Bioconductor project (PIs Hicks & Love) as well as the broader HCA community.
By using and extending our base enabling technologies, we will provide three principle tools and resources for the HCA. These include 1) software to enable fast and accurate search and annotation using low-dimensional representations of cellular features, 2) a versioned and annotated catalog of latent spaces corresponding to signatures of cell types, states, and biological attributes across the the HCA, and 3) short course and educational materials that will increase the use and impact of low-dimensional representations and the HCA in general.
Rationale: The HCA provides a reference atlas to human cell types, states, and the biological processes in which they engage. The utility of the reference therefore requires that one can easily compare references to each other, or a new sample to the compendium of reference samples. Low-dimensional representations, because they compress the space, provide the building blocks for search approaches that can be practically applied across very large datasets such as the HCA. We propose to develop algorithms and software for efficient search over the HCA using low-dimensional representations.
The primary approach to search in low-dimensional spaces is straightforward: one must create an appropriate low-dimensional representation and identify distance functions that enable biologically meaningful comparisons. Ideal low-dimensional representations are predicted to be much faster to search, and potentially more biologically relevant, as noise can be removed. In this aim, we will evaluate novel, low-dimensional representations to identify those with optimal qualities of compression, noise reduction, and retention of biologically meaningful features. Current scRNA-seq approaches require investigators to perform gene-level quantification on the entirety of a new sample. We aim to search during sample preprocessing, prior to gene-level quantification. This will enable in-line annotation of cell types and states and identification of novel features as samples are being processed. We will implement and evaluate techniques to learn and transfer shared low-dimensional representations between raw or lightly processed data (e.g., kmer representations or UMI-graphs) and quantified samples, so that samples where either quantified or raw data are available can be used for search and annotation [@url:https://github.com/greenelab/shared-latent-space].
Similar to the approach by which comparisons to a reference genomes can identify specific differences in a genome of interest, we will use low-dimensional representations from latent spaces to define a reference transcriptome map (the HCA), and use this to quantify differences in target transcriptome maps from new samples of interest. We will leverage common low-dimensional representations and cell-to-cell correlation structure both within and across transcriptome maps from Aim 2 to define this reference. Quantifying the differences between samples characterized at the single-cell level reveals population or individual level differences. Comparison of scRNA-seq maps from individuals with a particular phenotype to the HCA reference, which is computationally infeasible from the large scale of HCA data, becomes tractable in these low dimensional spaces. We (PI Hicks) have extensive experience dealing with the distributions of cell expression within and between individuals [@pmid:26040460], which will be critical for defining an appropriate metric to compare references in latent spaces. We plan to implement and evaluate linear mixed models to account for the correlation structure within and between transcriptome maps. This statistical method will be fast, memory-efficient and will be scalable to billions of cells using low-dimensional representations.
Rationale: Biological systems are comprised of diverse cell types and states with overlapping molecular phenotypes. Furthermore, biological processes are often reused with modifications across cell types. Low-dimensional representations can identify these shared features, independent of total distance between cells in gene expression space, across large collections of data including the HCA. We will evaluate and select methods that define latent spaces that reflect discrete biological processes or cellular features. These latent spaces can be shared across different biological systems and can reveal context-specific divergence such as pathogenic differences in disease. We propose to establish a versioned catalog of cell types, states, and biological processes derived from low-dimensional representations of the HCA.
Establishing a reference catalog of cellular features using low-dimensional representations can help to reduce noise and aid in biological interpretability. However, there are currently no standardized, quantitative metrics to determine the extent to which low-dimensional representations capture generalizable biological features. We have developed new transfer learning methods to quantify the extent to which latent space representations from one set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947,@doi:10.1101/395947] (PIs Greene, Goff & Fertig). These provide a foundation to compare different low-dimensional representations through cross-validation techniques by learning representations in source datasets and testing their ability to transfer into a target dataset. Generalizable representations should also be robust in cross-study validation, transferring across datasets of related biological contexts, while representations of noise will not. In addition, we have found that combining multiple representations can better capture biological processes across scales [@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct, valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will establish a catalog consisting of low-dimensional features learned across both linear and non-linear methods from our base enabling technologies and proposed extensions in Aim 1.
We will package and version low-dimensional representations of the HCA and annotate these
latent spaces via their corresponding celluar features. We will deliver these as structured data objects in Bioconductor as well as
platform-agnostic data formats. Where applicable, we will leverage the computational tools
previously developed by Bioconductor for single-cell data access to the HCA, data
representation (SingleCellExperiment
, beachmat
, LinearEmbeddingMatrix
, DelayedArray
,
HDF5Array
and rhdf5
) and data assessment and amelioration of data quality (scater
,
scran
, DropletUtils
). We are core package developers and
power users of Bioconductor (PIs Hicks and Love) and will support on-the-fly downloading of
these materials via the AnnotationHub framework. To enable reproducible research
leveraging HCA, we will implement a content-based versioning system,
which identifies versions of the reference catalog by the gene weights and transcript nucleotide
sequences using a hash function. We (PIs Love and Patro) previously developed hash-based versioning and provenance
detection framework for bulk RNA-seq that supports reproducible
computational analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta].
Our versioning and dissemination of reference latent space catalogs
will help to avoid scenarios where researchers report on matches to a certain cell type in
HCA without precisely defining which definition of that cell type. We will develop
F1000Research workflows demonstrating how HCA-defined reference cell types and tools
developed in this RFA can be used within a typical genomic data analysis. This catalogue
will be used as the basis of defining the references for cell type and state, or
individual-specific differences with the linear models proposed in Aim 1.
Rationale: Low-dimensional representations of scRNA-seq and HCA data make tasks faster and provide interpretable summaries of complex high-dimensional cellular features. The HCA data-associated methods and workflows will be valuable to many biomedical fields, but their use will require an understanding of basic bioinformatics, scRNA-seq, and how the tools being developed work. Furthermore, researchers will need exposure to the conceptual basis of low-dimensional interpretations of biological systems. This aim addresses these needs in three ways.
First, we will develop a bioinformatic training program for biologists at all levels, including those with no experience in bioinformatics. Lecture materials will be extended from existing materials from previous bioinformatic courses we (PI Hampton) have run at Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and Geisel School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists in basic bioinformatics and always achieve approval ratings of over 90%. We believe part of the success of these learning experiences has to do with our instructional paradigm, which includes a very challenging course project coupled with one-on-one support from instructors. We will develop a new curriculum specifically tailored to HCA that incorporates: 1) didactic course material on single cell gene expression profiling (PI Goff), 2) machine learning methods (PI Greene), 4) statistics for genomics (PIs Fertig and Hicks), 4) search and analysis in low-dimensional representations, and 5) tools developed by our group in response to this RFA.
Second, the short course will train not only students, but also instructors. Our one-on-one approach to course projects will require a high instructor-to-student ratio. We will therefore recruit former participants of this class to return in subsequent years, first as teaching assistants, and later as module presenters. We have found that course alumni are eager to improve their teaching resumes, that they learn the material in a new way as they begin to teach it, and that they are an invaluable resource in understanding how to improve the course over time. Part of our strategy is to support this community, which includes many people who will drive the next wave of innovation. All of our course materials will be freely available, enabling course participants to bring what they learned home with them. A capstone session will be included in which we will provide suggestions about how the materials presented in the course can be incorporated into existing course curricula. Course faculty will be available to assist with integration effort after the course. Finally, the short course will facilitate scientific collaborations by engaging participants in utilizing these tools for collaborative research efforts.
[I feel like we are missing a concluding summary of broader impacts to pull this together - could be a brief bulleted summary of tools required by app as Andrew suggested - EJF]