From e83e51dcd89256403bb787c3d9a46e4ee8d04a9e Mon Sep 17 00:00:00 2001
From: Daniel Himmelstein
Date: Sat, 6 Oct 2018 10:01:27 -0400
Subject: [PATCH 1/8] CSL: display="block" every other group

Merges https://github.com/greenelab/manubot-rootstock/pull/134

Previous blocks were causing a blank line after the author line, before the journal-date-URL line, in markdown output.
Placing blocks at every other line seems to be the solution required for pandoc-citeproc / pandoc to properly place the newlines.
---
 build/assets/style.csl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/build/assets/style.csl b/build/assets/style.csl
index 756c758..07ba262 100644
--- a/build/assets/style.csl
+++ b/build/assets/style.csl
@@ -51,7 +51,7 @@
-
+
@@ -60,7 +60,7 @@
-
+

From 830cdc088dee2a61452bd56d990e0277daa0a9af Mon Sep 17 00:00:00 2001
From: Daniel Himmelstein
Date: Mon, 8 Oct 2018 15:12:33 -0400
Subject: [PATCH 2/8] Update USAGE.md & environment on 2018-10-08

Merges https://github.com/greenelab/manubot-rootstock/pull/135
Closes https://github.com/greenelab/manubot/issues/59

* USAGE.md: suggested citation & acknowledgments

Mostly copied from https://github.com/greenelab/manubot/blob/9d97ec347882bcd85ab6aee7a3b4734105ebfc15/README.md

* Update environment on 2018-10-08

Updates to https://github.com/greenelab/manubot@9d97ec347882bcd85ab6aee7a3b4734105ebfc15 which is slightly past the manubot v0.2.0 release.

jsonschema & jsonref should have been added to the environment previously, since they are required by manubot for JSON schema validation, but were not added.

Do not upgrade to Python 3.7 due to collections DeprecationWarnings in several packages. The warning in https://github.com/gazpachoking/jsonref/pull/26 has been fixed, but several remain.

* Travis CI: update miniconda to 4.5.11
---
 .travis.yml           |  2 +-
 USAGE.md              | 17 +++++++++++++++++
 build/environment.yml | 10 ++++++----
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index 60cc09a..29420f4 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -5,7 +5,7 @@ branches:
  only:
  - master
before_install:
-  - wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
+  - wget https://repo.continuum.io/miniconda/Miniconda3-4.5.11-Linux-x86_64.sh
    --output-document miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - source $HOME/miniconda/etc/profile.d/conda.sh
diff --git a/USAGE.md b/USAGE.md
index df02493..ddcdc07 100644
--- a/USAGE.md
+++ b/USAGE.md
@@ -167,3 +167,20 @@ For additional examples, check out existing manuscripts that use the Manubot (so
+ The Manubot 2018 Development Proposal ([source](https://github.com/greenelab/manufund-2018), [manuscript](https://greenelab.github.io/manufund-2018/))

If you are using the Manubot, feel free to submit a pull request to add your manuscript to the list above.
+
+## Citing Manubot
+
+To cite the Manubot project or for more information on its design and history, see `@url:https://greenelab.github.io/meta-review/`:
+
+> **Open collaborative writing with Manubot**<br>
+Daniel S. Himmelstein, David R. Slochower, Venkat S. Malladi, Casey S. +Greene, Anthony Gitter
+_Manubot Preprint_ (2018) + +## Acknowledgments + +We would like to thank the contributors and funders whose support makes the Manubot project possible. +Specifically, Manubot development has been financially supported by: + +- the **Alfred P. Sloan Foundation** in [Grant G-2018-11163](https://sloan.org/grant-detail/8501) to [**@dhimmel**](https://github.com/dhimmel). +- the **Gordon & Betty Moore Foundation** ([**@DDD-Moore**](https://github.com/DDD-Moore)) in [Grant GBMF4552](https://www.moore.org/grant-detail?grantId=GBMF4552) to [**@cgreene**](https://github.com/cgreene). diff --git a/build/environment.yml b/build/environment.yml index a77ba34..cc38436 100644 --- a/build/environment.yml +++ b/build/environment.yml @@ -6,15 +6,17 @@ dependencies: - conda-forge::cffi=1.11.5 - conda-forge::ghp-import=0.5.5 - conda-forge::jinja2=2.10 + - conda-forge::jsonschema=2.6.0 - conda-forge::pandas=0.23.4 - - conda-forge::pandoc=2.2.2 - - conda-forge::python=3.6.6 - - conda-forge::pyyaml=3.12 + - conda-forge::pandoc=2.3.1 + - conda-forge::python=3.6.5 + - conda-forge::pyyaml=3.13 - conda-forge::requests=2.19.1 - conda-forge::watchdog=0.8.3 - pip: - errorhandler==2.0.1 - - git+https://github.com/greenelab/manubot@66a6efc6f4b84153a813aa423ec00725ed1417c5 + - git+https://github.com/greenelab/manubot@9d97ec347882bcd85ab6aee7a3b4734105ebfc15 + - jsonref==0.2 - opentimestamps-client==0.6.0 - opentimestamps==0.4.0 - pandoc-eqnos==1.3.0 From 6bb804c018d40f0ef4e7f79fcfa5600827f112ca Mon Sep 17 00:00:00 2001 From: Anthony Gitter Date: Tue, 9 Oct 2018 11:20:08 -0500 Subject: [PATCH 3/8] Case sensitive environment variables in setup instructions Merges https://github.com/greenelab/manubot-rootstock/pull/138 Closes https://github.com/greenelab/manubot-rootstock/issues/137 * Case sensitive environment variables * Update phrasing borrowing dhimmel ideas * Change account to username --- SETUP.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/SETUP.md b/SETUP.md index 554cabd..22a4f06 100644 --- a/SETUP.md +++ b/SETUP.md @@ -9,10 +9,12 @@ Setup is supported on Linux and macOS, but [**not on Windows**](https://github.c First, you must configure two environment variables (`OWNER` and `REPO`). These variables specify the GitHub repository for the manuscript (i.e. `https://github.com/OWNER/REPO`). +Make sure that the case of `OWNER` matches how your username is displayed on GitHub. +In general, assume that all commands in this setup are case-sensitive. 
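If you are unsure of the exact casing, one way to check it is to query the GitHub users API. This is an illustrative sketch, not part of the original setup instructions; it assumes `curl` is available and that api.github.com is reachable:

```sh
# Print the canonical casing of a GitHub username as GitHub displays it
# (replace greenelab with your own username)
curl --silent https://api.github.com/users/greenelab | grep '"login"'
```

The `login` field in the response shows the username exactly as GitHub records it.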
**Edit the following commands with your manuscript's information:** ```sh -# GitHub account (change from greenelab) +# GitHub username (change from greenelab) OWNER=greenelab # Repository name (change from manubot-rootstock) REPO=manubot-rootstock From dcfe402544f90e0047a247cdd8d01a2bca08df54 Mon Sep 17 00:00:00 2001 From: Daniel Himmelstein Date: Tue, 23 Oct 2018 15:23:20 -0400 Subject: [PATCH 4/8] Bugfix environment updates on 2018-10-23 Merges https://github.com/greenelab/manubot-rootstock/pull/140 Closes https://github.com/greenelab/manubot-rootstock/issues/136 Updates Manubot to fix empty date-parts issue Fix opentimestamps incompatible pinned versions --- build/environment.yml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/build/environment.yml b/build/environment.yml index cc38436..73171db 100644 --- a/build/environment.yml +++ b/build/environment.yml @@ -15,16 +15,16 @@ dependencies: - conda-forge::watchdog=0.8.3 - pip: - errorhandler==2.0.1 - - git+https://github.com/greenelab/manubot@9d97ec347882bcd85ab6aee7a3b4734105ebfc15 + - git+https://github.com/greenelab/manubot@4e6a0f6d28220264c0d7892c732cb68c3e97c07a - jsonref==0.2 - opentimestamps-client==0.6.0 - - opentimestamps==0.4.0 + - opentimestamps==0.3.0 - pandoc-eqnos==1.3.0 - pandoc-fignos==1.3.0 - pandoc-tablenos==1.3.0 - pandoc-xnos==1.1.1 - pybase62==0.4.0 - pysha3==1.0.2 - - python-bitcoinlib==0.10.1 + - python-bitcoinlib==0.9.0 - requests-cache==0.4.13 - weasyprint==0.42.3 From 895924e6a8c693a0ee110ec410b153b7ceb09bc2 Mon Sep 17 00:00:00 2001 From: Daniel Himmelstein Date: Tue, 30 Oct 2018 15:28:24 -0400 Subject: [PATCH 5/8] Enable raw citations & small USAGE updates Merges https://github.com/greenelab/manubot-rootstock/pull/141 * Update USAGE with current URLs & recs Use HTTPS URLs where possible Switch SVG embedding solution from rawgit.com, which is shutting down, to the native GitHub solution. https://twitter.com/dhimmel/status/1049361799244664834 * Enable raw citations Update USAGE.md with raw citation example. Also update citations in delete-me.md. --- USAGE.md | 44 +++++++++++++++++++++++++++-------------- build/environment.yml | 2 +- content/02.delete-me.md | 4 ++-- 3 files changed, 32 insertions(+), 18 deletions(-) diff --git a/USAGE.md b/USAGE.md index ddcdc07..5bd3507 100644 --- a/USAGE.md +++ b/USAGE.md @@ -1,18 +1,18 @@ # Manubot usage guidelines -This repository uses the [Manubot](https://github.com/greenelab/manubot) to automatically produce a manuscript from its source. +This repository uses [Manubot](https://github.com/greenelab/manubot-rootstock) to automatically produce a manuscript from the source in the [`content`](content) directory. ## Manubot markdown Manuscript text should be written in markdown files, which should be located in [`content`](content) with a digit prefix for ordering (e.g. `01.`, `02.`, etc.) and a `.md` extension. -For basic formatting, check out the [CommonMark Help](http://commonmark.org/help/) page for an introduction to the formatting options provided by standard markdown. -In addition, manubot supports an extended version of markdown, tailored for scholarly writing, which includes [Pandoc's Markdown](http://pandoc.org/MANUAL.html#pandocs-markdown) and the extensions discussed below. +For basic formatting, check out the [CommonMark Help](https://commonmark.org/help/) page for an introduction to the formatting options provided by standard markdown. 
+In addition, Manubot supports an extended version of markdown, tailored for scholarly writing, which includes [Pandoc's Markdown](https://pandoc.org/MANUAL.html#pandocs-markdown) and the extensions discussed below. Within a paragraph in markdown, single newlines are interpreted as whitespace (same as a space). A paragraph's source does not need to contain newlines. However, "one paragraph per line" makes the git diff less precise, leading to less granular review commenting, and makes conflicts more likely. -Therefore, we recommend using [semantic linefeeds](http://rhodesmill.org/brandon/2012/one-sentence-per-line/ "Semantic Linefeeds. Brandon Rhodes. 2012") — newlines between sentences. +Therefore, we recommend using [semantic linefeeds](https://rhodesmill.org/brandon/2012/one-sentence-per-line/ "Semantic Linefeeds. Brandon Rhodes. 2012") — newlines between sentences. We have found that "one sentence per line" is preferable to "word wrap" or "one paragraph per line". ### Tables @@ -30,7 +30,7 @@ Table: Caption for this example table. {#tbl:example-id} Support for table numbering and citation is provided by [`pandoc-tablenos`](https://github.com/tomduck/pandoc-tablenos). Above, `{#tbl:example-id}` sets the table ID, which creates an HTML anchor and allows citing the table like `@tbl:example-id`. -For easy creation of markdown tables, check out the [Tables Generator](http://www.tablesgenerator.com/markdown_tables) webapp. +For easy creation of markdown tables, check out the [Tables Generator](https://www.tablesgenerator.com/markdown_tables) webapp. ### Figures @@ -45,11 +45,11 @@ This figure can be cited in the text using `@fig:example-id`. In context, a figure citation may look like: `Figure {@fig:example-id}B shows …`. For images created by the manuscript authors that are hosted elsewhere on GitHub, we recommend using a [versioned](https://help.github.com/articles/getting-permanent-links-to-files/) GitHub URL to embed figures, thereby preserving exact image provenance. -When embedding SVG images hosted on GitHub, passing the URL through [RawGit](https://rawgit.com/) is necessary. -An example of a URL that has been passed through RawGit is: +When embedding SVG images hosted on GitHub, it's necessary to append `?sanitize=true` to the `raw.githubusercontent.com` URL. +For example: ``` -https://cdn.rawgit.com/greenelab/scihub/572d6947cb958e797d1a07fdb273157ad9154273/figure/citescore.svg +https://raw.githubusercontent.com/greenelab/scihub/572d6947cb958e797d1a07fdb273157ad9154273/figure/citescore.svg?sanitize=true ``` Figures placed in the [`content/images`](content/images) directory can be embedded using their relative path. @@ -59,19 +59,19 @@ For example, we embed an [ORCID](https://orcid.org/) icon inline using: ![ORCID icon](images/orcid.svg){height="13px"} ``` -The bracketed text following the image declaration is interpreted by Pandoc's [`link_attributes`](http://pandoc.org/MANUAL.html#extension-link_attributes) extension. +The bracketed text following the image declaration is interpreted by Pandoc's [`link_attributes`](https://pandoc.org/MANUAL.html#extension-link_attributes) extension. For example, the following will override the figure number to be "S1" and set the image width to 5 inches: ```md {#fig:supplement tag="S1" width="5in"} ``` -We recommend always specifying the width of SVG images (even if just `width="100%"`), since otherwise SVGs may not render properly in the [WeasyPrint](http://weasyprint.org/) PDF export. 
+We recommend always specifying the width of SVG images (even if just `width="100%"`), since otherwise SVGs may not render properly in the [WeasyPrint](https://weasyprint.org/) PDF export. ### Citations -Manubot supports Pandoc [citations](http://pandoc.org/MANUAL.html#citations) via `pandoc-citeproc`. -However, Manubot performs automated citation processing and metadata retrieval on raw citations. +Manubot supports Pandoc [citations](https://pandoc.org/MANUAL.html#citations) via `pandoc-citeproc`. +However, Manubot performs automated citation processing and metadata retrieval on in-text citations. Therefore, citations must be of the following form: `@source:identifier`, where `source` is one of the options described below. When choosing which source to use for a citation, we recommend the following order: @@ -80,6 +80,8 @@ When choosing which source to use for a citation, we recommend the following ord 3. PubMed ID, cite like `@pmid:26158728`. 4. _arXiv_ ID, cite like `@arxiv:1508.06576v2`. 5. URL / webpage, cite like `@url:http://openreview.net/pdf?id=Sk-oDY9ge`. +6. For references that do not have any of the persistent identifiers above, use a raw citation like `@raw:old-manuscript`. +Metadata for raw citations must be provided manually. Cite multiple items at once like: @@ -109,8 +111,8 @@ The Manubot workflow requires the bibliographic details for references (the set The Manubot attempts to automatically retrieve metadata and generate valid citeproc JSON for references, which is exported to `output/references.json`. However, in some cases the Manubot fails to retrieve metadata or generates incorrect or incomplete citeproc metadata. Errors are most common for `url` references. -For these references, you can manually specify the correct citeproc in [`content/manual-references.json`](content/manual-references.json), which will override the automatically generated reference data. -To do so, create a new citeproc record that contains the field `"standard_citation"` with the appropriate reference identifier as its value. +For these references, you can manually specify the correct CSL Data in [`content/manual-references.json`](content/manual-references.json), which will override the automatically generated reference data. +To do so, create a new CSL JSON Item that contains the field `"standard_citation"` with the appropriate reference identifier as its value. The identifier can be obtained from the `standard_citation` column of `citations.tsv`, which is located in the `output` branch or in the `output` subdirectory of local builds. As an example, `manual-references.json` contains: @@ -118,11 +120,23 @@ As an example, `manual-references.json` contains: "standard_citation": "url:https://github.com/greenelab/manubot-rootstock" ``` +The metadata for `raw` citations must be provided in `manual-references.json` or an error will occur. +For example, to cite `@raw:private-message` in a manuscript, a corresponding CSL Item in `manual-references.json` is required, such as: + +```json +{ + "type": "personal_communication", + "standard_citation": "raw:private-message", + "title": "Personal communication with Doctor X" +} +``` + +All references provided in `manual-references.json` must provide values for the `type` and `standard_citation` fields. For guidance on what CSL JSON should be like for different document types, refer to [these examples](https://github.com/aurimasv/zotero-import-export-formats/blob/a51c342e66bebd97b73a7230047b801c8f7bb690/CSL%20JSON.json). 
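For instance, a more complete entry for the `@raw:old-manuscript` citation mentioned above might look like the following sketch, in which every field other than `type` and `standard_citation` is hypothetical filler:

```json
{
  "type": "report",
  "standard_citation": "raw:old-manuscript",
  "title": "An old manuscript without a persistent identifier",
  "author": [
    {"family": "Doe", "given": "Jane"}
  ],
  "issued": {"date-parts": [[2017]]}
}
```

Any standard CSL JSON fields (such as `author` and `issued` above) can be supplied in the same way.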
## Manuscript metadata -[`content/metadata.yaml`](content/metadata.yaml) contains manuscript metadata that gets passed through to Pandoc, via a [`yaml_metadata_block`](http://pandoc.org/MANUAL.html#extension-yaml_metadata_block). +[`content/metadata.yaml`](content/metadata.yaml) contains manuscript metadata that gets passed through to Pandoc, via a [`yaml_metadata_block`](https://pandoc.org/MANUAL.html#extension-yaml_metadata_block). `metadata.yaml` should contain the manuscript `title`, `authors` list, `keywords`, and `lang` ([language tag](https://www.w3.org/International/articles/language-tags/ "W3C: Language tags in HTML and XML")). Additional metadata, such as `date`, will automatically be created by the Manubot. Manubot uses the [timezone](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) specified in [`build.sh`](build/build.sh) for setting the manuscript's date. diff --git a/build/environment.yml b/build/environment.yml index 73171db..77d7a50 100644 --- a/build/environment.yml +++ b/build/environment.yml @@ -15,7 +15,7 @@ dependencies: - conda-forge::watchdog=0.8.3 - pip: - errorhandler==2.0.1 - - git+https://github.com/greenelab/manubot@4e6a0f6d28220264c0d7892c732cb68c3e97c07a + - git+https://github.com/greenelab/manubot@a008126b39e3bd4b80ebaa5af9f9fa2f30b3a670 - jsonref==0.2 - opentimestamps-client==0.6.0 - opentimestamps==0.3.0 diff --git a/content/02.delete-me.md b/content/02.delete-me.md index ba79516..6484a4e 100644 --- a/content/02.delete-me.md +++ b/content/02.delete-me.md @@ -6,10 +6,10 @@ The Manubot is a system for automating scholarly publishing. Content is written in [Pandoc Markdown](http://pandoc.org/MANUAL.html#pandocs-markdown) source files. See [`USAGE.md`](https://github.com/greenelab/manubot-rootstock/blob/master/USAGE.md) for more information on how to use the Manubot. -The Manubot project began with the [Deep Review](https://github.com/greenelab/deep-review), where it was used to compose a highly-collaborative review article [@doi:10.1101/142760]. +The Manubot project began with the [Deep Review](https://github.com/greenelab/deep-review), where it was used to compose a highly-collaborative review article [@doi:10.1098/rsif.2017.0387]. 
Other manuscripts that were created with Manubot include: -+ The Sci-Hub Coverage Study ([GitHub](https://github.com/greenelab/scihub-manuscript), [HTML manuscript](https://greenelab.github.io/scihub-manuscript/)) ++ The Sci-Hub Coverage Study ([GitHub](https://github.com/greenelab/scihub-manuscript), [HTML manuscript](https://greenelab.github.io/scihub-manuscript/)) [@doi:10.7554/eLife.32822] + Michael Zietz's Report for the Vagelos Scholars Program ([GitHub](https://github.com/zietzm/Vagelos2017), [HTML manuscript](https://zietzm.github.io/Vagelos2017/)) [@doi:10.6084/m9.figshare.5346577] If you notice a problem with Manubot, it's best to submit an upstream fix to the appropriate repository: From 135f45193f2bd1f6ce2aa95e06973d19073286d5 Mon Sep 17 00:00:00 2001 From: Daniel Himmelstein Date: Fri, 2 Nov 2018 13:23:22 -0400 Subject: [PATCH 6/8] Ignore OS specific files Merges https://github.com/greenelab/manubot-rootstock/pull/142 --- .gitignore | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/.gitignore b/.gitignore index 6befc7c..ec87fe3 100644 --- a/.gitignore +++ b/.gitignore @@ -17,3 +17,17 @@ __pycache__/ # Misc temporary files *.bak +# Operating system specific files + +## Linux +*~ +.Trash-* + +## macOS +.DS_Store +._* +.Trashes + +## Windows +Thumbs.db +[Dd]esktop.ini From d6c85ed2feb32ce36f5f4af27e6fa54ed7e0cd5e Mon Sep 17 00:00:00 2001 From: Casey Greene Date: Mon, 12 Nov 2018 14:26:40 -0500 Subject: [PATCH 7/8] add building word doc --- .travis.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.travis.yml b/.travis.yml index 29420f4..317e9a6 100644 --- a/.travis.yml +++ b/.travis.yml @@ -4,6 +4,8 @@ language: generic branches: only: - master +env: + - BUILD_DOCX=true before_install: - wget https://repo.continuum.io/miniconda/Miniconda3-4.5.11-Linux-x86_64.sh --output-document miniconda.sh From 5d232e115fb09029974c01e826fb7aacb0e6707c Mon Sep 17 00:00:00 2001 From: Casey Greene Date: Mon, 12 Nov 2018 14:37:56 -0500 Subject: [PATCH 8/8] try to finesse the aim 1 shared space --- content/04.body.md | 308 ++++++++++++++++++++++----------------------- 1 file changed, 152 insertions(+), 156 deletions(-) diff --git a/content/04.body.md b/content/04.body.md index 0a87464..91052b6 100644 --- a/content/04.body.md +++ b/content/04.body.md @@ -1,220 +1,216 @@ ## Proposal Body (2000 words) -The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes -across individuals, tissues and disease states -- resolving differences to the level of -individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically -high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions +The Human Cell Atlas (HCA) provides unprecedented characterization of molecular phenotypes +across individuals, tissues and disease states -- resolving differences to the level of +individual cells. This dataset provides an extraordinary opportunity for scientific advancement, enabled by new tools to rapidly query, characterize, and analyze these intrinsically +high-dimensional data. To facilitate this, our seed network proposes to compress HCA data into fewer dimensions that preserve the important attributes of the original high dimensional data and yield -interpretable, searchable features. 
For transcriptomic data, compressing on the gene
dimension is most attractive: it can be applied to single samples, and genes often provide
information about other co-regulated genes or cellular attributes. We hypothesize that building
an ensemble of low-dimensional representations across latent space methods will provide a
reduced-dimensional space that captures biological sources of variability and is robust to
measurement noise. Our seed network will incorporate biologists and computer scientists from
five leading academic institutions who will work together to create foundational technologies
and educational opportunities that promote effective interpretation of low-dimensional
representations of HCA data. We will continue our active collaborations with other members of
the broader HCA network to integrate state-of-the-art latent space tools, portals, and
annotations to enable biological utilization of HCA data through latent spaces.

## Scientific Goals

We will create low-dimensional representations that provide search and catalog capabilities
for the HCA. Given both the scale of the data and the inherent complexity of biological
systems, we believe these approaches are crucial to the long-term success of the HCA. Our
**__central hypothesis__** is that these approaches will enable faster algorithms while
reducing the influence of technical noise. We propose to advance **__base enabling
technologies__** for low-dimensional representations.

First, we will identify techniques that learn interpretable, biologically-aligned
representations. We will consider both linear and non-linear techniques, as each may identify
distinct components of biological systems. For linear techniques, we rely on our Bayesian
non-negative matrix factorization method scCoGAPS [@doi:10.1101/378950,@doi:10.1101/395004]
(PIs Fertig & Goff). This technique learns biologically relevant features across contexts
and data modalities [@doi:10.1186/1471-2164-13-160,@doi:10.18632/oncotarget.12075,@doi:10.1007/978-1-62703-721-1_6,@doi:10.1186/s13073-018-0545-2,@doi:10.1101/378950],
including notably the HPN DREAM8 challenge [@doi:10.1038/nmeth.3773]. This technique is
specifically selected as a base enabling technology because its error distribution can
naturally account for measurement-specific technical variation
[@doi:10.1371/journal.pone.0078127] and because its prior distributions can incorporate
different feature quantifications or spatial information.
For non-linear needs, neural networks with multiple
layers provide a complementary path to low-dimensional representations
[@doi:10.1101/385534] (PI Greene) that model these diverse features of HCA data. We will
make use of the substantial progress that has already been made in both linear and non-linear
techniques (e.g., [@doi:10.1101/300681,@doi:10.1101/292037,@doi:10.1101/237065,@doi:10.1101/315556,@doi:10.1101/457879,@doi:10.1016/j.cell.2017.10.023,@doi:10.7717/peerj.2888,@doi:10.1101/459891])
and will rigorously evaluate emerging methods for incorporation into our search and catalog
tools. We will extend transfer learning methods, including ProjectR [@doi:10.1101/395004]
(PIs Goff & Fertig), to enable rapid integration, interpretation, and annotation of learned
latent spaces. The latent space team from the HCA collaborative networks RFA (including PIs
Fertig, Goff, Greene, and Patro) is establishing common definitions and requirements for
latent spaces for the HCA, as well as standardized output formats for low-dimensional
representations from distinct classes of methods.

Second, we will improve techniques for fast and accurate quantification. Existing approaches
for scRNA-seq data using tagged-end protocols (e.g. 10x Chromium, Drop-seq, inDrop,
etc.) do not account for reads mapping between multiple genes. This affects approximately
15-25% of the reads generated in a typical experiment, reducing quantification accuracy and
leading to systematic biases in gene expression estimates [@doi:10.1101/335000]. To address
this, we will build on our recently developed quantification method for tagged-end data,
which accounts for reads mapping to multiple genomic loci in a principled and consistent way
[@doi:10.1101/335000] (PI Patro), and extend it into a production-quality tool for
scRNA-seq preprocessing. Our tool will support: 1. Exploration of alternative models for
Unique Molecular Identifier (UMI) resolution. 2. Development of new approaches for quality
control and filtering using the UMI-resolution graph. 3.
Creation of a compressed and
indexable data structure for the UMI-resolution graph to enable direct access, query, and
fast search prior to secondary analysis.

We will implement these base enabling technologies and methods for search,
analysis, and latent space transformations as freely available, open-source software tools.
We will additionally develop platform-agnostic input and output data formats and standards
for latent space representations of the HCA data to maximize interoperability. The software
tools produced will be fast, scalable, and memory-efficient, leveraging the available
assets and expertise of the R/Bioconductor project (PIs Hicks & Love) as well as the
broader HCA community.

By using and extending our base enabling technologies, we will provide three principal
tools and resources for the HCA. These include 1) software to enable fast and accurate
search and annotation using low-dimensional representations of cellular features, 2) a
versioned and annotated catalog of latent spaces corresponding to signatures of cell types,
states, and biological attributes across the HCA, and 3) a short course and educational
materials that will increase the use and impact of low-dimensional representations and the
HCA in general.

### Aim 1

*Rationale:* The HCA provides a reference atlas of human cell types, states, and the
biological processes in which they engage.
The utility of the reference therefore requires -that one can easily compare references to each other, or a new sample to the compendium of -reference samples. Low-dimensional representations, because they compress the space, provide -the building blocks for search approaches that can be practically applied across very large -datasets such as the HCA. *We propose to develop algorithms and software for efficient +*Rationale:* The HCA provides a reference atlas to human cell types, states, and the +biological processes in which they engage. The utility of the reference therefore requires +that one can easily compare references to each other, or a new sample to the compendium of +reference samples. Low-dimensional representations, because they compress the space, provide +the building blocks for search approaches that can be practically applied across very large +datasets such as the HCA. *We propose to develop algorithms and software for efficient search over the HCA using low-dimensional representations.* The primary approach to search in low-dimensional spaces is straightforward: one -must create an appropriate low-dimensional representation and identify distance functions -that enable biologically meaningful comparisons. Ideal low-dimensional representations are -predicted to be much faster to search, and potentially more biologically relevant, as noise -can be removed. In this aim, we will evaluate novel low-dimensional representations to -identify those with optimal qualities of compression, noise reduction, and retention of -biologically meangful features. Current scRNA-Seq approaches require investigators to -perform gene-level quantification on the entirety of a new sample. We aim to enable search -during sample preprocessing, prior to gene-level quantification. This will enable in-line -annotation of cell types and states and identification of novel features as samples are -being processed. We will implement and evaluate techniques to learn and transfer shared -low-dimensional representations between the UMI-resolution graph and quantified samples, so -that samples where either component is available can be used for search and annotation -**[CASEY ADD SHARED LATENT SPACE REF]**. These UMI-graphs will be embedded in the prior of -scCoGAPS and architecture of non-linear latent space techniques. **[Do we need this line? -It's a bit more specific than the rest of the paragraph -LAG]** -**[I think we need something to link in how this fits to the latent space methods -- maybe not so specific, but something that ties it back beyond preprocessing - EJF]** - -Similarly to the approach by which comparisons to a reference genomes can identify specific -differences in a genome of interest, we will use low-dimensional representations from latent -spaces to define a reference transcriptome map (the HCA), and use this to quantify -differences in target transcriptome maps from new samples of interest. We will leverage -common low-dimensional representations and cell-to-cell correlation structure both within -and across transcriptome maps from Aim 2 to define this reference. Quantifying the -differences between samples characterized at the single-cell level reveals population or -individual level differences. +must create an appropriate low-dimensional representation and identify distance functions +that enable biologically meaningful comparisons. Ideal low-dimensional representations are +predicted to be much faster to search, and potentially more biologically relevant, as noise +can be removed. 
In this aim, we will evaluate novel low-dimensional representations to
identify those with optimal qualities of compression, noise reduction, and retention of
biologically meaningful features. Current scRNA-seq approaches require investigators to
perform gene-level quantification on the entirety of a new sample. We aim to enable search
during sample preprocessing, prior to gene-level quantification. This will enable in-line
annotation of cell types and states and identification of novel features as samples are
being processed. We will implement and evaluate techniques to learn and transfer shared
low-dimensional representations between read-based data (e.g., k-mer representations) and
quantified samples, so that samples where either quantified or read data is available can
be used for search and annotation [@url:https://github.com/greenelab/shared-latent-space].

Similar to the approach by which comparison to a reference genome can identify specific
differences in a genome of interest, we will use low-dimensional representations from latent
spaces to define a reference transcriptome map (the HCA), and use this map to quantify
differences in target transcriptome maps from new samples of interest. We will leverage
common low-dimensional representations and cell-to-cell correlation structure, both within
and across transcriptome maps from Aim 2, to define this reference. Quantifying the
differences between samples characterized at the single-cell level reveals population- or
individual-level differences.
**[<-- I'm not sure what this sentence means. Please clarify. - LAG]**
**[My take is that it means if we have an average from the catalogue we've built for a cell type or state, that deviations in particular samples could yield context-specific differences, not sure how to reword - EJF]**
Comparison of scRNA-seq maps from individuals with a particular phenotype to the HCA
reference, which would be computationally infeasible at the full scale of the HCA data,
becomes tractable in these low-dimensional spaces. We (PI Hicks) have extensive experience
dealing with the distributions of cell expression within and between individuals
[@pmid:26040460], which will be critical for defining an appropriate metric to compare
references in latent spaces. We plan to implement and evaluate linear mixed models to
account for the correlation structure within and between transcriptome maps. This
statistical method will be fast, memory-efficient, and scalable to billions of cells using
low-dimensional representations.

### Aim 2

*Rationale:* Biological systems are composed of diverse cell types and states with
overlapping molecular phenotypes. Furthermore, biological processes are often reused with
modifications across cell types. Low-dimensional representations can identify these shared
features, independent of the total distance between cells in gene expression space, across
large collections of data, including the HCA. We will evaluate and select methods that
define latent spaces that reflect discrete biological processes or cellular features.
These latent
spaces can be shared across different biological systems and can reveal context-specific
divergence such as pathogenic differences in disease. *We propose to establish a central
catalog of cell types, states, and biological processes derived from low-dimensional
representations of the HCA.*

Establishing a catalog of cellular features using low-dimensional representations can
reduce noise and aid in biological interpretability. However, there are currently no
standardized, quantitative metrics to determine the extent to which low-dimensional
representations capture generalizable biological features. We have developed new transfer
learning methods to quantify the extent to which latent space representations from one
set of training data are represented in another [@doi:10.1101/395004,@doi:10.1101/395947]
(PIs Greene, Goff & Fertig). These provide a strong foundation to compare different
low-dimensional representations and techniques for learning and transferring knowledge
between them. [**<-- didn't understand what was here before too well, please make sure I didn't muck with the meaning too much.**]
Generalizable representations should transfer across datasets of related biological
contexts, while representations of noise will not. In addition, we have found that combining
multiple representations can better capture biological processes across scales
[@doi:10.1016/j.cels.2017.06.003], and that representations across scales capture distinct,
valid biological signatures [@doi:10.1371/journal.pone.0078127]. Therefore, we will
establish a catalog consisting of low-dimensional features learned across both
linear and non-linear methods from our base enabling technologies and proposed extensions
in Aim 1.

We will package and version low-dimensional representations and annotate these
representations based on their corresponding cellular features (e.g.
cell type, tissue,
biological process) and deliver these as structured data objects in Bioconductor as well as
in platform-agnostic data formats. Where applicable, we will leverage the computational
tools previously developed by Bioconductor for single-cell data access to the HCA, data
representation (`SingleCellExperiment`, `beachmat`, `LinearEmbeddingMatrix`, `DelayedArray`,
`HDF5Array`, and `rhdf5`), and assessment and amelioration of data quality (`scater`,
`scran`, `DropletUtils`). We are core package developers and power users of Bioconductor
(PIs Hicks and Love) and will support on-the-fly downloading of these materials via the
*AnnotationHub* framework. To enable reproducible research leveraging the HCA, we will
implement a content-based versioning system, which identifies versions of the reference
cell type catalog by applying a hash function to the gene weights and transcript nucleotide
sequences. We (PIs Love and Patro) previously developed a hash-based versioning and
provenance detection framework for bulk RNA-seq that supports reproducible computational
analyses and has proven to be successful [@doi:10.18129/B9.bioc.tximeta]. Our versioning
and dissemination of reference cell type catalogs will help to avoid scenarios where
researchers report on matches to a certain cell type in the HCA without precisely
specifying which definition of that cell type they used. We will develop *F1000Research*
workflows demonstrating how HCA-defined reference cell types and the tools developed in
this RFA can be used within a typical genomic data analysis. This catalog will serve as
the basis for defining references for cell types and states, and for identifying
individual-specific differences with the linear models proposed in Aim 1.

### Aim 3

*Rationale:* Low-dimensional representations of scRNA-seq and HCA data make tasks faster
and provide interpretable summaries of complex high-dimensional cellular features.
The HCA
data-associated methods and workflows will be valuable to many biomedical fields, but their
use will require an understanding of basic bioinformatics, scRNA-seq, and how the tools
being developed work. Furthermore, researchers will need exposure to the conceptual basis
of low-dimensional interpretations of biological systems. This aim addresses these needs
in three ways.

First, we will develop a bioinformatics training program for biologists at all levels,
including those with no experience in bioinformatics. Lecture materials will be extended
from those used in previous bioinformatics courses we (PI Hampton) have run at
Mount Desert Island Biological Laboratory, the University of Birmingham, UK, and the Geisel
School of Medicine at Dartmouth since 2009. These courses have trained over 400 scientists
in basic bioinformatics and consistently achieve approval ratings of over 90%. We believe
part of the success of these learning experiences stems from our instructional paradigm,
which includes a very challenging course project coupled with one-on-one support from
instructors. We will develop a new curriculum specifically tailored to the HCA that
incorporates: 1) didactic course material on single-cell gene expression profiling
(PI Goff), 2) machine learning methods (PI Greene), 3) statistics for genomics (PIs Fertig
and Hicks), 4) search and analysis in low-dimensional representations, and 5) tools
developed by our group in response to this RFA.

Second, the short course will train not only students but also instructors. Our one-on-one
approach to course projects will require a high instructor-to-student ratio. We will
therefore recruit former participants of this class to return in subsequent years, first
as teaching assistants, and later as module presenters.
We have found that course alumni are
eager to improve their teaching resumes, that they learn the material in a new way as they
begin to teach it, and that they are an invaluable resource in understanding how to improve
the course over time. Part of our strategy is to support this community, which includes
many people who will drive the next wave of innovation. All of our course materials will
be freely available, enabling course participants to bring what they learned home with
them. A capstone session will be included in which we will provide suggestions about how
the materials presented in the course can be incorporated into existing course curricula.
Course faculty will be available to assist with integration efforts after the course.
Finally, the short course will facilitate scientific collaborations by engaging
participants in applying these tools to collaborative research efforts.

**[I feel like we are missing a concluding summary of broader impacts to pull this together - could be a brief bulleted summary of tools required by app as Andrew suggested - EJF]**