Commit

VIRION-SRA removal
Remove all text that indicates the VIRION-SRA dataset exists and is part of the main dataset
Colin J. Carlson authored May 30, 2021
1 parent 8654517 commit f46c995
Showing 1 changed file with 2 additions and 33 deletions.
35 changes: 2 additions & 33 deletions README.md
@@ -21,7 +21,6 @@ VIRION aggregates five major sources of information, three of which are dynamic
- The [public data](https://healthmap.org/predict) released by the USAID Emerging Pandemic Threats PREDICT program.
- GLOBI\*, the [Global Biotic Interactions](http://globalbioticinteractions.org/) database.
- NCBI GenBank\*, specifically the entirety of NCBI Virus accessions stored in the Nucleotide database.
- NCBI Sequence Read Archive\*, which includes metagenomic samples that have undergone taxonomic analysis. In a sense, these interactions are **predictions, not observations**, as viral identity has not been confirmed through genome assembly, and a number of contaminants could lead to false positives (see below). As such, we recommend the exclusion of these data except in cases where users are able to navigate these complexities.

<p align = "center">
<img src="Figures/VIRIONworkflow.jpg" width="5000">
@@ -56,7 +55,6 @@ It's that simple! Here's a few small tips and tricks you should know:
- Some valid records have NA's in their taxonomy; for example, if an unclassified _Betacoronavirus_ is found in a mouse, it might be recorded as NA in the "Virus" field. This is an intentional feature, as it enables researchers to talk about higher-level taxonomic patterns, and [some studies](https://www.biorxiv.org/content/10.1101/2020.05.22.111344v3) may not need fully-resolved data.
- Sometimes, you'll see taxonomy that's outdated or strange. If you think there's an error, please open an issue on GitHub. Before you do, it may be worth checking whether a given name is correctly resolved to the NCBI taxonomy; for example, in R, you can use `taxize::classification("Whateverthe latinnameis", db = "ncbi")`. If the issue is related to that taxonomic backbone, please label your issue `ncbi-needed`.
- Different databases may have overlapping records. For example, some PREDICT records are deposited in GenBank, and some GenBank records are inherited by EID2. As different data has passed between these sources, they've often lost some metadata. Presence in different datasets therefore does not indicate stronger / weaker evidence, and conversely, conflicting evidence between databases may not be indicative of any biological evidence.
- If you don't read the rest of this `README.md`, cut all SRA records from the dataset before beginning any analysis.

### File organization and assembly

@@ -71,46 +69,17 @@ Like most datasets that record host-virus associations, this includes a mix of d

- As a starting point, you can remove any records that aren't taxonomically resolved to the NCBI backbone (`HostNCBIResolved == FALSE, VirusNCBIResolved == FALSE`). We particularly suggest this for data that come from other databases that also aggregate content but use multiple taxonomic backbones, which may include invalid names that are not updated.

- You should also be wary of records with a flag that indicates host identification by researchers was uncertain (`HostFlagID == TRUE`) or that indicates the virus in a particular metagenomic sample is a common contaminant or false positive (`VirusFlagContaminant == TRUE`). The former is usually derived from source data, while the latter is provided as part of VIRION's quality control efforts.
- You should also be wary of records with a flag that indicates host identification by researchers was uncertain (`HostFlagID == TRUE`).

- Limiting evidence standards based on diagnostic standards (e.g., using Nucleotide and Isolation/Observation records, but no Antibodies or k-mer) or based on redundancy (i.e., number of datasets that record an association) can also lead to stronger results.

- We encourage particular caution with regard to the validity of virus names. Although the NCBI and ICTV taxonomies are updated against each other, valid NCBI names are not guaranteed to be ICTV-valid species-level designations, and many may include sampling metadata. We recommend that researchers manually curate names where possible, but simple rubrics can also be used to filter out questionable names. For example, in the list of NCBI-accepted betacoronavirus names, eliminating all virus names that include a "/" (e.g., using `stringr::str_detect()`) will remove many lineage-specific records ("bat coronavirus 2265/philippines/2010", "coronavirus n.noc/vm199/2007/nld") and leave behind cleaner names ("alpaca coronavirus") but won't necessarily catch everything ("bat coronavirus ank045f"). Another option is to limit analysis to viruses that are ICTV ratified (`ICTVRatified == TRUE`), but this is particularly conservative and will exclude a much larger number of valid virus names.

- Finally, we encourage every researcher who uses this data to make a deliberate choice about the use of metagenomic data (see below).
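
The filtering suggestions in the list above could be combined roughly as follows. This is only a minimal sketch, not part of VIRION itself: the toy `virion` table and its rows are invented for illustration, the `Host` column name is an assumption, and the remaining column names (`Virus`, `HostNCBIResolved`, `VirusNCBIResolved`, `HostFlagID`) are taken from the text above. Records with `NA` in the `Virus` field are kept deliberately, since the README treats them as valid.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the VIRION table; every row here is invented for illustration.
virion <- tibble(
  Host = c("mus musculus", "vicugna pacos", "rhinolophus affinis",
           "myotis lucifugus", "homo sapiens"),
  Virus = c(NA, "alpaca coronavirus", "bat coronavirus 2265/philippines/2010",
            "bat coronavirus ank045f", "alpaca coronavirus"),
  HostNCBIResolved  = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  VirusNCBIResolved = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  HostFlagID        = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)

virion_clean <- virion %>%
  # keep only records resolved to the NCBI backbone on both sides
  filter(HostNCBIResolved, VirusNCBIResolved) %>%
  # drop records where host identification was flagged as uncertain
  filter(!HostFlagID) %>%
  # drop lineage-style virus names containing "/", but keep NA virus names
  filter(is.na(Virus) | !str_detect(Virus, fixed("/")))

print(virion_clean)
```

The explicit `is.na(Virus)` check matters because `dplyr::filter()` drops rows where the condition evaluates to `NA`, which would otherwise silently remove the unresolved-but-valid records.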

### A special note about the metagenomics

Unlike nearly every dataset familiar to disease ecologists, this dataset includes a mix of known interactions (e.g., PCR detection of virus) and _predicted interactions_ (the _k_-mer analysis conducted on the Sequence Read Archive). As such, the data cannot be used safely and uncritically off the shelf, **it should not be indexed in other datasets of known interactions**, and it should be carefully evaluated by researchers with attention to the mix of data standards.

<img src="Figures/SRA.jpeg" align="right" width="400">

To generate the SRA component of the dataset, we used a _k_-mer based [taxonomy tool](https://www.biorxiv.org/content/10.1101/2021.02.16.431451v1) which identifies the number of virus "hits" in a host sample. Starting from an analysis of every possible pairwise combination (given in `SRA_as_Edgelist.zip` for any interested researchers), we identified the maximum number of hits for a given pair. Subsequently, we use the CLOVER data as a ground truth (for "known associations" or not) to identify the most plausible associations, with a threshold selected that maximizes the kappa statistic. This returns the top ~1% of possible associations (see the figure to the right). Given that most matches are likely to be imprecise, we currently restrict the taxonomy of these matches to the genus level and above (though species matches can be reconstructed from the VirusTaxID field at this time).

Even though this analysis is incredibly conservative, it is in a preliminary state, and only _predicts_ possible associations. (These could later be confirmed, for example, by assembling viral genomes from the original SRA samples.) As such, users should be _very_ careful about whether or not they include these interactions in their analysis. Example problems they might encounter:

- The prediction could, simply, be a false positive. This is an imprecise method.

- The highest scoring match might be a known relative of an unknown virus (for example, in a sample with a novel bat betacoronavirus, the highest score might be returned for SARS-CoV). This is partially solved by removing the Virus names, but not entirely.

- The score might be a product of technological issues or cross-contamination (see the VirusFlagContaminant field).

As such, users may want to remove all of these records entirely from the dataset, which can be done in a single line of code using the `DetectionMethod` or `Database` columns, e.g.,

```
library(tidyverse); library(magrittr)
virion %<>% filter(!(DetectionMethod == "kmer")) # option 1
virion %<>% filter(!(Database == "SRA")) # option 2 (currently equivalent)
```

Other, more advanced users may be interested in using the entire edge list of possible host-virus associations in SRA, which is found in `SRA_as_Edgelist.zip`. Alternate scoring methods that are less conservative will include many more false positives, but also potentially more true positives. In the long term, we hope to develop score metrics that are more informative but still easily incorporated into the VIRION architecture.

# Additional information
### Citing VIRION

Please do not use VIRION for published research yet! This is only a beta release and probably contains a _number_ of bugs that we still need to fix. (Similarly, so those bugs don't escape our orbit, please don't reproduce the data elsewhere yet!)

### Contact
- For general questions about VIRION, please reach out to [Colin Carlson]([email protected]) or [Gregory Albery]([email protected]).
- For specific questions about the SRA data, please contact [Timothée Poisot]([email protected]) or [Ryan Connor]([email protected]).
- For general questions about VIRION, please reach out to [Colin Carlson]([email protected]).
- For specific questions about the CLOVER dataset, please contact [Rory Gibb]([email protected]).
