add sequence data sharing #22

Closed
mdoering opened this issue Dec 4, 2019 · 9 comments

Comments

@mdoering
Member

mdoering commented Dec 4, 2019

For reference libraries and barcoding we need to share actual sequence data. Discuss how best to do it and whether specimen entities are needed. Is a single sequence always a specimen, so that the two are the same thing?

@mjy

mjy commented Dec 4, 2019

I personally would not overlap the sequence and specimen concepts. We use an "origin" edge to relate things like specimens, sequences, extracts, parts etc. Considering specimens and sequences to be the same will ultimately force your code to handle disparate attributes with a lot of exception handling, creating spaghetti in the logic. I.e. consider an include rather than inherit model for shared attributes.
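A minimal Python sketch of the include-rather-than-inherit idea with an "origin" edge. All class and field names here (`Provenance`, `Specimen`, `Sequence`, `Origin`, `derived_from`) are hypothetical illustrations, not the data model referred to above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Provenance:
    """Shared attributes, included by each entity instead of inherited."""
    source_uri: Optional[str] = None
    remarks: Optional[str] = None


@dataclass
class Specimen:
    id: str
    collector: Optional[str] = None
    provenance: Provenance = field(default_factory=Provenance)


@dataclass
class Sequence:
    id: str
    bases: str
    provenance: Provenance = field(default_factory=Provenance)


@dataclass(frozen=True)
class Origin:
    """Directed edge: `derived_id` originates from `source_id`."""
    source_id: str
    derived_id: str


def derived_from(source_id: str, edges: List[Origin]) -> List[str]:
    """Return the ids of everything that originates from `source_id`."""
    return [e.derived_id for e in edges if e.source_id == source_id]


# One specimen with two sequences (assumed example data).
specimen = Specimen(id="spec-1")
seqs = [Sequence(id="seq-1", bases="ACGT"), Sequence(id="seq-2", bases="TTGA")]
edges = [Origin("spec-1", "seq-1"), Origin("spec-1", "seq-2")]
```

Because `Sequence` is not a subclass of `Specimen`, neither type has to carry or special-case the other's attributes; the relation lives only in the edge list.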

@thomasstjerne
Contributor

thomasstjerne commented Dec 4, 2019

Agree with @mjy: you often have several sequences for one specimen, e.g. from different regions, such as: https://www.gbif.org/occurrence/2251571184

@mdoering
Member Author

mdoering commented Dec 4, 2019

Does it make sense to link the sequence to the taxon directly, or must it go through a specimen? My prime use case is still sharing OTU reference libraries, but I'm a bit scared of modelling full specimens.

@mjy

mjy commented Dec 4, 2019

You definitely need to be able to link sequences to a taxon alone, because there are cases where there is no specimen metadata (unfortunately way too many).

Whether this means you need to track Specimen is unclear. I'd suggest you could just add a sequence attribute, some URI, that points to a specimen concept perhaps.

The other canonical case of requiring Specimen is for type material, but this isn't modelled yet.

Frankly both are major extensions to CoL, that bring many, many rabbit-holes.

Maybe you just need attributes on Taxon that target specific kinds of sequences, like CO1 barcodes. Perhaps just an array of GenBank or BOLD identifiers would suffice. Though the meaning of these sequences without a procedure etc. is also unclear.

Perhaps what you really only need is a BOLD BIN (or equivalent) attribute, a URI that points to that BIN, in Taxon.
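The two lightweight options above (an array of identifiers, or a single BIN URI on Taxon) could look roughly like this. This is only a sketch: the attribute names `barcode_ids` and `bin_uri`, the helper `has_molecular_link`, and all values are made-up placeholders, not existing CoL fields or real accessions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxon:
    id: str
    scientific_name: str
    # Hypothetical: identifiers of barcode sequences held elsewhere
    # (e.g. GenBank or BOLD); the sequences themselves are not stored.
    barcode_ids: List[str] = field(default_factory=list)
    # Hypothetical: URI pointing to the BIN (or equivalent cluster).
    bin_uri: Optional[str] = None


def has_molecular_link(taxon: Taxon) -> bool:
    """True if the taxon carries any identifier array entry or a BIN URI."""
    return bool(taxon.barcode_ids) or taxon.bin_uri is not None


# Placeholder values only; not real accessions or a real BIN URI.
taxon = Taxon(
    id="t1",
    scientific_name="Aus bus",
    barcode_ids=["ACC0001"],
    bin_uri="https://example.org/bin/0001",
)
```

Neither attribute needs a Specimen or Sequence entity, which is what keeps this variant small.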

@thomasstjerne
Contributor

For Linnean names, the only straightforward case is when you have a sequence from the holotype. Then you could in principle associate the sequence with the name, and if it is the accepted name or the basionym of the accepted name, then the sequence represents the species.
This is also where the BINs and SHs are handy, because a sequence will always be a representative for the BIN / SH it sits in and therefore it can be directly associated with the BIN / SH (= Taxon name).

If we want sequences related to Linnean names without a holotype sequence, specimens become important too. When there is no holotype sequence, the metadata of the sequence is important to assess whether it is actually a good representative for the species. Latitude and longitude are important, but also who identified the specimen, i.e. who claims that this sequence belongs to the taxon. Also images and of course the type status of the specimen, if any.

Take an example like the common chanterelle: if you look at a distribution map from GBIF, it is widespread in both Europe and the US.
However, studies in recent years have shown that there is no genetic evidence that the common European chanterelle occurs in the US at all.
The species was described in 1821 by Fries, and therefore there is no holotype, i.e. we don't know which of the many chanterelle species the name actually applies to. However, to justify that a sequence is a good representative for the species, the specimen should be collected in Central Sweden in coniferous forest.

> Perhaps what you really only need is a BOLD BIN (or equivalent) attribute, a URI that points to that BIN, in Taxon.

I guess we will have the OTUs (BINs / SHs) as children of the Linnean taxon in the extended catalogue, as we already have in the GBIF Backbone.

@mdoering
Member Author

mdoering commented Dec 5, 2019

But if we only ever want sequences for OTUs (BOLD, UNITE, SILVA) then we could attach them to the taxon directly, right? This is currently my only real use case. But then again the Lepidoptera community clearly said they need to deal with barcodes in their daily business. But that would mean specimens.

Maybe a start is a sequence entity on its own that can be used both from a taxon directly (immediate use) and later by an upcoming specimen/material citation table.
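A rough sketch of such a standalone sequence entity: it can point at a taxon right away and, optionally, at a material record once that table exists. The names `SequenceRecord`, `taxon_id`, `material_id` and `sequences_by_taxon` are assumptions for illustration, as is the rule that at least one of the two links must be present.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SequenceRecord:
    id: str
    sequence: str
    taxon_id: Optional[str] = None     # direct link, usable immediately
    material_id: Optional[str] = None  # link to a later specimen/material table

    def __post_init__(self) -> None:
        # Assumed rule: a sequence must hang off something.
        if self.taxon_id is None and self.material_id is None:
            raise ValueError("sequence needs a taxon_id or a material_id")


def sequences_by_taxon(records: List[SequenceRecord]) -> Dict[str, List[str]]:
    """Group sequence ids by the taxon they are linked to directly."""
    grouped: Dict[str, List[str]] = {}
    for rec in records:
        if rec.taxon_id is not None:
            grouped.setdefault(rec.taxon_id, []).append(rec.id)
    return grouped


records = [
    SequenceRecord("s1", "ACGT", taxon_id="otu-1"),
    SequenceRecord("s2", "GGCA", taxon_id="otu-1", material_id="m-9"),
    SequenceRecord("s3", "TTAA", material_id="m-10"),
]
```

The OTU use case only ever fills `taxon_id`; the barcoding use case could later fill `material_id` without changing the entity.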

@mdoering
Member Author

mdoering commented Dec 5, 2019

@thomasstjerne do you already have an idea about suitable fields?

@thomasstjerne
Contributor

thomasstjerne commented Dec 5, 2019

In principle it is MIxS; the most important fields are:

  • sequence
  • target_gene
  • target_subfragment
  • pcr_primers (these are supposed to be in one field in MIxS whereas GGBN has separate fields for forward and reverse)
  • url (to GenBank, BOLD or similar)

Actually we didn't include the lat_lon field in the extension, as it was intended for Occurrences / Specimens.

MIxS seems to be a standard for sequence metadata, but it collapses some things like primers, lat_lon etc. into single fields, which is probably not a good idea.
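To illustrate what collapsed fields cost a consumer, here is a small sketch that splits them back into separate values. It assumes the combined strings look like `FWD:<primer>;REV:<primer>` and `<latitude> <longitude>` in decimal degrees; check the MIxS checklist for the exact syntax. The function names are made up for this example.

```python
from typing import Dict, Tuple


def split_pcr_primers(value: str) -> Dict[str, str]:
    """Split an assumed 'FWD:...;REV:...' string into labelled primers."""
    primers: Dict[str, str] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        label, sep, primer = part.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in primer part: {part!r}")
        primers[label.strip().upper()] = primer.strip()
    return primers


def split_lat_lon(value: str) -> Tuple[float, float]:
    """Split an assumed '<lat> <lon>' decimal-degree string into floats."""
    lat_text, lon_text = value.split()
    return float(lat_text), float(lon_text)


primers = split_pcr_primers("FWD:GTGCCAGCMGCCGCGGTAA;REV:GGACTACHVGGGTWTCTAAT")
lat, lon = split_lat_lon("50.586825 6.408977")
```

With separate forward and reverse fields, as in GGBN, none of this parsing or its error handling would be needed.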

@mdoering
Member Author

closing this in favor of the older duplicate #12


3 participants