add sequence data sharing #22

mdoering · 2019-12-04T14:59:55Z

For reference libraries and barcoding we need to share actual sequence data. Discuss how best to do it and whether specimen entities are needed. Is a single sequence always a specimen, so that it's the same?

mjy · 2019-12-04T15:07:40Z

I personally would not overlap the sequence and specimen concepts. We use an "origin" edge to relate things like specimens, sequences, extracts, parts etc. Considering specimens and sequences to be the same will ultimately force your code to handle disparate attributes with a lot of exception handling, creating spaghetti in the logic. I.e. consider an include rather than inherit model for shared attributes.
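A minimal sketch of the "include rather than inherit" idea with an origin edge, for illustration only. All class and field names here (`Provenance`, `OriginEdge`, `derived_from`, the ids) are hypothetical and not taken from any existing code base.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    """Attributes shared by several entity types, included rather than inherited."""
    source_url: str = ""
    remarks: str = ""


@dataclass
class Specimen:
    id: str
    collector: str
    provenance: Provenance = field(default_factory=Provenance)


@dataclass
class Sequence:
    id: str
    target_gene: str
    provenance: Provenance = field(default_factory=Provenance)


@dataclass(frozen=True)
class OriginEdge:
    """Directed edge: `derived_id` originates from `source_id`."""
    source_id: str
    derived_id: str


def derived_from(edges, source_id):
    """Return the ids of everything that directly originates from `source_id`."""
    return [e.derived_id for e in edges if e.source_id == source_id]


# One specimen with two sequences; the ids are placeholders.
specimen = Specimen(id="spec-1", collector="hypothetical collector")
edges = [OriginEdge("spec-1", "seq-1"), OriginEdge("spec-1", "seq-2")]
print(derived_from(edges, specimen.id))
```

Specimen and Sequence stay separate types with their own attributes; only the shared block is reused, and the relation between them lives in the edge list.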

thomasstjerne · 2019-12-04T15:21:44Z

Agree with @mjy - you can often have several sequences for one specimen, i.e. from different regions, such as: https://www.gbif.org/occurrence/2251571184

mdoering · 2019-12-04T16:16:28Z

Does it make sense to link the sequence to the taxon, or must it be through a specimen? My prime use case is still sharing OTU reference libraries, but I'm a bit scared of modelling full specimens.

mjy · 2019-12-04T16:25:06Z

You definitely need to be able to link sequences to a taxon alone, because there are cases where there is no specimen metadata (unfortunately way too many).

Whether this means you need to track Specimen is unclear. I'd suggest you could just add a sequence attribute, some URI, that points to a specimen concept perhaps.

The other canonical case of requiring Specimen is for type material, but this isn't modelled yet.

Frankly both are major extensions to CoL, that bring many, many rabbit-holes.

Maybe you just need attributes on Taxon that target specific kinds of sequences, like CO1 barcodes. Perhaps just an array of GenBank or BOLD identifiers would suffice. Though the meaning of these sequences without a procedure etc. is also unclear.

Perhaps what you really only need is a BOLD BIN (or equivalent) attribute, a URI that points to that BIN, in Taxon.
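As a rough illustration of that lightweight option, a Taxon could carry just a list of accession identifiers and an optional BIN URI. The attribute names `barcode_ids` and `bin_uri` and all values below are hypothetical placeholders, not existing CoL fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxon:
    id: str
    scientific_name: str
    # Hypothetical: accession identifiers (e.g. GenBank or BOLD) of barcode sequences.
    barcode_ids: List[str] = field(default_factory=list)
    # Hypothetical: URI pointing to a BOLD BIN or an equivalent cluster.
    bin_uri: Optional[str] = None


def has_molecular_link(taxon: Taxon) -> bool:
    """True if the taxon points to at least one sequence or to a BIN."""
    return bool(taxon.barcode_ids) or taxon.bin_uri is not None


# Placeholder values only; no real accession or BIN is implied.
linked = Taxon("t1", "Aus bus", barcode_ids=["ACCESSION-1"],
               bin_uri="https://example.org/bin/PLACEHOLDER")
unlinked = Taxon("t2", "Aus cus")
print(has_molecular_link(linked), has_molecular_link(unlinked))
```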

thomasstjerne · 2019-12-05T10:45:41Z

For Linnean names, the only straightforward case is when you have a sequence from the holotype. Then you could in principle associate the sequence with the name, and if it is the accepted name or the basionym of the accepted name, then the sequence represents the species.
This is also where the BINs and SHs are handy, because a sequence will always be a representative of the BIN / SH it sits in and can therefore be directly associated with the BIN / SH (= taxon name).

If we want sequences related to Linnean names without a holotype sequence, specimens become important too. When there is no holotype sequence, the metadata of the sequence is important to assess whether it is actually a good representative of the species. Latitude and longitude are important, but also who identified the specimen, i.e. who claims that this sequence belongs to the taxon. Also images and of course the type status of the specimen, if any.

Take an example like the common chanterelle: if you look at a distribution map from GBIF, it is widespread in both Europe and the US.
However, studies in recent years have shown that there is no genetic evidence that the common European chanterelle occurs in the US at all.
The species was described in 1821 by Fries, and therefore there is no holotype, i.e. we don't know which of the many chanterelle species the name actually applies to. However, to justify that a sequence is a good representative of the species, the specimen should have been collected in Central Sweden in coniferous forest.

> Perhaps what you really only need is a BOLD BIN (or equivalent) attribute, a URI that points to that BIN, in Taxon.

I guess we will have the OTUs (BINs / SHs) as children of the Linnean taxon in the extended catalogue, as we already have in the GBIF Backbone.

mdoering · 2019-12-05T10:55:41Z

But if we only ever want sequences for OTUs (BOLD, UNITE, SILVA) then we could attach them to the taxon directly, right? This is currently my only real use case. But then again the Lepidoptera community clearly said they need to deal with barcodes in their daily business. But that would mean specimens.

Maybe a start is a sequence entity on its own that can be used both from a taxon directly (immediate use) and later by an upcoming specimen/material citation table.
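A minimal sketch of such a standalone sequence entity, assuming it may point to a taxon, to a specimen, or to both. The names `SequenceRecord`, `taxon_id` and `specimen_id` are hypothetical, and the rule that at least one link must be present is an assumption made for illustration, not something decided in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SequenceRecord:
    id: str
    sequence: str
    # Direct link for the immediate use case (OTU reference libraries).
    taxon_id: Optional[str] = None
    # Link to a future specimen / material citation table.
    specimen_id: Optional[str] = None

    def __post_init__(self):
        # Assumption: a sequence with neither link is not useful in the catalogue.
        if self.taxon_id is None and self.specimen_id is None:
            raise ValueError("a sequence needs a taxon or a specimen link")


# Usable today from a taxon, and later from a specimen, without changing the entity.
otu_seq = SequenceRecord(id="s1", sequence="ACGT", taxon_id="otu-1")
specimen_seq = SequenceRecord(id="s2", sequence="ACGT", specimen_id="spec-1")
print(otu_seq.taxon_id, specimen_seq.specimen_id)
```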

mdoering · 2019-12-05T10:57:08Z

@thomasstjerne do you already have an idea about suitable fields?

thomasstjerne · 2019-12-05T11:33:22Z

In principle it is MIxS; the most important fields are:

sequence
target_gene
target_subfragment
pcr_primers (these are supposed to be in one field in MIxS whereas GGBN has separate fields for forward and reverse)
url (to GenBank, BOLD or similar)

Actually we didn't include the lat_lon field in the extension, as it was intended for Occurrences / Specimens.

MIxS seems to be a standard for sequence metadata, but it collapses some things like primers, lat_lon etc. into single fields, which is probably not a good idea.
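To illustrate what un-collapsing the primer field could look like, here is a small sketch. It assumes the combined `pcr_primers` value is written as `FWD:<sequence>;REV:<sequence>`; that syntax should be checked against the MIxS specification. The function name `split_pcr_primers` is hypothetical and the primer strings are placeholders.

```python
from typing import Dict


def split_pcr_primers(value: str) -> Dict[str, str]:
    """Split a combined primer string into one entry per label (e.g. FWD, REV)."""
    primers = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        label, sep, seq = part.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in primer entry: {part!r}")
        primers[label.strip().upper()] = seq.strip()
    return primers


# Placeholder primer sequences, not real primers.
print(split_pcr_primers("FWD:ACGT;REV:TTGCA"))
```

Separate forward and reverse fields, as in GGBN, would make this parsing step unnecessary.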

mdoering · 2019-12-10T16:20:49Z

closing this in favor of the older duplicate #12

mdoering closed this as completed Dec 10, 2019
