add sequence data sharing #22
Comments
I personally would not overlap the sequence and specimen concepts. We use an "origin" edge to relate things like specimens, sequences, extracts, parts, etc. Treating specimens and sequences as the same thing will ultimately force your code to handle disparate attributes with a lot of exception handling, creating spaghetti in the logic. In other words, consider an include model rather than an inherit model for shared attributes.
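A minimal sketch of the include-rather-than-inherit idea with an "origin" edge. All class and field names here (`SharedAttributes`, `Specimen`, `Sequence`, `OriginEdge`, `derived_from`) are hypothetical illustrations, not part of any existing schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SharedAttributes:
    """Attributes common to specimens, sequences, extracts, parts (hypothetical)."""
    identifier: str
    remarks: Optional[str] = None


@dataclass
class Specimen:
    # Shared attributes are included as a field, not inherited from a base class.
    shared: SharedAttributes
    catalog_number: Optional[str] = None


@dataclass
class Sequence:
    shared: SharedAttributes
    marker: Optional[str] = None  # e.g. a gene region name
    bases: Optional[str] = None


@dataclass(frozen=True)
class OriginEdge:
    """Directed edge: the entity `derived_id` originates from `source_id`."""
    derived_id: str
    source_id: str


def derived_from(source_id: str, edges: List[OriginEdge]) -> List[str]:
    """Return identifiers of all entities whose origin is `source_id`."""
    return [e.derived_id for e in edges if e.source_id == source_id]
```

Because the relation lives in the edges, one specimen can be the origin of any number of sequences without either type needing to know the other's attributes.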
Agree with @mjy: you can often have several sequences for one specimen, e.g. from different regions, as in https://www.gbif.org/occurrence/2251571184
Does it make sense to link the sequence to the taxon, or must it go through a specimen? My prime use case is still sharing OTU reference libraries, but I'm a bit scared of modelling full specimens.
You definitely need to be able to link sequences to a taxon alone, because there are cases where there is no specimen metadata (unfortunately far too many). Whether this means you need to track Specimen is unclear. I'd suggest you could just add a sequence attribute, some URI, that points to a specimen concept. The other canonical case requiring Specimen is type material, but this isn't modelled yet. Frankly, both are major extensions to CoL that bring many, many rabbit holes. Maybe you just need attributes on Taxon that target specific kinds of sequences, like CO1 barcodes. Perhaps just an array of GenBank or BOLD identifiers would suffice, though the meaning of these sequences without a procedure etc. is also unclear. Perhaps all you really need is a BOLD BIN (or equivalent) attribute in Taxon, a URI that points to that BIN.
For Linnean names, the only straightforward case is when you have a sequence from the holotype. Then you could in principle associate the sequence with the name, and if it is the accepted name or the basionym of the accepted name, then the sequence represents the species. If we want sequences related to Linnean names without a holotype sequence, specimens become important too. When there is no holotype sequence, the metadata of the sequence is important for assessing whether it is actually a good representative of the species. Latitude and longitude are important, but so is who identified the specimen, i.e. who claims that this sequence belongs to the taxon. Images matter too, and of course the type status of the specimen, if any. Take an example like the common chanterelle: if you look at a distribution map from GBIF, it is widespread in both Europe and the US.
I guess we will have the OTUs (BINs / SHs) as children of the Linnean taxon in the extended catalogue, as we already have in the GBIF Backbone.
But if we only ever want sequences for OTUs (BOLD, UNITE, SILVA), then we could attach them to the taxon directly, right? This is currently my only real use case. Then again, the Lepidoptera community clearly said they need to deal with barcodes in their daily business, and that would mean specimens. Maybe a start is a sequence entity of its own that can be used both from a taxon directly (immediate use) and later by an upcoming specimen/material citation table.
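A rough sketch of such a standalone sequence entity, assuming a required taxon link and an optional link reserved for a future specimen/material table. The names `SequenceRecord`, `taxon_id`, `material_id`, and `sequences_for_taxon` are hypothetical and not taken from any existing schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SequenceRecord:
    """Standalone sequence entity (hypothetical field names)."""
    id: str
    taxon_id: str                      # direct link to a taxon, e.g. an OTU
    bases: str                         # the actual sequence data
    marker: Optional[str] = None       # e.g. a barcode region
    material_id: Optional[str] = None  # unused until a specimen table exists


def sequences_for_taxon(taxon_id: str,
                        records: List[SequenceRecord]) -> List[SequenceRecord]:
    """Return all sequences attached directly to the given taxon."""
    return [r for r in records if r.taxon_id == taxon_id]
```

With this shape, OTU reference libraries can be shared right away with `material_id` left empty, and specimen-backed barcodes can fill it in later without changing the entity.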
@thomasstjerne do you already have an idea about suitable fields?
In principle it is MIxS; most important:
Actually, we didn't include the lat_lon field in the extension, as it was intended for Occurrences / Specimens. MIxS seems to be a standard for sequence metadata, but it collapses some things, like primers and lat_lon, into single fields, which is probably not a good idea.
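To illustrate what a collapsed field costs consumers, here is a small sketch that splits a combined `lat_lon` string back into two numbers. It assumes the value is two whitespace-separated decimal-degree numbers with latitude first; variants with hemisphere letters or other separators are not handled, and `split_lat_lon` is a hypothetical helper name.

```python
from typing import Tuple


def split_lat_lon(value: str) -> Tuple[float, float]:
    """Split a combined 'lat lon' string into (latitude, longitude).

    Assumes two whitespace-separated decimal-degree numbers, latitude first.
    """
    parts = value.split()
    if len(parts) != 2:
        raise ValueError(f"expected two numbers, got {value!r}")
    lat, lon = float(parts[0]), float(parts[1])
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return lat, lon
```

Every consumer of a combined field needs parsing and validation like this, which separate latitude and longitude fields would avoid.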
Closing this in favor of the older duplicate #12.
For reference libraries and barcoding we need to share actual sequence data. Discuss how best to do it and whether specimen entities are needed. Is a single sequence always a specimen, so that the two are the same?