Consider to remove pure annotation authorships #890

Open
mdoering opened this issue Jan 6, 2025 · 11 comments
@mdoering
Member

mdoering commented Jan 6, 2025

COL contains some authorships which are made up from just remarks.
Even if a GSD supplies these I would think we should ask them to clean up their names or otherwise remove them ourselves with decisions.

There are hundreds of synonyms which contain the accepted name in their authorship

Others:

@mdoering mdoering added the feedback User feedback label Jan 6, 2025
@mdoering
Member Author

mdoering commented Jan 6, 2025

@dhobern what do you think as TG?

@yroskov

yroskov commented Jan 6, 2025

@bart-v (happy New Year!), what do you think about this? (It seems many of these names came from the WoRMS World Polychaeta Database.)

@yroskov

yroskov commented Jan 6, 2025

My personal opinion is that comments in authorstrings always contain important information for taxonomists about the use of a name. If a taxonomist (as the author of a publication/project) decides to add a comment to an authorstring, they want it to be inseparable from the scientific name (even if there is a separate "Comment" field in the database). As an editor, I respect the practices chosen by our authors.

In a practical sense, I have no capacity to control and modify authorstrings in the CoL. However, I have no objection if authorstrings are corrected in the GSD projects.

@bart-v

bart-v commented Jan 6, 2025

I have already raised this with our editors a few times and told them that it is bad practice.
They are mostly aware of it, but they keep doing it for a reason: it's basically a display problem. They want to see immediately that there is a problem with the name, without having to click through to the details, and they solve that by abusing the authority field.

If we want to properly fix it, we need to adapt the display (in WoRMS) in some way.
So we need a bit of time to think how to properly solve it.
Then we can ask editors to fix & clean entries.
I'm not a big fan of COL making these changes...

(Yuri, Markus, Donald, best wishes for 2025! )

@mdoering
Member Author

mdoering commented Jan 6, 2025

Thanks @bart-v, best wishes too!

I fully agree it's always best to have the authors provide the authorship for a name and it is good to know you also think this is bad practice. Let's hope it will disappear in the future.

In general though I think COL has the responsibility of doing at least some basic QC. The authorship field is a pretty important field and not a free text field where you can enter any editorial remark.
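As a rough illustration of what such basic QC could look like, here is a minimal sketch that flags authorship strings which contain the accepted name or a remark-like keyword. The keyword list, the function name `authorship_issues` and the example strings are all assumptions made up for illustration; they are not taken from COL or from any GSD.

```python
import re
from typing import List, Optional

# Hypothetical list of remark-like words; a real list would have to be
# derived from the authorship strings actually found in the data.
REMARK_WORDS = re.compile(
    r"\b(sensu|auct|non|misapplied|unavailable|nomen|dubium|uncertain)\b",
    re.IGNORECASE,
)


def authorship_issues(authorship: str, accepted_name: Optional[str] = None) -> List[str]:
    """Return a list of reasons why an authorship string looks suspicious."""
    issues = []
    text = authorship.strip()
    # Case 1: the authorship of a synonym repeats the accepted name.
    if accepted_name and accepted_name.lower() in text.lower():
        issues.append("contains accepted name")
    # Case 2: the authorship carries an editorial remark keyword.
    if REMARK_WORDS.search(text):
        issues.append("contains remark keyword")
    return issues


# Invented example strings, for illustration only.
print(authorship_issues("Linnaeus, 1758"))
print(authorship_issues("accepted as Aus bus (Smith, 1900)", "Aus bus"))
print(authorship_issues("nomen dubium"))
```

A check like this can only flag candidates for review. Deciding whether to fix, keep or remove a flagged authorship remains an editorial decision.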

@dhobern

dhobern commented Jan 6, 2025

Lots of thoughts here, probably none very helpful.

The most essential question is probably why we are producing COL in the first place. I believe our primary use case is to provide the digital reference that users need for interpreting scientific names, either as human readers or as software. The most important part of this is to provide unambiguous information on whether a name is recognised, whether it is the accepted name for a taxon or a name that refers to a taxon with an accepted name, and where this taxon fits in the tree of life.

From this perspective, name usages fall into several buckets:

  1. Actual published scientific names with types and publications we can reference
  2. Names that have been published but that are not formal scientific names (aberrations, etc.)
  3. Misspellings of actual scientific names
  4. Misapplications because someone used a scientific name to refer to something other than the original intended taxon
  5. Various scientific-name-like strings that may be of interest and even useful but that have not been formally published (manuscript names, placeholder names for undescribed species in field guides, etc.)

COL should aim to include every name under 1 - if we can do this, we have achieved something magnificent.

Different communities may also want to record any or all of 2-5, and there should be no harm in including these in GSDs, but we need clarity at all points on what they represent. If these are not clearly marked as different from 1, we are in trouble. These then undercut the functionality for our main user base. Those that are interested in these may only rarely use COL, since interest in each of these categories is likely to be limited to taxonomists and others working with the literature and specimens for a given group. Names that represent temporary labels for undescribed species may be of wider interest, but these are rarely sufficiently standardised to work reliably as part of the naming ecosystem over time.

In my opinion, leaving misapplications unflagged or poorly separated is a big mistake. It can be useful to know that two species have been confused by experts, but it is disastrous if the data suggests that the name itself is ambiguous and impossible to resolve to a single species. LepIndex is/was full of historical misspellings, misapplications, aberrations and other junk, most of them presented as binomials with authorship derived from whatever paper the misspelling, misuse, etc. occurred in. These are just pollutants that make the whole dataset less useful. I have been purging the vast majority of these and only retaining significant misspellings where these are likely to be an issue to other users.

So, my feeling is that @yroskov is correct and contributors should be encouraged to record exactly what they need/want to record as qualifiers for the name, but that we really need to help all contributors make sure any names in categories 2-5 are well marked and can be excluded from downstream products. COL itself should have a way to exclude them from downloads and via the API.

None of this answers the question asked. I think we should do the following:

  1. Produce a factsheet on how to use the COLDP standard (and equivalents) to represent these various cases where the contributors want to include them.
  2. Message all contributors that seem to have such names asking them to work with us to make sure they are properly marked.
  3. Explain that we will start marking these automatically with some status like "misapplied" if the contributor cannot address the issue - we may need a more generic catch-all status.

I've suggested before that we should start sending regular (e.g. quarterly) emails to all contributors with news, tips, etc. This could be a good thing to highlight in such a medium.

@yroskov

yroskov commented Jan 7, 2025

I 100% agree with @dhobern that the challenge is to provide unambiguous information on whether a name is recognised, whether it is the accepted name for a taxon or a name that refers to a taxon with an accepted name, and where this taxon fits in the tree of life. (Perfectly formulated!)

However, in my mind, name usages in all five listed buckets are primary tasks for GSDs (including 2-5). For example, only true experts in taxonomy of the group are able to recognize, resolve and reflect misapplications in the checklist.
CoL itself is unable to produce a scrutinized checklist based on taxonomic knowledge without experts (taxonomists). But CoL, as a publisher/aggregator, should accommodate and correctly re-publish concepts with names from all 5 "buckets".

Unfortunately, I am pessimistic that the task can be facilitated through regular letters to GSDs. Taxonomists do not need our instructions on how to do their job. They need the funds and a new generation of their successors in GSD projects.

@mdoering
Member Author

mdoering commented Jan 7, 2025

The status field mostly defines those categories above, together with the nomenclatural status in some cases.

I don't think we are discussing the main question of the issue here though:
a) does COL accept just anything inside the authorship field for names?
b) if not, should COL try to clean up messy authorships in whatever way?

I believe COL has the mission to do this. In my view COL should aim to provide a list of all names as consistently as we can. It will be impossible to achieve 100%, but we should strive for consistent name syntax (which luckily the codes mostly demand already), authorships, ranks, reference citations and distributions if we consider them to be relevant. For many things we have enumerations to which we map loose text values. In an ideal world we would also link to author records via identifiers instead of having authorship strings. That would clearly not allow editorial remarks...

@dhobern

dhobern commented Jan 7, 2025

My point was that COL should perhaps handle this by policing the status field. If names have non-standard authorship that represents one of these other categories but are flagged just as accepted or synonym, I think we should push a different status on them and exclude them from the cleanest views of COL. In other words, we should have a quarantining approach in relation to the main COL product.

@yroskov

yroskov commented Jan 8, 2025

Does CoL have the resources to monitor quarantine?

@dhobern

dhobern commented Jan 8, 2025

By "quarantine", I just meant automatically excluding non-standard names from the public product. This could be automated, as could notifying the contributor that these names have been excluded and what can be done to make them part of the main product again.
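To make the idea concrete, the automated exclusion and notification could be sketched roughly as follows. This is only an illustration under assumptions: the `NameUsage` record, its fields, the `flagged` marker (set by some upstream authorship check) and the set of "standard" statuses are all hypothetical and do not reflect the actual COL data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical set of statuses allowed in the main product.
STANDARD_STATUSES = {"accepted", "synonym"}


@dataclass
class NameUsage:
    id: str
    dataset: str          # contributing source, e.g. a GSD (hypothetical field)
    scientific_name: str
    status: str
    flagged: bool = False  # True if an upstream check found non-standard authorship


def quarantine(usages):
    """Split usages into the public list and a per-dataset quarantine."""
    public, held = [], defaultdict(list)
    for u in usages:
        if u.flagged or u.status not in STANDARD_STATUSES:
            held[u.dataset].append(u)
        else:
            public.append(u)
    return public, dict(held)


def notice(dataset, held):
    """Build a plain-text notification for one contributor."""
    lines = [f"{len(held)} name(s) from {dataset} were excluded from the main product:"]
    lines += [f"  - {u.id}: {u.scientific_name} ({u.status})" for u in held]
    return "\n".join(lines)


# Invented records, for illustration only.
usages = [
    NameUsage("1", "WoRMS", "Aus bus", "accepted"),
    NameUsage("2", "WoRMS", "Aus cus", "synonym", flagged=True),
    NameUsage("3", "LepIndex", "Aus dus", "misapplied"),
]
public, held = quarantine(usages)
for dataset, items in held.items():
    print(notice(dataset, items))
```

Keeping the quarantined records grouped by dataset means one notification per contributor can be generated on each release without manual monitoring.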
