Functionality to match taxa in a checklist to taxa of the GBIF backbone #746

Open
damianooldoni opened this issue Aug 28, 2024 · 5 comments


@damianooldoni
Collaborator

Hi 👋

While discussing with some GBIF users at my institute, I came to the conclusion that retrieving the (accepted) taxon keys from the GBIF Backbone for a specific checklist is something we do many, many times, and that I am probably not the only one who has written some code for it. GBIF doesn't allow retrieving all occurrences linked to a checklist (instead of the backbone), so having a function to do it would be useful.

I think such functionality would really fit within the rgbif package. What do you think about it? In the link to my code I split the functionality into two functions (my gist was written as a workflow), but I think it's more convenient to offer it as one function where users can decide whether to return the matched taxa (as they are) or link directly to the accepted ones if some synonyms occur.

@jhnwllr
Collaborator

jhnwllr commented Aug 29, 2024

Hi @damianooldoni

I would be a little hesitant to write a custom function for this one task, when downloading occurrences for the taxa in a checklist is only a few lines of code (if this is what you mean...).

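library(rgbif)
library(dplyr)  # provides %>% and pull()

# Match the checklist names to the GBIF backbone, then download occurrences
readr::read_tsv("taxon.txt") %>%
  name_backbone_checklist() %>%
  pull(usageKey) %>%
  pred_in("taxonKey", .) %>%
  occ_download()  # requires GBIF credentials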

Additionally, rgbif will likely start to support the checklistbank api. COL builds from checklistbank should eventually replace the GBIF backbone. Therefore, some name_* functions in rgbif may need to be modified or deprecated in the somewhat near future.

For large checklists, backbone matching (or other matching) could be done using the new tool
https://www.checklistbank.org/tools/name-match-async

I haven't looked into it, but I think this name-match-async tool could eventually be included in rgbif.

@damianooldoni
Collaborator Author

I understand your point. And thanks for the code snippet.

However, my request is slightly different as I meant a GBIF checklist as input, so the main argument would be a datasetKey, for example 79d65658-526c-4c78-9d24-1870d67f8439. In other words, the matching to the GBIF backbone is already present (field nubKey). I am sorry if I was not clear enough in my first comment.

A reprex follows where I extracted the basic functionality from the gist code linked above and tried to give some draft names to function/args just for showing you what I mean exactly.

library(rgbif)
library(dplyr)
library(purrr)

#' `datasetKey`: Unique identifier of a species checklist.
#' `allow_synonyms`: If `FALSE`, the accepted taxa are returned instead of the
#'                   synonyms, if any. Default: `TRUE`.
name_backbone_gbif_checklist <- function(datasetKey, allow_synonyms = TRUE) {
  nub_keys <- rgbif::name_usage(datasetKey = datasetKey, limit = 9999)$data %>%
    dplyr::filter(origin == "SOURCE") %>%
    dplyr::pull(nubKey) %>%
    unique()
  if (allow_synonyms) {
    return(nub_keys)
  } else {
    nub_keys %>%
      purrr::map_df(function(x) rgbif::name_usage(x)$data) %>%
      # Choose the accepted taxa instead of synonyms
      dplyr::mutate(accepted_taxa = dplyr::coalesce(acceptedKey, key)) %>%
      dplyr::pull(accepted_taxa) %>%
      unique()
  }
}

# Get the (unique) taxon keys from the GBIF Backbone
name_backbone_gbif_checklist("79d65658-526c-4c78-9d24-1870d67f8439")
#>  [1] 2480764 2394604 6247411 2437399 5855350 2482499 2486131 2227300 2225776
#> [10] 2226990 5035230 5219681 2350580 2362868 2440946 2394486 2502792 5035017
#> [19] 5219683 2443002 7965247 4284921 5035187 2498252 2433536 2227064 8971201
#> [28] 5218786 2437394 8979506 5219858 4264680 2227000 8909595 1315391 2350570
#> [37] 2440934 2434271 2437450 2434552 5224480 2498305 2427091 2427092 2390064
#> [46] 5712056 8879526 9442269 5217334 2489005 2340977 5824863 5274863 5579439
#> [55] 4033648 5334406 7978544 2870583 5289808 2891770 5329212 3084923 3034825
#> [64] 3170247 8848208 3129663 2865565 3086784 3169169 2704521 3189935 2977647
#> [73] 2977654 5421039 3054399 7748792 2984537 5361762 3642949 2765940 8114276
#> [82] 2955720 3190653 2650436 2706080 2869311 5358460 2882443 2978552 2980328
#> [91] 5420991 2706134 8000520 7287606 2702865 5361785 2984306 6063677
# Get the (unique) ACCEPTED taxon keys from the GBIF Backbone
name_backbone_gbif_checklist(
  datasetKey = "79d65658-526c-4c78-9d24-1870d67f8439",
  allow_synonyms = FALSE
)
#>  [1] 2480764 2394604 6247411 2437399 5855350 2482499 2486131 2227300 2225776
#> [10] 2226990 5035230 5219681 2350580 2362868 2440946 2394486 2502792 5035017
#> [19] 5219683 2443002 7965247 4284921 5035187 2498252 2433536 8971201 5218786
#> [28] 2437394 8979506 5219858 4264680 8909595 1315391 2350570 2440934 2434271
#> [37] 2437450 2434552 5224480 2498305 2427091 2390064 5712056 9442269 5217334
#> [46] 2489005 2340977 5824863 5274863 4033648 7978544 2870583 5289808 2891770
#> [55] 5329212 3084923 3034825 3170247 8848208 3129663 2865565 3086784 3169169
#> [64] 2704521 3189935 2977647 5421039 3054399 2984537 5361762 3642949 2765942
#> [73] 8114276 3190653 2650436 2706080 2869311 5358460 2882443 2978552 5420991
#> [82] 5828232 3628745 7287606 2702865 5361785 2984306 6063677

Created on 2024-08-29 with reprex v2.1.0
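
The synonym-resolution step in the reprex above (take `acceptedKey` where it exists, otherwise keep `key`) can be tried without any API call. What follows is only a minimal base-R sketch under stated assumptions: `resolve_accepted()` is a hypothetical helper, the keys in the toy data frame are invented, and the guess that `acceptedKey` may be missing from the result when a checklist has no synonyms is an assumption, not documented rgbif behaviour.

```r
# Toy stand-in for the `$data` table returned by rgbif::name_usage().
# The keys are invented for illustration; they are not real backbone keys.
usages <- data.frame(
  key         = c(101L, 102L, 103L, 104L),
  acceptedKey = c(NA, 201L, NA, 201L)
)

# Hypothetical helper: map every usage to its accepted taxon key.
resolve_accepted <- function(usages) {
  # Assumption: if no synonyms are present, the `acceptedKey` column may be
  # absent; in that case every key is treated as already accepted.
  if (!"acceptedKey" %in% names(usages)) {
    return(unique(usages$key))
  }
  # Use acceptedKey where it is set, otherwise fall back to key.
  accepted <- ifelse(is.na(usages$acceptedKey), usages$key, usages$acceptedKey)
  unique(accepted)
}

accepted_keys <- resolve_accepted(usages)
print(accepted_keys)
```

The resulting vector could then be passed to `pred_in("taxonKey", accepted_keys)` in the same way as in the download snippet earlier in this thread.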

Indeed, moving to the checklistbank API and moving from the GBIF backbone to COL builds will change quite a lot of things in rgbif, I can imagine. Thanks for the update. From a maintenance point of view, adding a new function would be quite a bad move 😄 Given this, and the fact that my code snippet is in both cases not that long or complex, I think a (section of a) vignette could be enough? Still, the need to retrieve occurrences directly from a GBIF species checklist (not a txt file with some names) is quite present in the GBIF user community. What do you think about it?

Thank you very much for your patience and keep up the great work 💪

@jhnwllr
Collaborator

jhnwllr commented Aug 30, 2024

@damianooldoni ok I see the use case now. I think we can think about something like this once I get some more information about how disruptive the switchover to checklistbank will be for rgbif.

@jhnwllr
Collaborator

jhnwllr commented Aug 30, 2024

@damianooldoni I thought you might find it amusing that I copied and pasted this function to finish something I was doing for work.

@damianooldoni
Collaborator Author

Happy to hear my code was useful, indeed. Long live open source, long live open science.
About the functionality proposed in this issue, as I already mentioned before, I agree to put it "on hold", as the switchover to checklistbank must be THE priority.
