support offline-enabled wikidata taxon matcher #181

Closed
jhpoelen opened this issue Jun 22, 2024 · 6 comments

Comments

@jhpoelen
Member

Related to #146.

@jhpoelen
Member Author

Currently, the Wikidata dump is about 83.5G, which is too large to fit into Zenodo.

Suggest including only items that reference a taxon (https://www.wikidata.org/wiki/Q16521).


@jhpoelen
Member Author

sketch of workflow -

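#!/bin/bash
#
# Streams Wikidata taxon items (or any item mentioning https://www.wikidata.org/wiki/Q16521)
# from the latest data dump as line-delimited JSON (one JSON object per line),
# and writes the bzip2-compressed result to stdout.
#

# download -> decompress -> keep lines mentioning Q16521 (but not, e.g., Q165210)
# -> strip the trailing comma that separates array elements -> recompress
curl --silent "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2" \
 | bunzip2 \
 | grep -E "Q16521[^0-9]" \
 | sed 's/,$//' \
 | bzip2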
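
The filter stage can be tried on a few made-up lines before committing to the full download. A minimal sketch, assuming the dump is a JSON array with one entity per line, each followed by a trailing comma; the sample entities and the function name `filter_taxon_lines` are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper wrapping the grep/sed stages of the workflow sketch.
filter_taxon_lines() {
  grep -E "Q16521[^0-9]" | sed 's/,$//'
}

# Made-up sample: array brackets, one taxon item,
# one look-alike id (Q165210), and one non-taxon item.
printf '%s\n' \
  '[' \
  '{"id":"Q1","claims":{"P31":[{"id":"Q16521"}]}},' \
  '{"id":"Q2","claims":{"P31":[{"id":"Q165210"}]}},' \
  '{"id":"Q3","claims":{"P31":[{"id":"Q5"}]}}' \
  ']' \
 | filter_taxon_lines
```

Only the `Q1` line should survive, without its trailing comma. Note that the pattern keeps any line that mentions Q16521 anywhere, not only items typed as a taxon.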

@jhpoelen
Member Author

jhpoelen commented Jun 24, 2024

hey @Daniel-Mietchen

Would you happen to know how to translate a wikimedia url like

https://commons.wikimedia.org/wiki/File:002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg

into a link that renders the jpg directly?

PS I've dropped indexing the wikidata taxon images until we develop a method to point to an image (or image rendering link) directly.
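
One possible approach, offered as an assumption rather than anything adopted in this thread: Wikimedia Commons has a `Special:FilePath` page that redirects to the underlying file, so a `File:` page URL might be rewritten by swapping the prefix. The helper name `commons_file_url` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: rewrite a Commons "File:" page URL read from stdin
# into a Special:FilePath URL, which redirects to the file itself.
commons_file_url() {
  sed -E 's#^https://commons\.wikimedia\.org/wiki/File:#https://commons.wikimedia.org/wiki/Special:FilePath/#'
}

echo "https://commons.wikimedia.org/wiki/File:002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg" \
 | commons_file_url
```

URLs that do not start with the Commons `File:` prefix pass through unchanged.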

@jhpoelen
Member Author

jhpoelen commented Jun 25, 2024

A first pass at implementing an offline-enabled wikidata taxon matcher -

echo -e "\tElymus repens"\
 | nomer append\
 --include-header wikidata\
 | mlr --itsvlite --oxtab cat

produced -

providedExternalId      
providedName            Elymus repens
relationName            HAS_ACCEPTED_NAME
resolvedExternalId      WD:Q276262
resolvedName            Elymus repens
resolvedAuthorship      
resolvedRank            WD:Q7432
resolvedCommonNames     Gewöhnliche Quecke @de | quackgrass @en | niittyjuola @fi | 偃麦草 @zh
resolvedPath            Spermatophytes | Magnoliophyta | Liliopsida | Commelinidae | Cyperales | Poaceae | Pooideae | Triticeae | Elymus | Elymus repens
resolvedPathIds         WD:Q25814 | WD:Q14562931 | WD:Q1147601 | WD:Q1115272 | WD:Q1860104 | WD:Q43238 | WD:Q4662262 | WD:Q148694 | WD:Q1072892 | WD:Q276262
resolvedPathNames       WD:Q3491997 | WD:Q38348 | WD:Q37517 | WD:Q5867051 | WD:Q36602 | WD:Q35409 | WD:Q164280 | WD:Q227936 | WD:Q34740 | WD:Q7432
resolvedPathAuthorships |  |  |  |  |  |  |  |  |
resolvedExternalUrl     https://www.wikidata.org/wiki/Q276262
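
The pipe-delimited `resolvedPath` and `resolvedPathIds` fields appear to line up position by position. A small sketch that pairs them, using the values copied from the output above; the helper name `pair_path` is hypothetical and the positional correspondence is an assumption:

```shell
#!/bin/bash
# Values copied from the nomer output above.
path='Spermatophytes | Magnoliophyta | Liliopsida | Commelinidae | Cyperales | Poaceae | Pooideae | Triticeae | Elymus | Elymus repens'
ids='WD:Q25814 | WD:Q14562931 | WD:Q1147601 | WD:Q1115272 | WD:Q1860104 | WD:Q43238 | WD:Q4662262 | WD:Q148694 | WD:Q1072892 | WD:Q276262'

# Hypothetical helper: print "id<TAB>name" for each position in the two lists.
pair_path() {
  awk -v names="$1" -v ids="$2" 'BEGIN {
    n = split(names, name, / [|] /)   # split on " | "
    split(ids, id, / [|] /)
    for (i = 1; i <= n; i++) print id[i] "\t" name[i]
  }'
}

pair_path "$path" "$ids"
```

This prints one row per lineage step, from the root of the path down to the matched taxon.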

@jhpoelen
Member Author

Note that non-wikidata identifiers are also supported, if known to wikidata -

e.g.,

echo -e "ITIS:512839"\
 | nomer append --include-header wikidata\
 | mlr --itsvlite --oxtab cat

produced -

providedExternalId      ITIS:512839
relationName            SYNONYM_OF
resolvedExternalId      WD:Q276262
resolvedName            Elymus repens
resolvedAuthorship      
resolvedRank            WD:Q7432
resolvedCommonNames     Gewöhnliche Quecke @de | quackgrass @en | niittyjuola @fi | 偃麦草 @zh
resolvedPath            Spermatophytes | Magnoliophyta | Liliopsida | Commelinidae | Cyperales | Poaceae | Pooideae | Triticeae | Elymus | Elymus repens
resolvedPathIds         WD:Q25814 | WD:Q14562931 | WD:Q1147601 | WD:Q1115272 | WD:Q1860104 | WD:Q43238 | WD:Q4662262 | WD:Q148694 | WD:Q1072892 | WD:Q276262
resolvedPathNames       WD:Q3491997 | WD:Q38348 | WD:Q37517 | WD:Q5867051 | WD:Q36602 | WD:Q35409 | WD:Q164280 | WD:Q227936 | WD:Q34740 | WD:Q7432
resolvedPathAuthorships |  |  |  |  |  |  |  |  |
resolvedExternalUrl     https://www.wikidata.org/wiki/Q276262
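
Both kinds of input can presumably be submitted in one batch. A sketch assuming each input line is an id column followed by a tab and a name column, as inferred from the two examples in this thread; `nomer_row` and `nomer_rows` are hypothetical helpers, and the nomer call only runs when nomer is installed:

```shell
#!/bin/bash
# Hypothetical helper: emit one "id<TAB>name" input row; either field may be empty.
nomer_row() {
  printf '%s\t%s\n' "$1" "$2"
}

# Rows mirroring the two examples in this thread.
nomer_rows() {
  nomer_row "" "Elymus repens"
  nomer_row "ITIS:512839" ""
}

# Run the match only when nomer is available on the PATH.
if command -v nomer >/dev/null 2>&1; then
  nomer_rows | nomer append --include-header wikidata
fi
```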

jhpoelen pushed a commit to globalbioticinteractions/name-alignment-template that referenced this issue Jun 25, 2024
@jhpoelen
Member Author

While working towards addressing a misaligned taxon reported in globalbioticinteractions/globalbioticinteractions#968 by @kbseah, a first version of an offline-enabled Wikidata taxon name alignment matcher was introduced in Nomer v0.5.11.
