Missing comments and abstracts for english for multiple articles #714
It is very interesting that this error also occurs with http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food, because when I run the server locally on my machine, it extracts abstracts for English. So I guess the error on the server could be related to the configuration used by dief.tools.dbpedia.org/server/ (maybe it runs an old version of the extraction framework).
@jlareck can you patch the HTML so that it shows the commit (and optionally the branch) it is using, with a hyperlink to GitHub? I think sometimes the cronjob fails, or the redeploy script is not mature yet. So yes, the service was out of date, but that is hard to recognize. Displaying this simple piece of information at http://dief.tools.dbpedia.org/server/extraction/en/ could really help.
@JJ-Author it is not completely clear to me what I need to do. Do I need to find the commit where this error occurs?
No, just write a commit that prints the current commit hash of the build on the DIEF extractor webpage. You could use something like this: https://github.com/git-commit-id/git-commit-id-maven-plugin.
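A minimal sketch of how the webpage could surface the build's commit hash, assuming the git-commit-id-maven-plugin has been configured to write its default `git.properties` file onto the classpath (the `BuildInfo` class itself is hypothetical; the property keys are the plugin's defaults):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class BuildInfo {
    /**
     * Reads the git.properties file that git-commit-id-maven-plugin generates
     * into the build's classpath and returns a short "commit @ branch" label,
     * or "unknown" if the file is absent (e.g. in a local dev build).
     */
    public static String commitLabel() {
        Properties p = new Properties();
        try (InputStream in = BuildInfo.class.getResourceAsStream("/git.properties")) {
            if (in == null) {
                return "unknown";
            }
            p.load(in);
        } catch (IOException e) {
            return "unknown";
        }
        return p.getProperty("git.commit.id.abbrev", "unknown")
                + " @ " + p.getProperty("git.branch", "unknown");
    }

    public static void main(String[] args) {
        // The server page could render this label with a hyperlink to
        // https://github.com/dbpedia/extraction-framework/commit/<hash>
        System.out.println("Build: " + commitLabel());
    }
}
```

The label could then be rendered in the page footer as a link to the corresponding commit on GitHub.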
By the way, I updated the webservice manually now. But we don't know for sure whether that fixed it, because we can't see which commit it is using.
Oh, I also noticed another thing: maybe the problem was an incorrect use of the API server call, because this URL works fine: http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Eating+your+own+dog+food&revid=&format=trix&extractors=custom . @pkleef could you please check it and say whether this is the expected result? Or maybe I misunderstand what the result on the server should be.
@JJ-Author, as I understand it, I need to add the current commit information to the DIEF server page, with a link to the commit on GitHub. Should I add it somewhere at the top of the page (for example near the ontology) or in the footer?
@jlareck I can confirm that your dief.tools link to the article does show the triples I do not see when loading the 2021-06 Databus snapshot on the http://dbpedia.org/sparql endpoint. My main concern is that the Databus dump was apparently reported as successful, yet for a number of articles the English abstracts and comments were not dumped. The DBpedia team needs to figure out why these comments were missing, as this could be an indication that extraction errors are not properly caught and reported. As a side note for the DIEF tool, I see I used the wrong URL form,
but I was not expecting a Java exception. Would it be possible to add some argument checking and produce a slightly more informative error page?
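A rough sketch of the kind of argument checking the extraction endpoint could do before dispatching, so that a malformed URL yields an HTTP 400 message instead of a raw Java exception (the class and parameter names here are hypothetical, modelled on the `title`/`format` query parameters of the `/extract` URL above):

```java
public class RequestValidator {
    /**
     * Returns null if the request parameters look usable, otherwise a
     * human-readable error message suitable for an HTTP 400 error page.
     */
    public static String validate(String title, String format) {
        if (title == null || title.isBlank()) {
            return "Missing required parameter: title";
        }
        if (format == null || format.isBlank()) {
            return "Missing required parameter: format";
        }
        return null; // parameters look valid
    }

    public static void main(String[] args) {
        // A servlet would call this first and short-circuit with a 400
        // response instead of letting the extractor throw.
        System.out.println(validate(null, "trix"));
        System.out.println(validate("Eating_your_own_dog_food", "trix"));
    }
}
```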
Yes, I think this would make sense, so that we always know whether we are using the latest code, right?
@Vehnem @kurzum maybe it makes sense to have some metrics here? For example the number of abstracts, in total and relative to the total number of entities, so that we can track whether abstracts increase or decrease from release to release? And maybe track this for other artifacts as well, maybe using the VoID mods?
@jlareck do you know whether exception statistics / a summary for the extraction are written in general? I know exceptions are logged. @pkleef my best guess is that the commit used for the 2021-06 extraction did not have the fix yet. @Vehnem @jlareck is there a way to determine the commit hash for a MARVIN extraction now?
@JJ-Author I think exception statistics and a summary are written for each language wikidump separately. So, as I understand it, we can see how many pages were extracted successfully and how many failed, for example after the extraction of the English wikidump.
Well, I found the reason why Eating_your_own_dog_food was not extracted. During the English extraction there were too many requests to the Wikimedia API, and that is why the extraction of this page failed. Here is the error for this page:
This error occurred very often during the June extraction: for the English dump there were 910,000 such exceptions.
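Rather than dropping a page after a single failed API request, the extractor could retry with exponential backoff. A minimal sketch of that idea (the `Retry` helper is hypothetical, not part of DIEF):

```java
import java.util.concurrent.Callable;

public class Retry {
    /**
     * Calls the given task, retrying on any exception with exponentially
     * growing delays (baseDelayMillis, 2x, 4x, ...). Rethrows the last
     * exception if all attempts fail.
     */
    public static <T> T withBackoff(Callable<T> task, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseDelayMillis << attempt); // exponential backoff
            }
        }
        throw last;
    }
}
```

With something like this wrapping the Wikimedia API call, a transient "too many requests" error would cost a short delay instead of losing the page's abstract entirely.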
Marvin and I checked the logs one more time today, and this exception occurred not
The Wikimedia API is heavily used, so maybe there needs to be some kind of request control so that we do not fire too many requests per second. I can imagine that they have a load balancer, so when there is not much load on the system they are gracious. It may also be worth having a look here: https://www.mediawiki.org/wiki/API:Etiquette
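The client-side request control suggested above could be as simple as spacing calls evenly, in the spirit of the API:Etiquette guidelines (serialize requests, keep the rate modest). A minimal sketch, with a hypothetical `ApiThrottle` class:

```java
public class ApiThrottle {
    private final long minIntervalMillis;
    private long lastRequest = 0;

    /** Spaces calls so that at most maxRequestsPerSecond are made. */
    public ApiThrottle(double maxRequestsPerSecond) {
        this.minIntervalMillis = (long) (1000.0 / maxRequestsPerSecond);
    }

    /**
     * Blocks until at least minIntervalMillis has passed since the previous
     * call; invoked once before each Wikimedia API request.
     */
    public synchronized void acquire() throws InterruptedException {
        long now = System.currentTimeMillis();
        long wait = lastRequest + minIntervalMillis - now;
        if (wait > 0) {
            Thread.sleep(wait);
            now = System.currentTimeMillis();
        }
        lastRequest = now;
    }
}
```

A production setup would more likely use a tested implementation (e.g. a token-bucket rate limiter) and also honor the API's `maxlag`/`Retry-After` signals, but the principle is the same.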
As I said, the number of triples is more expressive when compared to the number of articles extracted. But to me it seems reasonable: fewer failed requests -> more triples.
Issue validity
https://dbpedia.org/resource/Eating_your_own_dog_food?lang=*
https://dbpedia.org/resource/Paul_Erd%C5%91s?lang=*
NOTE: http://dief.tools.dbpedia.org/server/extraction/en/Eating_your_own_dog_food returns an error at this time
Error Description
I received several reports of articles with missing English (and possibly other language) triples for dbo:abstract and dbo:comment.
Pinpointing the source of the error
The error occurs on the current 2021-06 snapshot of the Databus dump that is loaded on http://dbpedia.org/sparql