Images dataset contains wrong triples #720
Comments
I have an extensive sample set that we can use to test when this issue is resolved |
@jaygray0919 Could you please send this sample set? It looks like I resolved the issue, but I'm not sure it is completely fixed (at least the produced dataset doesn't contain |
@jlareck try using this: |
@jlareck this also worked well 6 months ago, but is now very slow/unresponsive: |
@jlareck anything we can do to help out here? |
Hi @jaygray0919, thank you for providing this link with examples! I checked some triples in the image dataset for the upcoming release. As far as I can see, some wrong images are no longer extracted, but there are still some triples that contain images unrelated to the wiki page. So the image extractor that produces the data is only partially fixed.
Could you please provide more details about what you want to do? |
The url |
Hi @jaygray0919, sorry, but it looks like we cannot restore an uncorrupted dataset at the moment. The image dataset should have better quality in the upcoming release, but it still contains some wrong triples. I am tracking down those triples now, and we will try to fix image extraction by the next release |
Got it. |
Hi @jaygray0919, could you please check more images on your website to see whether any are still incorrect? It seems to me that I have fixed the image extraction and all images should now be correct. Thank you |
Hello @jlareck - will do; will report back today/tomorrow |
Previous errors that have been corrected: Small problems: I'll look for other errors later today |
Actually, this is the correct image. Check the page https://en.wikipedia.org/wiki/Feredayia_graminosa , this article contains 3 images. I think that if the current version of image extraction extracts all pictures from wikipages, and produces multiple triples with
Regarding this, I think it is one more issue in image extraction that I didn't notice before; this one is related to creating incorrect links to Wikimedia images |
Unfortunately, your (sensible) exception handling is difficult to implement. Returning to the big picture, your corrections seem to handle the glaring issues (biologics like Russian tanks; aircraft; etc.) |
@jlareck good first milestone :-). But can you please write the documentation for the images dataset https://databus.dbpedia.org/dbpedia/generic/images/ and explain what to expect there? @jaygray0919 thanks for testing and finding issues. |
@JJ-Author I'll revisit the SPARQL query, which has some age to it. |
@JJ-Author I made a pull request with the documentation for the image dataset: dbpedia/marvin-config#4 . Could you please check it? |
Issue validity
https://dbpedia.org/page/Borysthenia_goldfussiana
https://dbpedia.org/page/Ingoldiomyces
There are more triples in the DBpedia snapshot 2021-09 that have this issue
Error Description
It looks like ImageExtractorNew produces triples for Wikipedia pages that don't contain images. For example, https://en.wikipedia.org/wiki/Borysthenia_goldfussiana doesn't contain any image, but ImageExtractorNew produced a triple with the image http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg from it. The same issue occurs with the page https://en.wikipedia.org/wiki/Ingoldiomyces: it doesn't contain any picture, but ImageExtractorNew also produced a triple with the image https://upload.wikimedia.org/wikipedia/commons/c/cf/B%26N_nook_Logo.svg
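One way to spot-check a suspicious triple is to compare the file name in the image URL against the file links that actually appear in the page's wikitext. The snippet below is only a rough sketch of that idea in Python, not the logic of ImageExtractorNew. The helper names (`images_in_wikitext`, `image_belongs_to_page`) and the sample wikitext are made up for illustration, and it assumes images are embedded with `[[File:...]]` or `[[Image:...]]` links, so images passed as infobox parameters would be missed.

```python
import re
from urllib.parse import unquote

# Matches [[File:Name.jpg|...]] and [[Image:Name.jpg|...]] links in wikitext.
FILE_LINK = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)", re.IGNORECASE)


def normalize(name):
    """Decode percent-escapes and treat spaces and underscores as equal."""
    return unquote(name).strip().replace(" ", "_")


def images_in_wikitext(wikitext):
    """Return the set of normalized file names linked from the wikitext."""
    return {normalize(name) for name in FILE_LINK.findall(wikitext)}


def image_belongs_to_page(image_url, wikitext):
    """True if the last path segment of image_url is linked in the wikitext."""
    file_name = normalize(image_url.rsplit("/", 1)[-1])
    return file_name in images_in_wikitext(wikitext)


# Hypothetical wikitext of a page without any image (illustration only).
page_without_images = "'''Borysthenia goldfussiana''' is a species of snail."
suspicious = (
    "http://commons.wikimedia.org/wiki/Special:FilePath/"
    "T-72_ATE_South_Africa.jpg"
)
print(image_belongs_to_page(suspicious, page_without_images))
```

A `False` result only means the file is not linked in the given wikitext; it is a hint that the triple deserves a closer look, not proof that it is wrong.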
Pinpointing the source of the error
This error occurs in https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ImageExtractorNew.scala
Details
We must remove triples of this kind.
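As a stopgap until the extractor itself is fixed, the produced dataset could be post-filtered. The following Python sketch assumes the dataset is available as N-Triples lines and that a set of subject IRIs whose source pages really contain images has been computed separately; `drop_unbacked_image_triples`, `pages_with_images`, the `foaf:depiction` predicate, and the sample IRIs are all illustrative assumptions, not part of the extraction framework.

```python
import re

# One N-Triples statement whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def drop_unbacked_image_triples(lines, pages_with_images):
    """Yield lines unchanged, except IRI triples whose subject is not in
    pages_with_images (a hypothetical, separately computed set)."""
    for line in lines:
        match = TRIPLE.match(line)
        if match is None or match.group(1) in pages_with_images:
            yield line


# Illustrative input; the IRIs and predicate are assumptions for this sketch.
sample = [
    "<http://dbpedia.org/resource/Borysthenia_goldfussiana> "
    "<http://xmlns.com/foaf/0.1/depiction> "
    "<http://commons.wikimedia.org/wiki/Special:FilePath/T-72_ATE_South_Africa.jpg> .",
    "<http://dbpedia.org/resource/Feredayia_graminosa> "
    "<http://xmlns.com/foaf/0.1/depiction> "
    "<http://commons.wikimedia.org/wiki/Special:FilePath/Example.jpg> .",
]
allowed = {"http://dbpedia.org/resource/Feredayia_graminosa"}
for kept_line in drop_unbacked_image_triples(sample, allowed):
    print(kept_line)
```

Lines that do not parse as an IRI-only triple (comments, blank lines) are passed through untouched, so the filter only removes statements it can positively attribute to a page without images.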