"Bad IRI" and "Illegal character in IRI" across latest-core collection. #723
@donpellegrino I am transferring this issue to https://github.com/dbpedia/extraction-framework/issues
Hi @donpellegrino,
There is a lot of variation here, and there is no such thing as "all" syntax checks. About a year ago we built this parser, which uses Jena 3.13.1: https://github.com/dbpedia/databus-derive We also publish the parse logs here: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/ I think that https://github.com/dbpedia/databus-derive/blob/master/src/main/java/org/dbpedia/databus/derive/io/rdf/NoErrorProfile.java is the exact parser profile we use to configure Jena.
I looked at http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/2021.09.01/article-templates_lang=en_debug.txt.bz2 and it seems that we need to update ICU, the Unicode library: most of the problems are caused by new emojis. Beyond that, the result of riot --validate depends heavily on the Jena version you are using. I tested with rapper/libraptor and no errors were found in 2021.12.01.
Looking at this, @Vehnem: the parse logs after 09.2021 are missing: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/
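To illustrate why an outdated Unicode library could flag newly added emojis, here is a minimal Python sketch. It is not Jena's or ICU's actual check, only an analogy: it lists the characters in a string that the local Unicode database still treats as unassigned (general category "Cn"). The function name and the example.org IRI are hypothetical, and U+0378 stands in for a code point the library does not know.

```python
import unicodedata


def unassigned_code_points(text):
    """Return (char, 'U+XXXX') pairs for characters whose general
    category is 'Cn' (unassigned) in the local Unicode database."""
    return [
        (ch, f"U+{ord(ch):04X}")
        for ch in text
        if unicodedata.category(ch) == "Cn"
    ]


if __name__ == "__main__":
    # The result depends on the Unicode version bundled with the runtime,
    # which is the point: an older database knows fewer code points.
    print("Unicode database version:", unicodedata.unidata_version)

    # Hypothetical IRI; U+0378 is used here as an unassigned code point.
    sample = "http://example.org/resource/Caf\u00e9_\u0378"
    for _, code_point in unassigned_code_points(sample):
        print("unassigned:", code_point)
```

A character that is assigned in a newer Unicode version but missing from the bundled tables would show up in the same way, which matches the observation that results vary with the library and Jena version in use.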
I used Jena version 3.17.0:
For the Unicode interpretation, I am not sure if that comes from Jena directly or would depend on the underlying Java implementation. For my original report, I was running it with Oracle Java 1.8.0_291-b10:
The locale is UTF-8:
Switching to OpenJDK 11.0.13:
OpenJDK 11.0.13 also gives the warnings:
Hi, I will check it this week.
The latest-core collection at https://databus.dbpedia.org/dbpedia/collections/latest-core as downloaded on January 28, 2022 has many "Bad IRI" and "Illegal character in IRI" issues across the data, as reported by Apache Jena's riot --validate command. For example:
article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : 106,372.54 per second : 0 errors : 28,718 warnings
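To compare files across the collection, a summary line like the one quoted in this report can be turned into numbers. The following is a minimal Python sketch that assumes every summary line has exactly the shape shown in this report; the regular expression and the function name parse_summary are illustrative and not part of Jena.

```python
import re

# Pattern modelled on the single summary line quoted in this issue.
# Other Jena versions may print a different layout (assumption).
SUMMARY = re.compile(
    r"^(?P<file>\S+) : (?P<seconds>[\d.,]+) sec : (?P<triples>[\d,]+) Triples : "
    r"(?P<rate>[\d.,]+) per second : (?P<errors>[\d,]+) errors : "
    r"(?P<warnings>[\d,]+) warnings$"
)


def parse_summary(line):
    """Parse one summary line into a dict, or return None if it does not match."""
    match = SUMMARY.match(line.strip())
    if match is None:
        return None

    def number(text):
        # Drop thousands separators before converting.
        return text.replace(",", "")

    return {
        "file": match["file"],
        "seconds": float(number(match["seconds"])),
        "triples": int(number(match["triples"])),
        "errors": int(number(match["errors"])),
        "warnings": int(number(match["warnings"])),
    }


if __name__ == "__main__":
    line = (
        "article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : "
        "106,372.54 per second : 0 errors : 28,718 warnings"
    )
    print(parse_summary(line))
```

Collecting these dictionaries for every file in the collection would make it easy to sort by warning count and see which artifacts are affected most.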
It would be more robust to ensure the published triples pass all syntax checks.
References:
https://jena.apache.org/documentation/io/