"Bad IRI" and "Illegal character in IRI" across latest-core collection. #723

donpellegrino · 2022-01-28T16:21:44Z

The latest-core collection at https://databus.dbpedia.org/dbpedia/collections/latest-core, as downloaded on January 28, 2022, has many "Bad IRI" and "Illegal character in IRI" issues across the data, as reported by Apache Jena's riot --validate command. For example:

article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : 106,372.54 per second : 0 errors : 28,718 warnings

It would be more robust to ensure the published triples pass all syntax checks.
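
To see how the warnings split across message types, the riot log can be tallied. The following is a minimal sketch, not part of any DBpedia or Jena tooling: it assumes WARN lines shaped like the ones riot prints for this file, and the function name `tally_warnings` and the abbreviated sample IRIs are illustrative only.

```python
import re
from collections import Counter

# Matches riot WARN lines of the shape "... WARN  riot  :: [line: N, col: M] message".
# The pattern is an assumption based on the log lines quoted in this thread.
WARN_RE = re.compile(r"WARN\s+riot\s+::\s+\[line:\s*(\d+),\s*col:\s*(\d+)\]\s+(.*)")


def tally_warnings(lines):
    """Count warnings by message text, i.e. everything before the first '(' or ':'."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.search(line)
        if match is None:
            continue  # summary lines and other output are ignored
        message = re.split(r"[(:]", match.group(3), maxsplit=1)[0].strip()
        counts[message] += 1
    return counts


# Sample input: two warning lines (IRIs abbreviated) and the summary line.
sample = [
    "09:02:11 WARN  riot            :: [line: 50422047, col: 35] "
    "Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/...>",
    "09:02:11 WARN  riot            :: [line: 50422047, col: 36] "
    "Illegal character in IRI (Not a ucschar: 0xDD72): <http://dbpedia.org/resource/...>",
    "article-templates_lang=en.ttl.bz2 : 474.07 sec : 50,428,351 Triples : "
    "106,372.54 per second : 0 errors : 28,718 warnings",
]

for message, count in tally_warnings(sample).most_common():
    print(f"{count:>8}  {message}")
```

In practice the input would be the captured riot output read line by line instead of the `sample` list.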

References:

https://jena.apache.org/documentation/io/


kurzum · 2022-01-29T08:53:17Z

@donpellegrino I am transferring this issue to https://github.com/dbpedia/extraction-framework/issues

kurzum · 2022-01-29T08:58:42Z

Hi @donpellegrino,

"It would be more robust to ensure the published triples pass all syntax checks."

There is a lot of variation in this, and there is no such thing as "all" syntax checks. About a year ago, we built this parser: https://github.com/dbpedia/databus-derive, which uses Jena 3.13.1.
It is highly parallelized and should be one of the fastest parsers out there, if not the fastest. Beyond parsing, it also writes quite detailed parselogs and logs all malformed triples.

We also publish the parselogs here: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/
Back then, we filed a bug report about a warning in Jena, and they updated their parser in version 3.13.1 specifically for us.

I think that https://github.com/dbpedia/databus-derive/blob/master/src/main/java/org/dbpedia/databus/derive/io/rdf/NoErrorProfile.java is the exact parser profile we use to configure Jena.

I looked at http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/2021.09.01/article-templates_lang=en_debug.txt.bz2 and it seems that we need to update ICU, which is the Unicode library. Most of the problems are caused by new emojis.

Also, the result of riot --validate depends heavily on the Jena version you are using. I tested with rapper/libraptor, and no errors were found in 2021.12.01:

rapper -i ntriples article-templates_lang\=en.ttl -c 
rapper: Parsing URI file:///home/kurzum/Downloads/article-templates_lang=en.ttl with parser ntriples
rapper: Parsing returned 50428351 triples

Looking at "0 errors : 28,718 warnings", this seems to be the NFKC Unicode warning that was fixed in Jena. @donpellegrino, could you post the Jena version and, if possible, more detailed information?
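
For readers unfamiliar with NFKC: it is a Unicode normalization form, and a string is "in NFKC" when normalizing it changes nothing. The sketch below shows one way to test a single IRI string with Python's standard `unicodedata` module. It is not how Jena performs the check; the helper name `is_nfkc` and the example.org IRI are made up for illustration.

```python
import unicodedata


def is_nfkc(text):
    """Return True if text is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", text) == text


# A plain ASCII IRI is already in NFKC.
print(is_nfkc("http://dbpedia.org/resource/Berlin"))

# U+FB01 (LATIN SMALL LIGATURE FI) has a compatibility decomposition to "fi",
# so an IRI containing it is not in NFKC. The IRI itself is hypothetical.
print(is_nfkc("http://example.org/\ufb01le"))
```

A check like this only looks at normalization; it says nothing about whether the characters are otherwise permitted in an IRI.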

@Vehnem parselogs after 09.2021 are missing: http://dbpedia-mappings.tib.eu/parse-reports/generic/article-templates/

donpellegrino · 2022-01-31T14:06:02Z

I used Jena version 3.17.0:

> riot --version
Jena:       VERSION: 3.17.0
Jena:       BUILD_DATE: 2020-11-25T19:40:23+0000

For the Unicode interpretation, I am not sure whether that comes from Jena directly or depends on the underlying Java implementation. For my original report, I was running it with Oracle Java 1.8.0_291-b10:

> java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)

The locale is UTF-8:

> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Switching to OpenJDK 11.0.13:

> java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-suse-3.65.1-x8664)
OpenJDK 64-Bit Server VM (build 11.0.13+8-suse-3.65.1-x8664, mixed mode)

OpenJDK 11.0.13 also gives the warnings:

> riot --validate --time article-templates_lang\=en.ttl.bz2
<snip>
09:02:11 WARN  riot            :: [line: 50422047, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝅘𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422047, col: 36] Illegal character in IRI (Not a ucschar: 0xDD72): <http://dbpedia.org/resource/𝅘𝅥?[U+DD72]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 31] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 32] Illegal character in IRI (Not a ucschar: 0xDDBA): <http://dbpedia.org/resource/?[U+DDBA]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 33] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 34] Illegal character in IRI (Not a ucschar: 0xDD65): <http://dbpedia.org/resource/𝆺?[U+DD65]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 35] Illegal character in IRI (Not a ucschar: 0xD834): <http://dbpedia.org/resource/𝆺𝅥[U+D834]...>
09:02:11 WARN  riot            :: [line: 50422048, col: 36] Illegal character in IRI (Not a ucschar: 0xDD6F): <http://dbpedia.org/resource/𝆺𝅥?[U+DD6F]...>
article-templates_lang=en.ttl.bz2 : 593.94 sec : 50,428,351 Triples : 84,904.79 per second : 0 errors : 28,718 warnings
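
The values in these warnings (0xD834 followed by 0xDD72, 0xDDBA, 0xDD65, 0xDD6F) are UTF-16 surrogate halves, each reported on its own. The sketch below combines each pair into its code point and tests it against the `ucschar` ranges of RFC 3987. It is only an illustration: the helper names are made up, and it does not reproduce Jena's checker. If the combined code points are valid `ucschar` values while the halves are not, that would be consistent with a checker inspecting UTF-16 code units one at a time, although the thread does not confirm this.

```python
# ucschar ranges from RFC 3987, section 2.2 (inclusive bounds).
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_ucschar(code_point):
    """Return True if the code point falls in an RFC 3987 ucschar range."""
    return any(lo <= code_point <= hi for lo, hi in UCSCHAR_RANGES)


# Surrogate pairs taken from the warnings above.
pairs = [(0xD834, 0xDD72), (0xD834, 0xDDBA), (0xD834, 0xDD65), (0xD834, 0xDD6F)]

for high, low in pairs:
    cp = combine_surrogates(high, low)
    print(
        f"0x{high:04X} 0x{low:04X} -> U+{cp:04X} "
        f"ucschar={is_ucschar(cp)} "
        f"(halves alone: {is_ucschar(high)}, {is_ucschar(low)})"
    )
```

The first pair combines to U+1D172, which lies in the Musical Symbols block and inside the `%x10000-1FFFD` range of `ucschar`.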

Vehnem · 2022-02-07T14:38:49Z

Hi, I will check it this week.
The issue seems valid. The RDF pruning/validation process seems to have failed (or was not working correctly).

kurzum transferred this issue from dbpedia/databus-maven-plugin Jan 29, 2022

jlareck added the status: triage-discussion-needed label Feb 6, 2022

Vehnem added the status: accepted label Feb 7, 2022

Vehnem self-assigned this Feb 7, 2022

jlareck added the status: fix-required and status: minidump-test-required labels and removed the status: triage-discussion-needed label Feb 21, 2022

jeswr mentioned this issue May 15, 2023 in "2022-03 snapshot contains invalid IRIs with unescaped double quotes" #751 (Open)
