Info box property having both resource and string literal #713
Comments
Yes, it should be filtered out in post-processing, but I think there is also an issue in the extraction process. Even when there is a Person in the property value, the extractor trims the rest of the property value as soon as it finds a resource in it.
Hm, indeed, I think 2 triples should be extracted. Just to make sure, I had a look at the post-processing with this pretty neat SPARQL query to check whether one of the triples was pruned, which seems not to be the case.
Nice, the example result looks good to me. I wonder why this is language-specific. Is there a need to fix this for other languages as well?
I only found instances in English where single quotation marks ‘’ are used as separators when a title is written as a prefix before/after the actual name, so I think currently only the English chapter needs these changes.
I have checked the post-processing file typeconsistencycheck.scala. Currently, non-disjoint triples are also stored in the regular dataset, even when there is a domain/range violation. Because of this, many instances have spouse and other family relations whose values are Agent rather than Person. As this is clearly a range violation, these triples should be stored in a separate dataset, not the regular one.
Also, please answer this: https://forum.dbpedia.org/t/dbpedia-post-processing-for-adhoc-extraction/1412
The relevant post-processing step is called here: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/blob/master/functions.sh#L62. Yeah, I see the problem for Bates_family. I think the underlying problem that led to this potentially confusing decision is that types are sometimes not specific enough when extracted (e.g. sometimes Actors are only Persons instead of Actors). So filtering out these triples with an "exact match" on domain/range could produce many false positives and lead to many useful triples being filtered out.
But I think it makes sense to make progress here and create more fine-grained files, not only a disjoint one. Maybe it makes sense to have one file per distance in the class hierarchy tree (so distance zero means an exact match, 1 means the entity only has the type of the parent class). In the end we can still decide whether we load them, but at least they are separated.
Potentially more important is to also cover the case where the entity does not actually have a type. That is the case for Burmese_name, which only has owl:Thing (see here). No range will ever be disjoint with owl:Thing. Any thoughts on that?
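The "one file per distance in the class hierarchy" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the logic of typeconsistencycheck.scala: the `PARENT` map is a toy hierarchy, and `range_distance`, `split_by_distance`, and the `ex:` entities are hypothetical names. Real disjointness axioms are not modelled; types that are neither above nor below the range are simply reported as "unrelated".

```python
from collections import defaultdict

# Toy subclass -> superclass map (illustrative, NOT the real DBpedia ontology).
PARENT = {
    "Actor": "Artist",
    "Artist": "Person",
    "Person": "Agent",
    "Family": "Agent",
    "Agent": "owl:Thing",
}

def ancestors(cls):
    """Return [cls, parent, grandparent, ...] up to the root of the toy hierarchy."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def range_distance(object_type, expected_range):
    """Classify an object's type against a property's range.

    0           -> the type is the range itself or a subclass of it
    k > 0       -> the type is the k-th superclass of the range (too generic)
    "untyped"   -> no type, or only owl:Thing
    "unrelated" -> neither above nor below the range in the hierarchy
    """
    if object_type is None or object_type == "owl:Thing":
        return "untyped"
    if expected_range in ancestors(object_type):
        return 0
    up = ancestors(expected_range)
    if object_type in up:
        return up.index(object_type)
    return "unrelated"

def split_by_distance(triples, types, ranges):
    """Group triples into buckets keyed by range_distance (one bucket per output file)."""
    buckets = defaultdict(list)
    for s, p, o in triples:
        buckets[range_distance(types.get(o), ranges[p])].append((s, p, o))
    return dict(buckets)

# Hypothetical sample data.
TYPES = {"ex:B": "Actor", "ex:C": "Agent", "ex:D": "Family", "ex:E": "owl:Thing"}
RANGES = {"spouse": "Person"}
TRIPLES = [("ex:A", "spouse", o) for o in ("ex:B", "ex:C", "ex:D", "ex:E")]
BUCKETS = split_by_distance(TRIPLES, TYPES, RANGES)
```

With this split, the untyped bucket is kept apart from the others, so the decision whether to load each bucket can still be made afterwards.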
I have created a SPARQL query to get instances where the spouse and a parent are the same. It returned 44 results, which are below:
After making two changes, I tested the above query again and the incorrect results are reduced to only 4: 3 of them are due to wrong classification of the entity by some extractor, and the reason for the remaining 1 is unknown. Remaining 4 incorrect records:
If we filter out the entities that have no type, a lot of instances will be removed, as the coverage of the Wikipedia-to-DBpedia mappings is still very low. Most of the above-mentioned issues are due to type association, which produces incorrect results when queried. I therefore propose introducing an extraction_score property for each entity, based on how its type is inferred. The following could be the step-by-step approach:
The score value can be adjusted after further analysis of the type inference process. What do you think of it?
I think creating and populating these more fine-grained datasets that you posted in the image is a good first approach; then we can see what happens at large scale for an entire extraction. The idea with a triple is interesting, but to be fair I don't know about the types inferred by the NIF extractor. So at the moment I think we would only have 1 and 0 as output (where zero means no type other than owl:Thing). So it should already be possible to query this at the moment, IIRC?
@mubashar1199 I really like the query checking that nobody should be their own parent / own spouse. We should add this as a general test. @Vehnem do we have large-scale plausibility SHACL tests in place already?
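As a rough illustration of such a plausibility test, the two checks mentioned in this thread (nobody is their own spouse or parent, and nobody's spouse is also their parent) can be written over plain triples. This is a sketch, not the SPARQL query or a SHACL shape from the thread: the function names, the bare property names `spouse`/`parent`, and the `ex:` entities are assumptions made for the example.

```python
from collections import defaultdict

def self_references(triples, properties=("spouse", "parent")):
    """Triples where an entity is related to itself via one of the given properties."""
    return [(s, p, o) for s, p, o in triples if p in properties and s == o]

def spouse_also_parent(triples):
    """Map each subject to the objects listed as both its spouse and its parent."""
    spouses, parents = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == "spouse":
            spouses[s].add(o)
        elif p == "parent":
            parents[s].add(o)
    return {s: spouses[s] & parents[s] for s in spouses if spouses[s] & parents[s]}

# Hypothetical sample data: ex:A has the same spouse and parent, ex:C is its own spouse.
SAMPLE = [
    ("ex:A", "spouse", "ex:B"),
    ("ex:A", "parent", "ex:B"),
    ("ex:C", "spouse", "ex:C"),
    ("ex:D", "parent", "ex:E"),
]
```

Both functions return the offending records rather than a boolean, so a test run can report which entities to inspect.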
In a Wikipedia infobox, when a property value contains both a resource and a string literal, the DBpedia framework skips the literal part, which results in incorrect data being populated in DBpedia.
Please see the spouse and parents properties of the entity Aung_Lwin:
Aung_Lwin Wiki profile
Aung_Lwin DBpedia profile
where Daw is a Burmese_name.
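To make the mixed-value case concrete, here is a minimal sketch of one possible handling: when literal text remains next to a wiki link, keep the whole value as a literal in addition to the linked resource, so nothing is trimmed and post-processing can decide what to keep. This is not the DBpedia extraction framework's parser; `LINK`, `extract_objects`, and the sample value `[[Daw]] Example Name` are hypothetical and only stand in for a value of the kind described above.

```python
import re

# Matches [[Target]] and [[Target|display text]] wiki links.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def extract_objects(value):
    """Return ("resource", target) / ("literal", text) pairs for an infobox value.

    Pure link values yield only resources. If word characters remain outside
    the links, the full value (links replaced by their display text) is also
    emitted as a literal, so the non-link part is not silently dropped.
    """
    targets = [m.group(1).strip() for m in LINK.finditer(value)]
    leftover = LINK.sub(" ", value)
    has_literal = re.search(r"\w", leftover) is not None

    objects = [("resource", t) for t in targets]
    if has_literal or not targets:
        plain = LINK.sub(lambda m: m.group(2) or m.group(1), value)
        plain = " ".join(plain.split())  # normalise whitespace
        if plain:
            objects.append(("literal", plain))
    return objects

# Hypothetical value: a linked title followed by an unlinked name.
MIXED = extract_objects("[[Daw]] Example Name")
```

Separators such as commas between links do not count as literal text here, so a value like `[[A]], [[B]]` still yields two resources only.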
Suggestions:
OR
Please assign this issue to me.