Rank / Classify Sources #34

Open
salgo60 opened this issue Oct 30, 2024 · 5 comments

salgo60 commented Oct 30, 2024

See Wikidata_talk:WikiProject_Reference_Verification.

I stated in 2019 that we need to rank sources, see T222142. Wikidata has now been used a lot by the research project "Riksdagens Corpus" (@BobBorges), and we agree that sources like Svenskt Biografiskt Lexikon-ID (P3217) / Svenskt biografiskt lexikon (Q379406) / Tvåkammar-riksdagen 1867–1970 (Q110346241) are very good sources. However, they are just text strings, so using them in Wikidata requires some manual work, see issue #78.

My suggestion: add a ranking value for sources so more people can agree on and understand that e.g. Svenskt Biografiskt Lexikon-ID (P3217) is high quality and has a quality process. I think there was some measurement for prizes, i.e. that receiving the Nobelpriset (Q7191) is ranked higher than receiving prize xxx; see my thoughts from 2019 that prizes could be a way of evaluating research in different countries: "T216409 Nobelprize as part of evaluating research in different countries".

Maybe we can have dashboards showing how different research projects support PROV and use quality sources, to motivate research to move faster in the right direction...

salgo60 changed the title from "Rank (Classify Sources" to "Rank / Classify Sources" on Oct 30, 2024

salgo60 commented Oct 30, 2024

Denny Vrandečić talking about his vision of sources

@BobBorges

It would be really good to rank sources if objective criteria could be applied to the ranking.


salgo60 commented Nov 1, 2024

@BobBorges listen to Denny above; he explains that en:Wikipedia ranks sources. I guess it would be better if the ranking were done by your project and SBL…

I use the Wikidata rank feature and mark wrong facts, e.g. bad precision or "not stated in the birth record"… → in the long run we get a rather good quality measurement. I like the way your project tests your data against external "sources" like Wikidata, but I miss that I don't see SBL in a metadata round-trip ecosystem…

Using Wikidata for handling contradicting sources

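A minimal sketch of how the rank feature mentioned above can be used programmatically (not part of this repository; the property P569, date of birth, and the LIMIT are arbitrary illustration choices): it asks the public Wikidata Query Service for statements marked with deprecated rank together with their "reason for deprecated rank" (P2241) qualifier.

```python
# Minimal sketch: list deprecated date-of-birth statements and the editors'
# stated reason for deprecating them. Requires network access to the public
# Wikidata Query Service; property and LIMIT choices are arbitrary examples.
import requests

QUERY = """
SELECT ?personLabel ?dob ?reasonLabel WHERE {
  ?person p:P569 ?stmt .
  ?stmt ps:P569 ?dob ;
        wikibase:rank wikibase:DeprecatedRank ;
        pq:P2241 ?reason .                      # reason for deprecated rank
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "rank-sources-sketch/0.1 (example)"},
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["dob"]["value"], row["reasonLabel"]["value"])
```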

@albertmeronyo (Member)

Thanks @salgo60 and @BobBorges for the insightful discussion; the quality of references is something we deeply care about.

Let me first just say that ProVe is based on research [1] that takes the quality of sources into account, by comparing the degree to which the textual content of external references supports the verbalisation of Wikidata triples. We only take that as a basis to build a tool (the one in this repo) that could be of use to Wikidata editors. The output classifies sources into several types/boxes/colours, which goes exactly in the direction Denny is pointing at.
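To make that comparison step concrete, here is a hypothetical sketch (not ProVe's actual code or model; the claim, the reference text, and the off-the-shelf MNLI model facebook/bart-large-mnli are illustrative stand-ins): verbalise a triple as a sentence and ask a natural-language-inference model whether the reference text supports it.

```python
# Minimal sketch of reference verification via textual entailment.
# NOT the ProVe pipeline: the model, claim and reference text below are
# illustrative placeholders only.
from transformers import pipeline

# A Wikidata triple already verbalised into natural language (hypothesis).
claim = "Carl Linnaeus was born on 23 May 1707."

# Text retrieved from the external reference cited on the statement (premise).
reference_text = (
    "Linnaeus, the Swedish naturalist, was born at Råshult on 23 May 1707."
)

# Off-the-shelf NLI model used here as a stand-in for a trained verifier.
nli = pipeline("text-classification", model="facebook/bart-large-mnli")

# Sentence-pair input: premise first, hypothesis second.
result = nli({"text": reference_text, "text_pair": claim})
print(result)  # e.g. a label like "entailment" -> the reference supports the triple
```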

That said, I tend to agree with @BobBorges that objective criteria here are a challenging issue. We would be really keen on compiling different 'feelings' and approaches to quality of sources under various perspectives, perhaps by building a dataset that we can use to improve the model behind ProVe.

[1] Amaral, G., Rodrigues, O. and Simperl, E., 2022. ProVe: A pipeline for automated provenance verification of knowledge graphs against textual sources. Semantic Web (Preprint), pp. 1–34.


salgo60 commented Nov 29, 2024

Thanks @albertmeronyo

I recommend delving into the "architecture" behind Wikidata and Denny Vrandečić's vision, particularly the types of research projects that can be undertaken regarding sources; see the video where I pointed out that we need facts with sources, and also metadata about whether we can trust a source.

——

I tend to agree with @BobBorges that objective criteria here are a challenging issue.

I believe one key takeaway from @BobBorges’ project is that:

  1. Each project defines its own trust criteria.
    1. This aligns with the original vision of Linked Data as envisioned by Tim Berners-Lee.
    2. Wikidata appears to lack flexibility in allowing users to explicitly define their own trusted sources, relying instead on a more general approach. All users contribute to editing all objects, requiring collaboration and consensus on what is deemed most trustworthy, while also supporting over 200 languages. This seems like an unsolvable equation; however, the lesson learned is that this approach does bring some value, even though it is far from perfect. Wikidata also remains vulnerable to vandalism and to poorly executed edits made with good intentions, which highlights its fragility in maintaining data integrity; for research data it should be treated as a proof of concept only...
      1. Wikidata's support for handling contradictory sources is something we need to see in research datasets.
  2. Over time, you develop a deeper understanding of the quality of the sources you rely on, which naturally shapes your level of trust in them.
  3. Research projects, however, often lack a "generic" data model that incorporates PROV, which I see as a sign of immaturity in producing reliable, trusted data. This also reflects a missed opportunity to adopt a data-driven approach with the goal of generating high-quality data that can be effectively reused by other research projects.
    1. I also see a lack of clear understanding of the importance of 5-star data and its role in ensuring high-quality, reusable information. I hope lessons learned from using data from Wikidata might inspire a similar approach: leveraging references for facts, effectively managing contradictory information, and assigning persistent identifiers (PIDs) to all sources. These practices could showcase their benefits, and the ability to easily retrieve data using SPARQL could become a best practice for future research projects, enhancing both transparency and reusability (see the sketch after this list).
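A minimal sketch of the practice described in 3.1, assuming nothing beyond the public Wikidata Query Service (the item Q937, Albert Einstein, and the property P569 are arbitrary examples, not project code): each fact can be fetched together with its rank and the references attached to it (stated in, reference URL, retrieval date).

```python
# Minimal sketch: fetch a fact together with its rank and its references,
# showing how provenance travels with the data. Item (Q937) and property
# (P569) are arbitrary examples; requires network access to the WDQS endpoint.
import requests

QUERY = """
SELECT ?dob ?rank ?statedInLabel ?refURL ?retrieved WHERE {
  wd:Q937 p:P569 ?stmt .                      # Q937 = Albert Einstein (example)
  ?stmt ps:P569 ?dob ;
        wikibase:rank ?rank .
  OPTIONAL {
    ?stmt prov:wasDerivedFrom ?ref .
    OPTIONAL { ?ref pr:P248 ?statedIn . }     # stated in
    OPTIONAL { ?ref pr:P854 ?refURL . }       # reference URL
    OPTIONAL { ?ref pr:P813 ?retrieved . }    # retrieved (date)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "rank-sources-sketch/0.1 (example)"},
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print({k: v["value"] for k, v in row.items()})
```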

Looking ahead, as data-driven research becomes more prominent and metadata round-tripping improves, it will become increasingly important to explicitly define the trustworthiness and quality of datasets.

Example of a research project using Wikibase

