request for comment: rate dataset descriptions based on presence of properties #683

Open
coret opened this issue Feb 21, 2023 · 2 comments

coret commented Feb 21, 2023

Rating

To encourage dataset providers to make their dataset descriptions more complete, a rating system is proposed.

The dataset descriptions are rated with 1 to 5 stars, depending on the content (presence of properties) of the dataset description:

  • Each dataset description that has the required license, title and publisher gets a ☆ rating
  • A dataset description that also has a description and distribution gets a ☆☆ rating
  • A dataset description that also has a creator and landingPage gets a ☆☆☆ rating
  • A dataset description that also has at least one date (created, modified/updated or issued/published) gets a ☆☆☆☆ rating
  • A dataset description that also has at least one of language, source, keyword, spatial or temporal gets a ☆☆☆☆☆ rating
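
The rules above can be sketched as a small function. This is only an illustration of the prose rules, not the registry's implementation: the property names are the short names used in the list, and the names REQUIRED, TWO_STAR, THREE_STAR, DATES, CONTEXT and star_rating are made up for this example. It treats the levels as strictly cumulative and returns 0 when the required properties are missing.

```python
# Property groups taken from the rating rules (short names, no prefixes).
REQUIRED = {"license", "title", "publisher"}        # all needed for 1 star
TWO_STAR = {"description", "distribution"}          # all needed for 2 stars
THREE_STAR = {"creator", "landingPage"}             # all needed for 3 stars
DATES = {"created", "modified", "issued"}           # at least one for 4 stars
CONTEXT = {"language", "source", "keyword", "spatial", "temporal"}  # at least one for 5 stars


def star_rating(properties: set[str]) -> int:
    """Return 0-5 stars for the set of property names present on a dataset description."""
    if not REQUIRED <= properties:
        return 0  # required properties missing: no rating at all
    if not TWO_STAR <= properties:
        return 1
    if not THREE_STAR <= properties:
        return 2
    if not DATES & properties:
        return 3
    if not CONTEXT & properties:
        return 4
    return 5


print(star_rating({"license", "title", "publisher", "description", "distribution"}))
```

Because each level builds on the previous one, a description with a keyword but no date stays at three stars in this sketch.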

This method does not (yet):

  • promote multi-language content
  • evaluate the quality of the content (e.g. is the description understandable, does the contentURL of the distribution exist, is it linked data?); the method only checks whether properties are present
  • evaluate all schema:Dataset properties defined in Requirements for datasets, nor any of the schema:DataDownload (distribution) properties

The rating for each of the dataset properties (both schema.org and DCAT):

Schema.org                     DCAT                 Requirement   Rating
schema:license                 dct:license          must          ☆
schema:name                    dct:title            must          ☆
schema:publisher               dct:publisher        must          ☆
schema:description             dct:description      must          ☆☆
schema:distribution            dcat:distribution    must          ☆☆
schema:creator                 dct:creator          must          ☆☆☆
schema:mainEntityOfPage        dcat:landingPage     must          ☆☆☆
schema:dateCreated             dct:created          one-of        ☆☆☆☆
schema:dateModified            dct:modified         one-of        ☆☆☆☆
schema:datePublished           dct:issued           one-of        ☆☆☆☆
schema:inLanguage              dct:language         one-of        ☆☆☆☆☆
schema:isBasedOnUrl            dct:source           one-of        ☆☆☆☆☆
schema:keywords                dcat:keyword         one-of        ☆☆☆☆☆
schema:spatialCoverage         dct:spatial          one-of        ☆☆☆☆☆
schema:temporalCoverage        dct:temporal         one-of        ☆☆☆☆☆
schema:citation                dct:isReferencedBy   not rated
schema:genre                   dct:type             not rated
schema:version                 dct:hasVersion       not rated
schema:includedInDataCatalog   dct:isPartOf         not rated
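
As an illustration of how the two vocabularies line up, the pairs in the table can be written as a lookup from schema.org property names to the DCAT-style names, so that a description in either vocabulary can be rated with the same rules. Prefixes are left out on purpose, and SCHEMA_TO_DCAT and to_dcat_names are hypothetical names for this sketch only.

```python
from typing import Iterable

# Local names only (no prefixes), one entry per row of the table.
SCHEMA_TO_DCAT = {
    "license": "license",
    "name": "title",
    "publisher": "publisher",
    "description": "description",
    "distribution": "distribution",
    "creator": "creator",
    "mainEntityOfPage": "landingPage",
    "dateCreated": "created",
    "dateModified": "modified",
    "datePublished": "issued",
    "inLanguage": "language",
    "isBasedOnUrl": "source",
    "keywords": "keyword",
    "spatialCoverage": "spatial",
    "temporalCoverage": "temporal",
    "citation": "isReferencedBy",
    "genre": "type",
    "version": "hasVersion",
    "includedInDataCatalog": "isPartOf",
}


def to_dcat_names(schema_props: Iterable[str]) -> set[str]:
    """Translate schema.org property names; names not in the table are ignored."""
    return {SCHEMA_TO_DCAT[p] for p in schema_props if p in SCHEMA_TO_DCAT}


print(sorted(to_dcat_names(["name", "mainEntityOfPage", "sameAs"])))
```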

Construction

The rating of a dataset description is stored in a separate graph https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings with the property schema:contentRating.

The graph is constructed using 5 SPARQL INSERT queries, executed from the highest to the lowest rating:

☆☆☆☆☆ rating

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings> {
        ?dataset schema:contentRating "☆☆☆☆☆"
    }
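} WHERE {
    ?dataset a dcat:Dataset ;
             dct:description ?o1 ;
             dcat:distribution ?o2 ;
             dct:creator ?o3 ;
             dcat:landingPage ?o4 .
    # At least one date property (the ☆☆☆☆ condition).
    { ?dataset dct:created ?o5 . }
    UNION
    { ?dataset dct:modified ?o6 . }
    UNION
    { ?dataset dct:issued ?o7 . }
    # And at least one of the ☆☆☆☆☆ properties. This is a separate UNION
    # group, so a date on its own does not qualify for five stars.
    { ?dataset dct:language ?o8 . }
    UNION
    { ?dataset dct:source ?o9 . }
    UNION
    { ?dataset dcat:keyword ?o10 . }
    UNION
    { ?dataset dct:spatial ?o11 . }
    UNION
    { ?dataset dct:temporal ?o12 . }
}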

☆☆☆☆ rating

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings> {
        ?dataset schema:contentRating "☆☆☆☆"
    }
} WHERE {
    ?dataset a dcat:Dataset ;
             dct:description ?o1 ;
             dcat:distribution ?o2 ;
             dct:creator ?o3 ;
             dcat:landingPage ?o4 .
    { ?dataset dct:created ?o5 . }
    UNION
    { ?dataset dct:modified ?o6 . }
    UNION
    { ?dataset dct:issued ?o7 . }
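    # None of the ☆☆☆☆☆ properties may be present.
    FILTER NOT EXISTS { ?dataset dct:language ?o8 . }
    FILTER NOT EXISTS { ?dataset dct:source ?o9 . }
    FILTER NOT EXISTS { ?dataset dcat:keyword ?o10 . }
    FILTER NOT EXISTS { ?dataset dct:spatial ?o12 . }
    FILTER NOT EXISTS { ?dataset dct:temporal ?o13 . }
    # Skip datasets that already received a (higher) rating.
    FILTER NOT EXISTS { ?dataset schema:contentRating ?o11 . }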
}

☆☆☆ rating

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings> {
        ?dataset schema:contentRating "☆☆☆"
    }
} WHERE {
    ?dataset a dcat:Dataset ;
             dct:description ?o1 ;
             dcat:distribution ?o2 ;
             dct:creator ?o3 ;
             dcat:landingPage ?o4 .
    FILTER NOT EXISTS { ?dataset dct:created ?o5 . }
    FILTER NOT EXISTS { ?dataset dct:modified ?o6 . }
    FILTER NOT EXISTS { ?dataset dct:issued ?o7 . }
    FILTER NOT EXISTS { ?dataset schema:contentRating ?o8 . }
}

☆☆ rating

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings> {
        ?dataset schema:contentRating "☆☆"
    }
} WHERE {
    ?dataset a dcat:Dataset ;
             dct:description ?o1 ;
             dcat:distribution ?o2 .
    FILTER NOT EXISTS {
        ?dataset dct:creator ?o3 .
        ?dataset dcat:landingPage ?o4 .
    }
    FILTER NOT EXISTS { ?dataset schema:contentRating ?o5 . }
}

☆ rating

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
INSERT {
    GRAPH <https://data.netwerkdigitaalerfgoed.nl/registry/description_ratings> {
        ?dataset schema:contentRating "☆"
    }
} WHERE {
    ?dataset a dcat:Dataset ;
             dct:license ?o1 ;
             dct:title ?o2 ;
             dct:publisher ?o3 .
    FILTER NOT EXISTS {
        ?dataset dct:description ?o4 .
        ?dataset dcat:distribution ?o5 .
    }
    FILTER NOT EXISTS { ?dataset schema:contentRating ?o6 . }
}

TODO

  • check for completeness (do all datasets have a rating?)
  • determine how/when to calculate ratings (e.g. remove the graph and execute the insert queries above)
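
A rough sketch of both TODO items, modelled in plain Python rather than against the triple store: recalculate rebuilds all ratings from scratch (comparable to removing the graph and re-running the inserts), and unrated lists the datasets that ended up without a rating. The function names, the toy rate rule and the example.org IRIs are assumptions made for this example.

```python
from typing import Callable, Optional


def recalculate(
    datasets: dict[str, set[str]],
    rate: Callable[[set[str]], Optional[str]],
) -> dict[str, str]:
    """Rebuild every rating from scratch, like dropping the graph and re-inserting."""
    ratings: dict[str, str] = {}
    for iri, properties in datasets.items():
        rating = rate(properties)
        if rating is not None:  # datasets matching no rule stay unrated
            ratings[iri] = rating
    return ratings


def unrated(datasets: dict[str, set[str]], ratings: dict[str, str]) -> list[str]:
    """Completeness check: which datasets did not receive any rating?"""
    return sorted(iri for iri in datasets if iri not in ratings)


def toy_rate(properties: set[str]) -> Optional[str]:
    """Toy rule for the example: one star only when the required properties are present."""
    return "☆" if {"license", "title", "publisher"} <= properties else None


# Hypothetical datasets, for illustration only.
datasets = {
    "https://example.org/dataset/a": {"license", "title", "publisher"},
    "https://example.org/dataset/b": {"title"},
}
ratings = recalculate(datasets, toy_rate)
print(unrated(datasets, ratings))
```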

Selection

The rating can be used for sorting and to show the rating (in the demonstrator), with a query like:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
SELECT DISTINCT ?dataset ?title ?publisherName ?rating WHERE {
    ?dataset a dcat:Dataset ;
             dct:title ?title ;
             dct:publisher ?publisher .
    ?publisher foaf:name ?publisherName .
    OPTIONAL {
        ?dataset schema:contentRating ?rating
    }
    FILTER(LANG(?title) = "" || LANGMATCHES(LANG(?title), "nl"))
    FILTER(LANG(?publisherName) = "" || LANGMATCHES(LANG(?publisherName), "nl")) 
    FILTER CONTAINS(LCASE(?title),"archief") .
} ORDER BY DESC(?rating) ?title
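
For a client that receives the result rows unsorted, the same ordering can be approximated as follows. This is a sketch with made-up rows; sort_rows is a hypothetical helper. It orders by the number of stars (descending), then by title, and puts rows without a rating last.

```python
def sort_rows(rows: list[dict]) -> list[dict]:
    """Order rows by star count (high to low), then by title; unrated rows come last."""
    return sorted(rows, key=lambda row: (-len(row.get("rating") or ""), row["title"]))


# Made-up rows, for illustration only.
rows = [
    {"title": "Archief B", "rating": "☆☆"},
    {"title": "Archief A", "rating": "☆☆☆☆☆"},
    {"title": "Archief C"},
    {"title": "Archief A2", "rating": "☆☆"},
]
for row in sort_rows(rows):
    print(row.get("rating", "(no rating)"), row["title"])
```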

The following aggregation query shows the number of datasets with a specific number of stars:

PREFIX schema: <http://schema.org/>
SELECT ?rating (COUNT(*) AS ?datasets_with_rating) WHERE {
    ?dataset schema:contentRating ?rating .
} GROUP BY ?rating
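
The same grouping can be mimicked outside the triple store with a counter. The IRIs and ratings below are made up for the example and are not taken from the registry.

```python
from collections import Counter

# Hypothetical dataset IRIs with their ratings (illustrative only).
ratings = {
    "https://example.org/dataset/1": "☆☆",
    "https://example.org/dataset/2": "☆",
    "https://example.org/dataset/3": "☆☆",
    "https://example.org/dataset/4": "☆☆☆☆☆",
}

# Equivalent of GROUP BY ?rating with COUNT(*).
datasets_with_rating = Counter(ratings.values())

# Print from the lowest to the highest number of stars.
for rating in sorted(datasets_with_rating, key=len):
    print(rating, datasets_with_rating[rating])
```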

Output on 21-2-2023:

rating datasets_with_rating
"☆" "279"^^xsd:integer
"☆☆" "376"^^xsd:integer
"☆☆☆" "11"^^xsd:integer
"☆☆☆☆" "9"^^xsd:integer
"☆☆☆☆☆" "111"^^xsd:integer

coret commented Apr 5, 2023

Suggestion for the demonstrator: make the stars a link to an explanation page so the user can read why the dataset got that specific number of stars, and what can be done to acquire more stars. The easiest option is to make 5 static pages. Somewhat harder is to make a page specific to each dataset description.
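
A dataset-description-specific explanation could be derived from the same rules. The sketch below is one possible shape, not a design decision: LEVELS and next_star_hint are hypothetical names, and the property names are the short names from the rating rules in the opening comment.

```python
# One entry per star level: ("all", names) needs every name, ("any", names) needs at least one.
LEVELS = [
    ("all", ["license", "title", "publisher"]),
    ("all", ["description", "distribution"]),
    ("all", ["creator", "landingPage"]),
    ("any", ["created", "modified", "issued"]),
    ("any", ["language", "source", "keyword", "spatial", "temporal"]),
]


def next_star_hint(properties: set[str]) -> tuple[int, str]:
    """Return (current stars, advice for reaching the next star)."""
    for stars, (mode, names) in enumerate(LEVELS):
        present = [name for name in names if name in properties]
        if mode == "all" and len(present) < len(names):
            missing = [name for name in names if name not in properties]
            return stars, "add " + ", ".join(missing)
        if mode == "any" and not present:
            return stars, "add at least one of " + ", ".join(names)
    return len(LEVELS), "nothing to add"


print(next_star_hint({"license", "title", "publisher", "description"}))
```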

ddeboer commented Jun 28, 2023

In our meeting on 28 June 2023, we decided:

  • to call this rating completeness (volledigheid) instead of quality (kwaliteit)
  • to visualise this as a completion bar instead of stars, to nudge publishers to complete their dataset description; @eddeheerna will ask a UX designer to come up with a good solution
  • to show ‘dataset description completeness’ next to ‘dataset quality’ (which we cannot rate as of yet).
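
Pending the UX design, a text-only placeholder for such a completion bar might look like the sketch below. It assumes the five rating levels are kept and simply maps the number of levels met to a percentage; completion_bar and its parameters are hypothetical.

```python
def completion_bar(levels_met: int, total_levels: int = 5, width: int = 10) -> str:
    """Render the number of satisfied rating levels as a text bar with a percentage."""
    if not 0 <= levels_met <= total_levels:
        raise ValueError("levels_met must be between 0 and total_levels")
    filled = width * levels_met // total_levels
    percent = 100 * levels_met // total_levels
    return "[" + "#" * filled + "-" * (width - filled) + f"] {percent}%"


print(completion_bar(2))  # prints "[####------] 40%"
```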
