This repository hosts the new version of the IndeGx application. This new version is a TypeScript application interfaced with a Corese server instance, both encapsulated in a Docker image. The generation rules have been rewritten, for greater consistency, as SPARQL Update queries sent to the Corese server.
IndeGx is a framework for building an RDF knowledge graph that can be used as an index of knowledge graphs available online. It uses a set of rules, expressed as SPARQL queries, to extract and compute the content of this index.
IndeGx is designed to make full use of semantic web standards. Any rule expressed in standard SPARQL syntax can be used to generate a knowledge graph.
This new version of IndeGx has advantages compared to the previous one in the DeKaLoG repository:
- The previous version was a Java application based on Apache Jena; this version uses an engine written in TypeScript with rdflib, graphy, and sparqljs, coupled with a Corese server, packaged as a Docker application.
- Endpoints are treated in parallel.
- Simple queries are automatically paginated to avoid overwhelming SPARQL endpoints.
- Corese is used as an interface with SPARQL endpoints to reduce data loss caused by imperfect standards compliance in remote SPARQL endpoints.
- Rules are now expected to make heavy use of federated querying, with the `SERVICE` clause.
- The application of several rules can be defined as a prerequisite to the application of another.
- End of the distinction between `CONSTRUCT` and `UPDATE` rules to differentiate between the application of local and remote queries. Only test queries are expected to be `SELECT`, `ASK`, or `CONSTRUCT`; all action queries are expected to be `UPDATE` queries.
- Possibility to define a set of rules as a post-treatment on the extracted data. In this case, the endpoint URL becomes the URL of the local Corese server (not accessible from outside the Docker container).
- LDScript can be integrated into rules.
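As an illustration of the federated style described above, a rule might count the triples of a remote endpoint via `SERVICE` and record the result locally. This is a hypothetical sketch, not an actual IndeGx rule; the endpoint URL and graph name are placeholders:

```sparql
# Hypothetical sketch of a federated UPDATE rule: the remote endpoint is
# queried through a SERVICE clause and the result is stored in a local graph.
PREFIX void: <http://rdfs.org/ns/void#>
INSERT {
  GRAPH <http://example.org/index> {
    <https://example.org/sparql> void:triples ?count .
  }
}
WHERE {
  SERVICE <https://example.org/sparql> {
    SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o . }
  }
}
```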
The index extraction rules have been modified as follows:
- A knowledge base description's asserted and computed elements are now separated into different graphs. Each description now consists of three graphs for one KB. More details are in the rules folder.
- The extraction of a summary of datasets using the HiBisCus format: `(<hostname of subject> <property> <hostname of object, or placeholder literal>)`.
- The computation of statistics of the usage of vocabularies and the usage of the hostnames of resources in the dataset.
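The summary format above can be sketched as a SPARQL query. This is a hypothetical illustration, not the actual IndeGx rule; the graph name and the hostname-extracting regular expression are assumptions:

```sparql
# Hypothetical sketch: summarize each triple by the hostnames of its subject
# and object; literal objects are replaced by a placeholder literal.
INSERT {
  GRAPH <http://example.org/summary> {
    ?subjectHost ?p ?objectHost .
  }
}
WHERE {
  ?s ?p ?o .
  BIND(IRI(REPLACE(STR(?s), "^([a-z]+://[^/]+).*", "$1")) AS ?subjectHost)
  BIND(IF(isIRI(?o),
          IRI(REPLACE(STR(?o), "^([a-z]+://[^/]+).*", "$1")),
          "literal") AS ?objectHost)
}
```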
IndeGx as an application is executed in a Docker container. It contains the IndeGx engine and a Corese server. The Corese server is used both to execute the rules against remote SPARQL endpoints and to store the results of the rule applications. The Corese server is not accessible from outside the Docker container.
IndeGx requires git and docker-compose to be installed on a Linux-based system.
To use IndeGx, first clone this repository into a folder:

```shell
git clone https://github.com/Wimmics/IndeGx.git .
```

Execution:

```shell
./run.sh
```
Two catalogs of endpoints are maintained in this repository:
- The catalog of endpoints, used in the IndeGx paper and updated every day from several online sources. The list of sources and the script used to generate the catalog are available in a dedicated script folder.
- The catalog of endpoint statuses is based on the endpoint catalog and is updated every hour to indicate whether an endpoint is online. The script is detailed in a dedicated script folder.
There are 7 directories in this repository:
- `catalog`: contains catalogs of SPARQL endpoints as RDF files. Catalogs are expected to be written using the DCAT vocabulary. They will be available from inside the Docker container at `/catalog/`.
- `config`: contains the configuration for both IndeGx and its Corese server. It will be available from inside the Docker container at `/config/`. More details in the configuration section.
- `indegx`: contains the code of the IndeGx engine.
- `input`: should contain the input files for the current rules. They will be available from inside the Docker container at `/input/`.
- `output`: will contain the output files of the execution of IndeGx, including the logs from the IndeGx engine and from the Corese server, as well as the result of the application of the rules, in the form of a TriG file. It will be available from inside the Docker container at `/output/`.
- `rules`: contains the rules for the extraction of the index. They will be available from inside the Docker container at `/rules/`. More details in the rules section.
- `scripts`: contains scripts used in the different experiments done as part of the development and publications of IndeGx.
When `run.sh` is executed, the content of the `indegx` folder is copied into the Docker container. The `catalog`, `config`, `input`, `output`, and `rules` folders are mounted as volumes in the Docker container. The `scripts` folder is neither copied nor mounted.
The IndeGx application is configured by a config file in the `/config` folder. By default, the application uses the `default.json` file. The config file is a JSON file with the following structure:
```json
{
    "manifest": "file:///input/_manifest.ttl",
    "post": "file:///input/_post_manifest.ttl",
    "catalog": "file:///input/catalog.ttl",
    "defaultQueryTimeout": 300,
    "nbFetchRetries": 10,
    "millisecondsBetweenRetries": 5000,
    "maxConccurentQueries": 300,
    "delayMillisecondsTimeForConccurentQuery": 1000,
    "logFile": "/output/indegx.log",
    "outputFile": "/output/index.trig"
}
```
The structure of the config file is the following:
```typescript
{
    manifest: string, // Path to the main manifest file. It must be in one of the mounted volumes. It must be a valid RDF file. It is recommended to store it in either the input or the rules folder.
    catalog: string, // Path to the catalog file. It must be in one of the mounted volumes. It must be a valid RDF file. It is recommended to store it in the input folder.
    post?: string, // Path to the post manifest file. It must be in one of the mounted volumes. It must be a valid RDF file. It is recommended to store it in either the input or the rules folder.
    pre?: string, // Path to the pre manifest file. It must be in one of the mounted volumes. It must be a valid RDF file. It is recommended to store it in either the input or the rules folder.
    nbFetchRetries?: number, // Default 10, number of retries to fetch a remote file if it fails.
    millisecondsBetweenRetries?: number, // Default 5000, number of milliseconds to wait between two retries.
    maxConccurentQueries?: number, // Default 300, maximum number of concurrent queries to execute against a SPARQL endpoint.
    delayMillisecondsTimeForConccurentQuery?: number, // Default 1000, number of milliseconds to wait between two queries against a SPARQL endpoint.
    defaultQueryTimeout?: number, // Default 60000, number of seconds to wait for a query to execute before considering it a failure.
    logFile?: string, // Default /output/indegx.log, path to the log file. It must be in one of the mounted volumes.
    outputFile?: string, // Default /output/index.trig, path to the output file. It must be in one of the mounted volumes.
    manifestJSON?: string, // Path to the manifest file in JSON format. It must be in one of the mounted volumes. Will not be generated if not provided.
    postManifestJSON?: string, // Path to the post manifest file in JSON format. It must be in one of the mounted volumes. Will not be generated if not provided.
    queryLog?: boolean, // Default true, log queries in the index if true.
    resilience?: boolean, // Default false, store the result of the current state at the end of the pre step and main step in a temporary file if true. Incompatible with disabling query logging.
}
```
The Corese server is configured by the `corese-properties.properties` file in the `/config` folder. It is a standard Corese configuration file; more information can be found in the Corese documentation.
The rules
contains several sets of rules used in different use cases during the development of IndeGx. The declaration of the rules is done using the IndeGx vocabulary and the W3C manifest vocabulary (see the rules folder for more details).
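As an illustration of such a declaration, a manifest using the W3C test-manifest vocabulary might look like the sketch below. The file names are hypothetical, and the actual IndeGx predicates are documented in the rules folder:

```turtle
@prefix mf: <http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#> .
@prefix : <#> .

# Hypothetical manifest sketch: include a sub-manifest and declare one entry.
<> a mf:Manifest ;
    mf:include ( <sub-manifest.ttl> ) ;  # other manifests to process first
    mf:entries ( :rule1 ) .              # rules declared in this manifest
```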
- The index extraction rules are continually updated with new features
- As soon as it is technologically sustainable, IndeGx will use the RDF Star standard
- Despite the pagination mechanism and iterative application of rules, it is possible that the extraction of data by IndeGx from a SPARQL endpoint overwhelms the capacities of the endpoint. Careful rule writing may reduce this problem. Yet, there is no way to guarantee that any rule may be successfully applied to any endpoint.
- Some HTTPS endpoints may raise SSL errors when interrogated. IndeGx relies on a Corese server to mitigate those errors. If the Corese server raises an SSL error during the application of a test, IndeGx considers this test as a failure.
```bibtex
@article{maillot2023indegx,
  TITLE = {{IndeGx: A Model and a Framework for Indexing RDF Knowledge Graphs with SPARQL-based Test Suits}},
  AUTHOR = {Maillot, Pierre and Corby, Olivier and Faron, Catherine and Gandon, Fabien and Michel, Franck},
  URL = {https://hal.science/hal-03946680},
  JOURNAL = {{Journal of Web Semantics}},
  PUBLISHER = {{Elsevier}},
  YEAR = {2023},
  MONTH = Jan,
  DOI = {10.1016/j.websem.2023.100775},
  KEYWORDS = {semantic index ; metadata extraction ; dataset description ; endpoint description ; knowledge graph},
  PDF = {https://hal.science/hal-03946680/file/_DeKaloG__IndeGx___Web_Semantics_2022-1.pdf}
}
```