A simple example index configuration looks as follows:
    version: "1.0"
    indexMode: INDEX_SPARQL_ENDPOINT
    sparqlEndpoint: https://dbpedia.org/sparql
    indexFields:
      - fieldName: label
        documentVariable: resource
        query: >
          SELECT ?resource ?label WHERE {
            {
              ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
              FILTER(lang(?label) = 'en')
              #VALUES#
            }
          }
          LIMIT 10000
The configuration fields have the following meaning (less important fields are omitted here and described in the configuration sections below):
- indexMode: In this example, we are indexing a knowledge graph using its SPARQL endpoint, hence index mode is set to INDEX_SPARQL_ENDPOINT.
- sparqlEndpoint: Must be specified when indexMode is set to INDEX_SPARQL_ENDPOINT. In this example, it points to the SPARQL endpoint of the DBpedia knowledge graph.
- indexFields: The index fields are the core of a lookup indexer configuration. They consist of a list of index field objects with the following subfields:
  - fieldName: The name of the field to be indexed. This is also the name of the variable in the query that will hold the field values.
  - documentVariable: The name of the variable in the query that will hold the ID of the target Lucene document.
  - query: The query that selects the document ID and field value from the target knowledge graph.
  - The #VALUES# comment in the query: #VALUES# is a SPARQL comment and is thus normally ignored. It acts as a placeholder for a SPARQL VALUES clause that is inserted automatically when a list of URIs is passed to the indexing process via the values parameter. This can be used to retrieve bindings only for a specific set of resources.
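The placeholder mechanism can be pictured with a short Python sketch. This is not the indexer's actual implementation: the helper name insert_values is hypothetical, and the sketch assumes that the VALUES clause binds the document variable (here resource).

```python
def insert_values(query: str, variable: str, uris: list[str]) -> str:
    """Replace the #VALUES# placeholder with a SPARQL VALUES clause.

    Hypothetical helper for illustration; the real indexer may differ.
    """
    if not uris:
        # Without URIs the placeholder simply stays a SPARQL comment.
        return query
    terms = " ".join(f"<{uri}>" for uri in uris)
    return query.replace("#VALUES#", f"VALUES ?{variable} {{ {terms} }}")


query = """SELECT ?resource ?label WHERE {
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  FILTER(lang(?label) = 'en')
  #VALUES#
}"""

restricted = insert_values(query, "resource", ["http://dbpedia.org/resource/Berlin"])
print(restricted)
```

With an empty URI list the query is returned unchanged, which mirrors the behaviour described above: the comment is ignored and all matching resources are retrieved.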
Before running the indexer, it is best to test each query in order to rule out any bugs that might be hard to trace later on.
Since this example is configured to run queries against a SPARQL endpoint (index mode set to INDEX_SPARQL_ENDPOINT with https://dbpedia.org/sparql), we can simply test our queries via HTTP requests. The first configured query can be tested by running
    SELECT ?resource ?label WHERE {
      {
        ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
        FILTER(lang(?label) = 'en')
        #VALUES#
      }
    }
    LIMIT 10000
against the DBpedia SPARQL endpoint.
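One way to send such a test query over HTTP is sketched below in Python, using only the standard library. It assumes the endpoint accepts the query as a GET parameter named query and returns JSON when asked for application/sparql-results+json, as the SPARQL 1.1 Protocol describes. The helper names are hypothetical, and the LIMIT is lowered to 10 to keep the test response small.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """SELECT ?resource ?label WHERE {
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  FILTER(lang(?label) = 'en')
}
LIMIT 10"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str) -> dict:
    """Send the request and parse the response; needs network access."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as resp:
        return json.load(resp)


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
# To execute the query against the live endpoint:
# results = run_query(ENDPOINT, QUERY)
```

Building the request separately from sending it makes it easy to inspect the exact URL before any traffic reaches the endpoint.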
The result is a list of bindings, each with two entries. The first entry (resource) will be used as the document ID, while the second entry (label) will be written to the respective document under the field label. A document created by this query could look like the following:
    Document
      id: 'http://dbpedia.org/resource/Berlin'
      label: 'Berlin'
A user searching for the string "Berl" over the field label will then be able to quickly retrieve the entire document, since the search string matches part of the label field value.
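The mapping from query bindings to documents can be illustrated with a small Python sketch. It is only a stand-in for the real index: the actual indexer writes Lucene documents, whereas this sketch uses plain dictionaries and substring matching. The function names and the sample bindings (shaped like SPARQL JSON results) are assumptions made for illustration.

```python
from collections import defaultdict

# Sample bindings in the shape of the SPARQL JSON results format.
bindings = [
    {"resource": {"type": "uri", "value": "http://dbpedia.org/resource/Berlin"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Berlin"}},
    {"resource": {"type": "uri", "value": "http://dbpedia.org/resource/Paris"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Paris"}},
]


def build_documents(bindings, document_variable, field_name):
    """Group field values by document ID (one document per resource)."""
    documents = defaultdict(lambda: defaultdict(list))
    for row in bindings:
        doc_id = row[document_variable]["value"]
        documents[doc_id][field_name].append(row[field_name]["value"])
    return documents


def search(documents, field_name, text):
    """Return the IDs of documents with a field value containing `text`."""
    needle = text.lower()
    return [doc_id for doc_id, fields in documents.items()
            if any(needle in value.lower() for value in fields.get(field_name, []))]


docs = build_documents(bindings, "resource", "label")
print(search(docs, "label", "Berl"))  # ['http://dbpedia.org/resource/Berlin']
```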
The path of the target folder for the index structure (absolute or relative to the configuration file). This can be either an empty folder or a folder containing an already existing index structure.
- dataPath: [Optional] Only required when indexing the contents of RDF files. Points to the folder containing the files to index.
- databasePath: [Optional] Only required when running with index mode BUILD_AND_INDEX_ON_DISK or INDEX_ON_DISK. Specifies the path of the on-disk graph database that both modes require.
- sparqlEndpoint: [Optional] Only needs to be specified when indexMode is set to INDEX_SPARQL_ENDPOINT. Specifies the target SPARQL endpoint URL.
- indexMode: Defines the indexing approach. Has to be one of INDEX_IN_MEMORY, BUILD_AND_INDEX_ON_DISK, INDEX_ON_DISK or INDEX_SPARQL_ENDPOINT (see enum IndexMode). The index modes change the behaviour of the lookup indexer as follows:
  - INDEX_IN_MEMORY: The indexer loads the content of the RDF files defined at dataPath into an in-memory graph database, which is then used to execute the configured SPARQL queries. Only works for small to medium files, since the index structure can consume a lot of RAM.
  - BUILD_AND_INDEX_ON_DISK: Similar to INDEX_IN_MEMORY, but the graph database is created on disk (see TDB2). This takes considerably more time and makes querying slower, but it can handle much larger files, since disk space is generally more abundant than memory. The on-disk graph database will be created at the path specified in databasePath.
  - INDEX_ON_DISK: Same as BUILD_AND_INDEX_ON_DISK, but skips the on-disk graph database creation step. This mode tries to load an existing on-disk graph database from databasePath as the SPARQL query target.
  - INDEX_SPARQL_ENDPOINT: Runs the configured queries against the SPARQL endpoint URL specified in sparqlEndpoint.
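The dependency between index mode and the optional settings can be summarised in a small Python check. This is a sketch based only on the descriptions in this section, not on the indexer's own validation logic; in particular, listing dataPath as required for BUILD_AND_INDEX_ON_DISK is an inference from that mode being described as similar to INDEX_IN_MEMORY. The function name missing_settings is hypothetical.

```python
# Settings each index mode appears to need, as described in this section.
REQUIRED_SETTINGS = {
    "INDEX_IN_MEMORY": ["dataPath"],
    "BUILD_AND_INDEX_ON_DISK": ["dataPath", "databasePath"],  # dataPath inferred
    "INDEX_ON_DISK": ["databasePath"],
    "INDEX_SPARQL_ENDPOINT": ["sparqlEndpoint"],
}


def missing_settings(config: dict) -> list[str]:
    """Return the settings the configured index mode needs but lacks."""
    mode = config.get("indexMode")
    if mode not in REQUIRED_SETTINGS:
        raise ValueError(f"Unknown indexMode: {mode!r}")
    return [key for key in REQUIRED_SETTINGS[mode] if not config.get(key)]


config = {
    "indexMode": "INDEX_SPARQL_ENDPOINT",
    "sparqlEndpoint": "https://dbpedia.org/sparql",
}
print(missing_settings(config))  # []
```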
- indexFields: The index fields are the core of a lookup indexer configuration. They consist of a list of index field objects that have the following subfields:
  - fieldName: The name of the field to be indexed. This is also the name of the variable in the query that will hold the field values.
  - documentVariable: The name of the variable in the query that will hold the ID of the target Lucene document. Has to match one of the binding variables selected in query.
  - query: The SPARQL select query that selects the document ID and field value from the target knowledge graph. The binding variables of the select query must contain variable names matching the values specified in fieldName and documentVariable.
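The constraint on binding variables can be checked before indexing with a sketch like the following. It is a deliberately simple regular-expression check rather than a SPARQL parser, so it assumes a plain projection of the form SELECT ?a ?b WHERE and does not handle SELECT * or expressions such as (... AS ?x). The function names are hypothetical.

```python
import re


def select_variables(query: str) -> set[str]:
    """Extract variable names from the projection of a SELECT query."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.*?)\s*WHERE", query, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("No SELECT ... WHERE projection found")
    return set(re.findall(r"[?$](\w+)", match.group(1)))


def check_index_field(field: dict) -> list[str]:
    """Return the configured names that the query does not select."""
    variables = select_variables(field["query"])
    names = (field["fieldName"], field["documentVariable"])
    return [name for name in names if name not in variables]


field = {
    "fieldName": "label",
    "documentVariable": "resource",
    "query": "SELECT ?resource ?label WHERE { ?resource ?p ?label . } LIMIT 10",
}
print(check_index_field(field))  # []
```

An empty list means both fieldName and documentVariable appear among the selected variables; any returned name points to a mismatch between the configuration and the query.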