Example Configuration

A simple example index configuration looks as follows:

version: "1.0"
indexMode: INDEX_SPARQL_ENDPOINT
sparqlEndpoint: https://dbpedia.org/sparql
indexFields:
  - fieldName: label
    documentVariable: resource
    query: >
      SELECT ?resource ?label WHERE {
        {
          ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
          FILTER(lang(?label) = 'en')
          #VALUES#
        }
      } 
      LIMIT 10000

The configuration fields have the following meaning (less important fields are omitted here and are described in the Configuration section below):

indexMode: In this example, we are indexing a knowledge graph using its SPARQL endpoint, hence index mode is set to INDEX_SPARQL_ENDPOINT.
sparqlEndpoint: Needs to be specified when indexMode is set to INDEX_SPARQL_ENDPOINT. In this example, it points to the SPARQL endpoint of the DBpedia knowledge graph.
indexFields: The index fields are the core of a lookup indexer configuration. They consist of a list of index field objects, each with the following subfields:
- fieldName: The name of the field to be indexed. This is also the name of the variable in the query that will hold the field values.
- documentVariable: The name of the variable in the query that will hold the ID of the target Lucene document.
- query: The query describing the selection of document ID and field value from the target knowledge graph.
- the #VALUES# comment in the query: #VALUES# is a SPARQL comment and thus usually ignored. It acts as a placeholder for a SPARQL VALUES clause that gets inserted automatically when passing a list of URIs to the indexing process via the values parameter. This can be used to only retrieve bindings for a specific set of resources.
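
To make the placeholder mechanism concrete, here is a minimal Python sketch of how such a substitution could work. It is not the indexer's actual implementation: the function name `insert_values_clause` is hypothetical, and binding the VALUES clause to the document variable is an assumption.

```python
def insert_values_clause(query: str, variable: str, uris: list[str]) -> str:
    """Replace the #VALUES# placeholder with a SPARQL VALUES clause.

    Assumption: the clause restricts the given variable to the given URIs.
    """
    if not uris:
        # Nothing to restrict: leave the comment in place, SPARQL ignores it.
        return query
    terms = " ".join(f"<{uri}>" for uri in uris)
    return query.replace("#VALUES#", f"VALUES ?{variable} {{ {terms} }}")


QUERY = """SELECT ?resource ?label WHERE {
  {
    ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    FILTER(lang(?label) = 'en')
    #VALUES#
  }
}
LIMIT 10000"""

# Restrict the query to a single resource.
restricted = insert_values_clause(
    QUERY, "resource", ["http://dbpedia.org/resource/Berlin"]
)
```

With the clause in place, the query only returns bindings for the listed resources. Without any URIs, the query is left unchanged.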

Testing the Queries

Before running the indexer, it is best to test each query in order to rule out any bugs that might be hard to trace later on.

Since this example is configured to run queries against a SPARQL endpoint (indexMode set to INDEX_SPARQL_ENDPOINT with https://dbpedia.org/sparql), we can simply test our queries via HTTP requests. The first configured query can be tested by running

SELECT ?resource ?label WHERE {
  {
    ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    FILTER(lang(?label) = 'en')
    #VALUES#
  }
} 
LIMIT 10000

against the DBpedia SPARQL endpoint.
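
One possible way to send such a test request is sketched below in Python, using only the standard library. It assumes the endpoint accepts the standard SPARQL protocol (`query` parameter on a GET request) and returns JSON when asked for `application/sparql-results+json`. The helper names `build_request` and `run_query` are hypothetical, and the LIMIT is lowered so the test returns quickly.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

# The configured query, with a smaller LIMIT for a quick test run.
TEST_QUERY = """SELECT ?resource ?label WHERE {
  {
    ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    FILTER(lang(?label) = 'en')
    #VALUES#
  }
}
LIMIT 10"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str) -> list[dict]:
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(ENDPOINT, TEST_QUERY)` performs the actual network request. If the query is valid, each returned binding should contain a `resource` and a `label` entry.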

The result is a list of bindings, each with two entries. The first entry (resource) will be used as the document ID, while the second entry (label) will be written to the respective document under the field label. A document created by this query could look like the following:

Document
  id: 'http://dbpedia.org/resource/Berlin'
  label: 'Berlin'

A user searching for the string "Berl" over the field label will then be able to quickly retrieve the entire document, since the label field value partially matches the search string.
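
The mapping from bindings to documents can be sketched in Python as follows. This is only an illustration: the sample binding is hand-written in the SPARQL JSON results format, storing field values as lists is an assumption, and `naive_lookup` is a plain prefix match standing in for the real index search.

```python
from collections import defaultdict


def bindings_to_documents(bindings, document_variable, field_name):
    """Group SPARQL JSON bindings into documents keyed by document ID."""
    documents = defaultdict(lambda: defaultdict(list))
    for binding in bindings:
        doc_id = binding[document_variable]["value"]
        documents[doc_id][field_name].append(binding[field_name]["value"])
    # Convert the nested defaultdicts into plain dictionaries.
    return {doc_id: dict(fields) for doc_id, fields in documents.items()}


def naive_lookup(documents, field_name, search):
    """Stand-in for the index search: case-insensitive prefix match."""
    needle = search.lower()
    return [
        doc_id
        for doc_id, fields in documents.items()
        if any(v.lower().startswith(needle) for v in fields.get(field_name, []))
    ]


# Hand-written sample binding, not actual endpoint output.
sample_bindings = [
    {
        "resource": {"type": "uri", "value": "http://dbpedia.org/resource/Berlin"},
        "label": {"type": "literal", "xml:lang": "en", "value": "Berlin"},
    }
]
documents = bindings_to_documents(sample_bindings, "resource", "label")
matches = naive_lookup(documents, "label", "Berl")
```

Here `documentVariable` selects the document ID and `fieldName` selects the value stored under that field, mirroring the two entries of each binding.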

Configuration

indexPath

The path of the target folder for the index structure (absolute or relative to the configuration file). This can be either an empty folder or a folder containing an already existing index structure.

dataPath

[Optional] This variable is only required when indexing the contents of RDF files. Points to the folder containing the files to index.

databasePath

[Optional] This variable is only required when running with index mode BUILD_AND_INDEX_ON_DISK or INDEX_ON_DISK. Specifies the path of the on-disk graph database, which is required for both modes.

sparqlEndpoint

[Optional] Only needs to be specified when indexMode is set to INDEX_SPARQL_ENDPOINT. Specifies the target SPARQL endpoint URL.

indexMode

Defines the indexing approach. Has to be one of INDEX_IN_MEMORY, BUILD_AND_INDEX_ON_DISK, INDEX_ON_DISK or INDEX_SPARQL_ENDPOINT (see enum IndexMode). The index modes change the behaviour of the lookup indexer as follows:

INDEX_IN_MEMORY

The indexer loads the content of the RDF files defined at dataPath into an in-memory graph database, which is then used to execute the configured SPARQL queries. Only works for small to medium files, since the index structure can eat up a lot of RAM.

BUILD_AND_INDEX_ON_DISK

Similar to INDEX_IN_MEMORY, but the graph database is created on-disk (see TDB2). This takes considerably more time and makes querying slower but can handle much larger files, since disk space is generally more abundant than memory. The on-disk graph database will be created at the path specified in databasePath.

INDEX_ON_DISK

Same as BUILD_AND_INDEX_ON_DISK but skips the on-disk graph database creation step. This mode tries to load an existing on-disk graph database from databasePath as a SPARQL query target.

INDEX_SPARQL_ENDPOINT

Runs the configured queries against the SPARQL endpoint URL specified in sparqlEndpoint.
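
The dependencies between indexMode and the path settings can be summarized in a small Python check. This is a sketch based on the descriptions above, not part of the indexer: the function `missing_keys` is hypothetical, and requiring dataPath for BUILD_AND_INDEX_ON_DISK is an assumption derived from its similarity to INDEX_IN_MEMORY.

```python
# Settings each index mode needs, as read from the descriptions above.
REQUIRED_KEYS = {
    "INDEX_IN_MEMORY": ["dataPath"],
    "BUILD_AND_INDEX_ON_DISK": ["dataPath", "databasePath"],  # dataPath assumed
    "INDEX_ON_DISK": ["databasePath"],
    "INDEX_SPARQL_ENDPOINT": ["sparqlEndpoint"],
}


def missing_keys(config: dict) -> list[str]:
    """Return the settings that the configured index mode needs but lacks."""
    mode = config.get("indexMode")
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"unknown indexMode: {mode!r}")
    return [key for key in REQUIRED_KEYS[mode] if not config.get(key)]


example = {
    "indexMode": "INDEX_SPARQL_ENDPOINT",
    "sparqlEndpoint": "https://dbpedia.org/sparql",
}
problems = missing_keys(example)
```

An empty list means the configuration provides everything its index mode needs.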

indexFields

The index fields are the core of a lookup indexer configuration. They consist of a list of index field objects that have the following subfields:

fieldName

The name of the field to be indexed. This is also the name of the variable in the query that will hold the field values.

documentVariable

The name of the variable in the query that will hold the ID of the target Lucene document. Has to match one of the binding variables selected in query.

query

The SPARQL select query describing the selection of document ID and field value from the target knowledge graph. The binding variables of the select query must contain variable names matching the values specified in fieldName and documentVariable.
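
The naming requirement can be checked before running the indexer. The following Python sketch uses a simple regular expression rather than a SPARQL parser, so it is only a rough check: it does not handle `SELECT *` or queries that omit the WHERE keyword. The function names are hypothetical.

```python
import re


def selected_variables(query: str) -> set[str]:
    """Return the variable names listed between SELECT and WHERE."""
    match = re.search(r"\bSELECT\b(.*?)\bWHERE\b", query, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no SELECT ... WHERE clause found")
    return set(re.findall(r"[?$](\w+)", match.group(1)))


def check_index_field(field: dict) -> list[str]:
    """Return the configured names that the query does not select."""
    selected = selected_variables(field["query"])
    return [
        name
        for name in (field["documentVariable"], field["fieldName"])
        if name not in selected
    ]


index_field = {
    "fieldName": "label",
    "documentVariable": "resource",
    "query": "SELECT ?resource ?label WHERE { ?resource ?p ?label . }",
}
unmatched = check_index_field(index_field)
```

An empty result means both fieldName and documentVariable appear among the selected variables.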
