Skip to content

Latest commit

 

History

History
 
 

10-indexing-records

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

Tutorial 10 - Data models: modify records before indexing

The goal of this tutorial is to learn how to take advantage of Elasticsearch by manipulating records fields when indexing.

Table of Contents

Let's imagine that we now have a new use case: when retrieving a list of records from our REST endpoint, we would like to have an extra field for each record that counts the number of contributors. Moreover, we actually don't need the keywords field, so we can remove it. For example, given the following record:

{
  "id": 1,
  "title": "Invenio is awesome",
  "keywords": ["invenio", "CERN"],
  "contributors": [
    {
      "name": "Stark, Tony"
    },
    {
      "name": "Kent, Clark"
    }
  ]
}

it would be handy to have an extra field contributors_count that has value 2 and skip the keywords field, like this:

{
  "id": 1,
  "title": "Invenio is awesome",
  "contributors": [
    {
      "name": "Stark, Tony"
    },
    {
      "name": "Kent, Clark"
    }
  ],
  "contributors_count": 2
}

Let's see how to do it.

Step 1: Bootstrap exercise

If you completed the previous tutorial, you can skip this step. If instead you would like to start from a clean state run the following commands:

$ cd ~/src/training/
$ ./start-from.sh 09-deposit-form

Step 2: Modify the record before indexing

We are going to take advantage of the invenio-indexer signal before_record_index to modify the record fields before indexing. This signal is called every time and just before indexing a record.

If it doesn't exist, create a new file indexer.py and copy the following code:

my-site/my_site/records/indexer.py

"""Record modification prior to indexing."""

from __future__ import absolute_import, print_function


def indexer_receiver(
    sender,
    json=None,
    record=None,
    index=None,
    doc_type=None,
    arguments=None
):
    """Connect to before_record_index signal to transform record for ES.

    :param sender: The Flask application
    :param json: The dumped record dictionary which can be modified.
    :param record: The record being indexed.
    :param index: The index in which the record will be indexed.
    :param doc_type: The doc_type for the record.
    :param arguments: The arguments to pass to Elasticsearch for indexing.
    """

    # delete the `keywords` field before indexing
    if 'keywords' in json:
        del json['keywords']

    # count the number of contributors and add the new field
    contributors = json.get('contributors', [])
    json['contributors_count'] = len(contributors)

Now we need to register the signal in our Invenio instance. We have to connect the signal with our indexer at ext.py in the init_app of our extension.

my-site/my_site/records/ext.py

from __future__ import absolute_import, print_function

+from invenio_indexer.signals import before_record_index
+from .indexer import indexer_receiver
from . import config

...

    def init_app(self, app):
        """Flask application initialization."""
        self.init_config(app)
        app.extensions['my-site'] = self
+       before_record_index.connect(indexer_receiver, sender=app, weak=False)

Finally, let's change the Elasticsearch mappings to update the fields that we have changed.

my-site/my_site/records/mappings/v7/records/record-v1.0.0.json

         "id": {
          "type": "keyword"
        },
-       "keywords": {
-         "type": "keyword"
-       },
        "publication_date": {
          "type": "date",
          "format": "date"
        },
+       "contributors_count": {
+         "type": "short"
+       },
        "contributors": {
          "type": "object",
          "properties": {

Step 3: Try it!

The code is now ready and we can try it. Since we have changed the Elasticsearch mappings, we need to re-create them.

$ cd ~/src/my-site
$ pipenv run pip install -e .
$ pipenv run invenio index destroy --force --yes-i-know
$ pipenv run invenio index init --force
$ pipenv run invenio index queue init purge
$ ./scripts/server

In case you have a clean instance, we can create a record like this:

$ curl -k --header "Content-Type: application/json" \
    --request POST \
    --data '{"title": "Invenio is awesome", "contributors": [{"name": "Kent, Clark"}], "owner": 1}' \
    "https://127.0.0.1:5000/api/records/?prettyprint=1"

Stop the server. Let's re-index all records:

$ cd ~/src/my-site
$ pipenv run invenio index reindex --pid-type recid --yes-i-know
$ pipenv run invenio index run

We can now create a new record, using the deposit of the previous exercise, and verify that in Elasticsearch we can see the modified fields.

$ ./scripts/server
$ firefox http://127.0.0.1:9200/records/_search?pretty=true

Let's try to add a record with more contributors:

$ curl -k --header "Content-Type: application/json" \
    --request POST \
    --data '{"title": "Invenio is awesome 2", "contributors": [{"name": "Kent, Clark"}, {"name": "Wayne, Bruce"}, {"name": "Stark, Tony"}], "owner": 1}' \
    "https://127.0.0.1:5000/api/records/?prettyprint=1"
$ firefox http://127.0.0.1:9200/records/_search?pretty=true

The contributors_count field for the last created record should have value 3.

What did we learn

  • We have seen how to connect to a signal
  • We have learned how to modify data before indexing
  • Finally, how to re-index all our records