The goal of this tutorial is to learn how to take advantage of Elasticsearch by manipulating record fields before they are indexed.
- Step 1: Bootstrap exercise
- Step 2: Modify the record before indexing
- Step 3: Try it!
- What did we learn
Let's imagine that we now have a new use case: when retrieving a list of records from our REST endpoint, we would like each record to have an extra field that counts its contributors. Moreover, we don't actually need the keywords field, so we can remove it.
For example, given the following record:
{
  "id": 1,
  "title": "Invenio is awesome",
  "keywords": ["invenio", "CERN"],
  "contributors": [
    {
      "name": "Stark, Tony"
    },
    {
      "name": "Kent, Clark"
    }
  ]
}
it would be handy to have an extra field contributors_count with the value 2 and to skip the keywords field, like this:
{
  "id": 1,
  "title": "Invenio is awesome",
  "contributors": [
    {
      "name": "Stark, Tony"
    },
    {
      "name": "Kent, Clark"
    }
  ],
  "contributors_count": 2
}
Let's see how to do it.
If you completed the previous tutorial, you can skip this step. If you would rather start from a clean state, run the following commands:
$ cd ~/src/training/
$ ./start-from.sh 09-deposit-form
We are going to take advantage of the invenio-indexer signal before_record_index to modify the record fields before indexing. This signal is sent every time a record is about to be indexed, just before the data goes to Elasticsearch.
If it doesn't exist yet, create a new file indexer.py and copy the following code into it:
my-site/my_site/records/indexer.py
"""Record modification prior to indexing."""

from __future__ import absolute_import, print_function


def indexer_receiver(
    sender,
    json=None,
    record=None,
    index=None,
    doc_type=None,
    arguments=None
):
    """Connect to before_record_index signal to transform record for ES.

    :param sender: The Flask application
    :param json: The dumped record dictionary which can be modified.
    :param record: The record being indexed.
    :param index: The index in which the record will be indexed.
    :param doc_type: The doc_type for the record.
    :param arguments: The arguments to pass to Elasticsearch for indexing.
    """
    # delete the `keywords` field before indexing
    if 'keywords' in json:
        del json['keywords']

    # count the number of contributors and add the new field
    contributors = json.get('contributors', [])
    json['contributors_count'] = len(contributors)
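Before wiring the receiver into the application, it can be useful to call it directly on a plain dictionary. The snippet below is only a sketch: it repeats a compact copy of the receiver so that it runs on its own, passes None as the sender (an assumption made for the test; in the application the sender is the Flask app), and uses the example record from the introduction.

```python
def indexer_receiver(sender, json=None, record=None, index=None,
                     doc_type=None, arguments=None):
    """Compact copy of the receiver, repeated so this snippet runs alone."""
    if 'keywords' in json:
        del json['keywords']
    json['contributors_count'] = len(json.get('contributors', []))


# The example record from the introduction, as a plain dict.
dumped = {
    "id": 1,
    "title": "Invenio is awesome",
    "keywords": ["invenio", "CERN"],
    "contributors": [{"name": "Stark, Tony"}, {"name": "Kent, Clark"}],
}

# The receiver modifies the dict in place and returns nothing.
indexer_receiver(None, json=dumped)
print(dumped)

# A record without contributors still gets the new field, with value 0.
empty = {"id": 2, "title": "No contributors"}
indexer_receiver(None, json=empty)
print(empty)
```

Note that the receiver works on the dumped dictionary, not on the record object, so the record stored in the database keeps its keywords field.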
Now we need to register the receiver in our Invenio instance. We connect our indexer_receiver function to the signal in ext.py, inside the init_app method of our extension.
my-site/my_site/records/ext.py
 from __future__ import absolute_import, print_function

+from invenio_indexer.signals import before_record_index
+from .indexer import indexer_receiver
 from . import config

 ...
     def init_app(self, app):
         """Flask application initialization."""
         self.init_config(app)
         app.extensions['my-site'] = self
+        before_record_index.connect(indexer_receiver, sender=app, weak=False)
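To see why connecting a receiver is enough to change what gets indexed, here is a toy model of the connect/send mechanism. It is a deliberately simplified sketch: ToySignal, before_index and add_count are hypothetical names invented for this illustration and are not the classes used by invenio-indexer. The assumption it illustrates is that the indexer sends the signal with the dumped dictionary and then indexes that same dictionary, so in-place changes made by a receiver end up in Elasticsearch.

```python
class ToySignal:
    """Hypothetical, simplified stand-in for a signal (not the real library class)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver, sender=None):
        # Remember the receiver and the sender it wants to listen to.
        self._receivers.append((receiver, sender))

    def send(self, sender, **kwargs):
        # Call every receiver registered for this sender (or for any sender).
        for receiver, wanted in self._receivers:
            if wanted is None or wanted is sender:
                receiver(sender, **kwargs)


before_index = ToySignal()  # hypothetical name, mirrors before_record_index
app = object()              # stands in for the Flask application


def add_count(sender, json=None, **kwargs):
    """Toy receiver: add the contributors count to the dumped dict."""
    json["contributors_count"] = len(json.get("contributors", []))


before_index.connect(add_count, sender=app)

data = {"title": "Invenio is awesome", "contributors": [{"name": "Kent, Clark"}]}
before_index.send(app, json=data)       # same sender: the receiver runs

other = {"title": "Sent by someone else"}
before_index.send(object(), json=other)  # different sender: the receiver is skipped

print(data)
print(other)
```

Passing sender=app, as in the toy example, restricts the receiver to signals sent by that particular application object.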
Finally, let's update the Elasticsearch mappings to match the fields that we have changed: remove keywords and add contributors_count.
my-site/my_site/records/mappings/v7/records/record-v1.0.0.json
     "id": {
       "type": "keyword"
     },
-    "keywords": {
-      "type": "keyword"
-    },
     "publication_date": {
       "type": "date",
       "format": "date"
     },
+    "contributors_count": {
+      "type": "short"
+    },
     "contributors": {
       "type": "object",
       "properties": {
The code is now ready and we can try it. Since we have changed the Elasticsearch mappings, we need to re-create them.
$ cd ~/src/my-site
$ pipenv run pip install -e .
$ pipenv run invenio index destroy --force --yes-i-know
$ pipenv run invenio index init --force
$ pipenv run invenio index queue init purge
$ ./scripts/server
If you have a clean instance, create a record like this:
$ curl -k --header "Content-Type: application/json" \
--request POST \
--data '{"title": "Invenio is awesome", "contributors": [{"name": "Kent, Clark"}], "owner": 1}' \
"https://127.0.0.1:5000/api/records/?prettyprint=1"
Stop the server. Let's re-index all records:
$ cd ~/src/my-site
$ pipenv run invenio index reindex --pid-type recid --yes-i-know
$ pipenv run invenio index run
We can now create a new record, using the deposit of the previous exercise, and verify that in Elasticsearch we can see the modified fields.
$ ./scripts/server
$ firefox "http://127.0.0.1:9200/records/_search?pretty=true"
Let's try to add a record with more contributors:
$ curl -k --header "Content-Type: application/json" \
--request POST \
--data '{"title": "Invenio is awesome 2", "contributors": [{"name": "Kent, Clark"}, {"name": "Wayne, Bruce"}, {"name": "Stark, Tony"}], "owner": 1}' \
"https://127.0.0.1:5000/api/records/?prettyprint=1"
$ firefox "http://127.0.0.1:9200/records/_search?pretty=true"
The contributors_count field for the last created record should have the value 3.
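If you prefer to check the result from a script rather than in the browser, the following sketch reads the search response and lists the contributors_count of every hit. The helper names contributor_counts and fetch_counts are made up for this example, the URL is the local Elasticsearch address used above, and the sample response is hand-written for illustration, not real output.

```python
import json
from urllib.request import urlopen


def contributor_counts(search_response):
    """Map each hit's title to its contributors_count (None if the field is absent)."""
    hits = search_response.get("hits", {}).get("hits", [])
    return {
        hit["_source"].get("title"): hit["_source"].get("contributors_count")
        for hit in hits
    }


def fetch_counts(url="http://127.0.0.1:9200/records/_search"):
    """Query the local Elasticsearch instance and summarise the hits."""
    with urlopen(url) as response:
        return contributor_counts(json.load(response))


# Hand-written sample in the shape of a search response, for illustration only.
sample = {
    "hits": {
        "hits": [
            {"_source": {"title": "Invenio is awesome",
                         "contributors_count": 1}},
            {"_source": {"title": "Invenio is awesome 2",
                         "contributors_count": 3}},
        ]
    }
}
print(contributor_counts(sample))
```

With the server and Elasticsearch running, calling fetch_counts() should return the same kind of dictionary for the records in your index. A value of None for a title would suggest that the record was indexed before the receiver was connected and has not been re-indexed yet.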
- We have seen how to connect to a signal
- We have learned how to modify data before indexing
- We have seen how to re-index all our records