This code project demonstrates how the G2 engine may be used with an Elasticsearch indexing engine. Elasticsearch provides enhanced searching capabilities on entity data.

The G2 data repository contains data records and observations about known entities. It determines which records match and merge to become single resolved entities. These resolved entities can then be indexed through the Elasticsearch engine, making the entity data more searchable.

Elasticsearch stores its indexed entity data in a data repository separate from the G2 engine's. Thus, Elasticsearch and G2 must both be managed in order to keep them in sync.
At Senzing, we strive to create GitHub documentation in a "don't make me think" style. For the most part, instructions are copy and paste. Whenever thinking is needed, it's marked with a "thinking" icon 🤔. Whenever customization is needed, it's marked with a "pencil" icon ✏️. If the instructions are not clear, please let us know by opening a new Documentation issue describing where we can improve. Now on with the show...
- 🤔 - A "thinker" icon means that a little extra thinking may be required. Perhaps there are some choices to be made. Perhaps it's an optional step.
- ✏️ - A "pencil" icon means that the instructions may need modification before performing.
- ⚠️ - A "warning" icon means that something tricky is happening, so pay attention.
- Space: This repository and demonstration require X GB free disk space.
- Time: Budget 30 minutes to get the demonstration up-and-running, depending on CPU and network speeds.
- Background knowledge: This repository assumes a working knowledge of:
- 🤔 Data needs to be loaded into a Senzing project before it can be posted to Elasticsearch. If you don't have any data to load, or don't know how, visit our quickstart.
- Start an instance of Elasticsearch and your favorite Elasticsearch UI. Kibana is recommended and will be assumed for the remainder of this demonstration. For guidance on how to get an instance of Elasticsearch and Kibana running, visit our doc on How to Bring Up an ELK Stack.
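As one way to satisfy this prerequisite, a minimal single-node Elasticsearch and Kibana setup can be sketched with Docker Compose. The container names, image versions, and the `senzing-network` network name below are assumptions; security is disabled, which is only appropriate for a local demonstration:

```yaml
# Hypothetical sketch: single-node Elasticsearch plus Kibana on a shared
# Docker network, with security disabled for a local demonstration only.
services:
  senzing-elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.1
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
  senzing-kibana:
    image: docker.elastic.co/kibana/kibana:8.11.1
    environment:
      - ELASTICSEARCH_HOSTS=http://senzing-elasticsearch:9200
    ports:
      - "5601:5601"
networks:
  default:
    name: senzing-network
```

Putting both containers on one named network lets later steps reach Elasticsearch by its container hostname.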
✏️ Set local environment variables. These variables may be modified, but do not need to be modified. The variables are used throughout the installation procedure.

```shell
export GIT_ACCOUNT=senzing
export GIT_REPOSITORY=elasticsearch
export GIT_ACCOUNT_DIR=~/${GIT_ACCOUNT}.git
export GIT_REPOSITORY_DIR="${GIT_ACCOUNT_DIR}/${GIT_REPOSITORY}"
```
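For illustration, with the default values the derived variables compose like this; the `echo` is just a check, not part of the procedure:

```shell
# Illustration only: with the defaults, GIT_REPOSITORY_DIR expands to a
# path under your home directory, e.g. ~/senzing.git/elasticsearch.
export GIT_ACCOUNT=senzing
export GIT_REPOSITORY=elasticsearch
export GIT_ACCOUNT_DIR=~/${GIT_ACCOUNT}.git
export GIT_REPOSITORY_DIR="${GIT_ACCOUNT_DIR}/${GIT_REPOSITORY}"
echo "${GIT_REPOSITORY_DIR}"
```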
Clone the repository.

```shell
mkdir -p ${GIT_ACCOUNT_DIR}
cd ${GIT_ACCOUNT_DIR}
git clone https://github.com/Senzing/elasticsearch.git
cd ${GIT_REPOSITORY_DIR}
```
🤔 Make sure the `SENZING_ENGINE_CONFIGURATION_JSON` environment variable is set for the Senzing installation that the data was loaded into earlier.
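As a sketch, a configuration for a SQLite-backed project might look like the following. The paths assume a default Senzing installation, and the database file name `G2C.db` mounted at `/db` is an assumption; adjust them to your setup:

```shell
# Hypothetical example: engine configuration for a SQLite-backed project.
# CONFIGPATH/RESOURCEPATH/SUPPORTPATH assume a default Senzing installation;
# the CONNECTION string assumes the database is mounted at /db as G2C.db.
export SENZING_ENGINE_CONFIGURATION_JSON='{
  "PIPELINE": {
    "CONFIGPATH": "/etc/opt/senzing",
    "RESOURCEPATH": "/opt/senzing/g2/resources",
    "SUPPORTPATH": "/opt/senzing/data"
  },
  "SQL": {
    "CONNECTION": "sqlite3://na:na@/db/G2C.db"
  }
}'
```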
🤔 Set Elasticsearch local environment variables. The hostname and port must point to the exposed port of the Elasticsearch instance. The index name can be anything that conforms to Elasticsearch's index naming rules.

```shell
export ELASTIC_HOSTNAME=senzing-elasticsearch
export ELASTIC_PORT=9200
export ELASTIC_INDEX_NAME=g2index
```
Build the Docker container.

```shell
cd ${GIT_REPOSITORY_DIR}
sudo docker build -t senzing/elasticsearch .
```
We will mount the SQLite database; make sure the `CONNECTION` string in our configuration JSON points to where it is mounted. In this example, the `CONNECTION` will need to point to the `/db` directory. We also need to run the container as part of the network that the ELK stack is running in. Example:

```shell
sudo --preserve-env docker run \
    --interactive \
    --rm \
    --tty \
    -e ELASTIC_HOSTNAME \
    -e ELASTIC_PORT \
    -e ELASTIC_INDEX_NAME \
    -e SENZING_ENGINE_CONFIGURATION_JSON \
    --network=senzing-network \
    --volume ~/senzing/var/sqlite:/db \
    senzing/elasticsearch
```
Here we won't need to mount a database; instead we can set our `CONNECTION` string in the configuration JSON to point to the external database. Example:

```shell
export SENZING_ENGINE_CONFIGURATION_JSON='{
  "PIPELINE": {
    "CONFIGPATH": "/etc/opt/senzing",
    "RESOURCEPATH": "/opt/senzing/g2/resources",
    "SUPPORTPATH": "/opt/senzing/data"
  },
  "SQL": {
    "CONNECTION": "postgresql://postgres:postgres@senzing-postgres:5432:G2"
  }
}'
```
Now we can run the container as part of the network that the ELK-stack is running in so that it can "see" the elasticsearch container. Example:

```shell
sudo --preserve-env docker run \
    --interactive \
    --rm \
    --tty \
    -e ELASTIC_HOSTNAME \
    -e ELASTIC_PORT \
    -e ELASTIC_INDEX_NAME \
    -e SENZING_ENGINE_CONFIGURATION_JSON \
    --network=senzing-network \
    senzing/elasticsearch
```
Open up Kibana in a web browser (default: localhost:5601).
Navigate to the Discover tab.
Create the index.

- If all was done correctly, a new screen with a button to "Create data view" should appear.
- Click this and, in the `index pattern` box, type the name of the index that was created. This was the `ELASTIC_INDEX_NAME` variable set earlier, and it should also appear on the right side of the popup.
- The `Name` field can be set but is not required.
Press "Save data view to Kibana" at the bottom of the screen. You can now view the created index and run searches. If fuzzy searches are needed, click on "Saved Query" and switch the language to Lucene, where you can view the Lucene syntax and how to do fuzzy searches.
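For example, a Lucene fuzzy search for name values within one edit of "smith" could be entered in the search bar like this; the `NAME_FULL` field name is an assumption about how the indexed entity documents are shaped:

```
NAME_FULL:smith~1
```

The `~` is Lucene's fuzzy operator, and the optional number after it sets the maximum edit distance (2 is the default when omitted).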