Entity Resolved Knowledge Graphs

This hands-on Python tutorial demonstrates how to integrate Senzing and Neo4j to construct an Entity Resolved Knowledge Graph:

  1. Use three datasets describing businesses in Las Vegas: ~85K records, ~2% duplicates.
  2. Run entity resolution in Senzing to resolve duplicate business names and addresses.
  3. Parse results to construct a knowledge graph in Neo4j.
  4. Analyze and visualize the entity resolved knowledge graph.
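To make steps 2 and 3 concrete, here is a toy sketch of the overall shape: group records that refer to the same business, then emit nodes and edges for a graph. This is not how Senzing works; it stands in exact matching on normalized text for real entity resolution. The record fields, the `E1`-style entity ids, and the `RESOLVES` relationship name are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input records; the field names are illustrative only.
records = [
    {"id": "A1", "name": "Joe's Diner, LLC", "address": "100 Main St."},
    {"id": "B7", "name": "JOES DINER LLC", "address": "100 main st"},
    {"id": "C3", "name": "Desert Bloom Spa", "address": "42 Oasis Ave"},
]


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivially different spellings match."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def resolve(records):
    """Group record ids that share a normalized (name, address) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        groups[key].append(rec["id"])
    # Assign a synthetic entity id to each group, in first-seen order.
    return {f"E{i}": ids for i, ids in enumerate(groups.values(), start=1)}


def to_graph(entities):
    """Emit entity nodes plus one edge from each entity to each source record."""
    nodes = [{"label": "Entity", "id": eid} for eid in entities]
    edges = [(eid, "RESOLVES", rid) for eid, ids in entities.items() for rid in ids]
    return nodes, edges


entities = resolve(records)
nodes, edges = to_graph(entities)
print(entities)
print(edges)
```

A real entity resolution engine also handles fuzzy matches, conflicting attributes, and relationships between entities, which is why the tutorial delegates that step to Senzing rather than to string normalization.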

We'll walk through example code based on Neo4j Desktop and the Graph Data Science (GDS) library to run Cypher queries on the graph, preparing data for downstream analysis and visualization with Jupyter, Pandas, Seaborn, and PyVis.

The code is simple to download and easy to follow, and it is presented so you can try it with your own data. The tutorial takes about 35 minutes to run.

Before and After

Why? As one example, the popularity of retrieval augmented generation (RAG) for making AI applications more robust has boosted recent interest in knowledge graphs (KGs). When the entities, relations, and properties in a KG leverage your domain-specific data to strengthen your AI app ... compliance issues and audits rush to the foreground.

TL;DR: this is about sense-making of the data coming from a connected world. During the transition from data integration to KG construction, you need to make sure the entities in your graph get resolved correctly. Otherwise, your AI app downstream will struggle with the kinds of details that make people get concerned, very concerned, very quickly: e.g., billing, deliveries, voter registration, crucial medical details, credit reporting, industrial safety, security, and so on.

Highly recommended:

Prerequisites

In this tutorial we'll work in two environments. The configuration and coding are at a level which should be comfortable for most people working in data science. You should be familiar with how to:

  • clone a public repo from GitHub
  • launch a server in the cloud
  • use Linux command lines
  • write some code in Python

Total estimated project time: 35 minutes.

Cloud computing budget: running Senzing in this tutorial cost a total of $0.04 USD.

Set up local environment

After cloning this repo, change into the ERKG directory and set up your local environment:

git clone https://github.com/DerwenAI/ERKG.git
cd ERKG

python3.11 -m venv venv
source venv/bin/activate

python3 -m pip install -U pip wheel setuptools
python3 -m pip install -r requirements.txt 

We're using Python 3.11 here, although this code should run with most of the recent Python 3.x versions.
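If you want to confirm the setup before launching Jupyter, a quick check along these lines may help. It is a minimal sketch, not part of the repo: the `environment_notes` helper is hypothetical, and it only reports whether the interpreter differs from the 3.11 used here and whether a virtual environment appears to be active.

```python
import sys

TESTED = (3, 11)  # the Python version this tutorial uses


def environment_notes(version=None, in_venv=None):
    """Return human-readable notes about the interpreter; empty means all good."""
    if version is None:
        version = sys.version_info[:2]
    if in_venv is None:
        # Inside a venv, sys.prefix points at the venv, not the base install.
        in_venv = sys.prefix != sys.base_prefix
    notes = []
    if tuple(version) != TESTED:
        notes.append(
            f"running Python {version[0]}.{version[1]}; "
            f"tutorial used {TESTED[0]}.{TESTED[1]}"
        )
    if not in_venv:
        notes.append("no virtual environment appears to be active")
    return notes


for note in environment_notes():
    print("note:", note)
```

A version note is informational only, since the code should run on most recent Python 3.x versions.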

Run the tutorial notebooks

First, launch Jupyter:

./venv/bin/jupyter lab

Then follow the steps in these notebooks, in order:

  1. examples/datasets.ipynb
  2. examples/graph.ipynb
  3. examples/impact.ipynb

You can view the results -- an interactive visualization of the entity resolved knowledge graph -- by loading examples/big_vegas.2.html in a web browser. The full HTML+JavaScript is large and may take several minutes to load.

Deleting data

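If you need to clear the database and start over, run this in Neo4j Desktop. In Neo4j Browser, prefix the query with :auto, since CALL { ... } IN TRANSACTIONS is only allowed in implicit (auto-commit) transactions: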

MATCH (n)
CALL {
  WITH n
  DETACH DELETE n
} IN TRANSACTIONS

See: https://neo4j.com/docs/cypher-manual/current/subqueries/subqueries-in-transactions/#delete-with-call-in-transactions
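If you would rather clear the database from Python, the same statement can be sent through a driver session. The sketch below is an assumption-laden outline, not code from this repo: `build_clear_query` and `clear_database` are hypothetical helpers, and `session` is assumed to be any object with a `run(query)` method that executes in auto-commit mode, as `Session.run` does in the official `neo4j` Python driver. The optional `OF n ROWS` suffix sets the batch size per transaction.

```python
def build_clear_query(batch_rows=None):
    """Return the delete-everything Cypher statement, optionally batched."""
    suffix = "" if batch_rows is None else f" OF {int(batch_rows)} ROWS"
    return (
        "MATCH (n)\n"
        "CALL {\n"
        "  WITH n\n"
        "  DETACH DELETE n\n"
        f"}} IN TRANSACTIONS{suffix}"
    )


def clear_database(session, batch_rows=None):
    """Send the statement via session.run, which must auto-commit.

    CALL { ... } IN TRANSACTIONS is rejected inside an explicit transaction,
    so do not call this from a managed transaction function.
    """
    return session.run(build_clear_query(batch_rows))


# Show the statement that would be sent with a batch size of 10,000 rows.
print(build_clear_query(10_000))
```

With the official driver, you would open a session on your own connection and pass it in, for example `clear_database(session, 10_000)`. The connection URI and credentials depend on your Neo4j Desktop setup and are not shown here.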

Kudos

Many thanks to: @akollegger, @brianmacy