My thesis project at Seavus is a question-answering system built on Neo4j that processes large volumes of unstructured text and serves results for natural language processing (NLP) tasks ranging from keyword extraction to sentiment analysis. These tasks are made possible in Neo4j through the GraphAware NLP framework.
For this framework to function properly, you first need to add a few JAR plugins from both GraphAware and Stanford CoreNLP. You will also want to download the APOC procedure library, which is freely available for Neo4j and required for some of the NLP tasks as well.
Requirements:
- Neo4j 3.5.14 (or earlier)
- graphaware-server-all
- nlp
- nlp-stanfordnlp
- stanford-english-corenlp
- apoc
Once the above plugins are placed in NEO4J_HOME/plugins/, these lines are required in the neo4j.conf file in NEO4J_HOME/conf/:
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
dbms.security.procedures.whitelist=ga.nlp.*,apoc.*
dbms.security.procedures.unrestricted=ga.nlp.*,apoc.*
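After restarting Neo4j with these settings, you can sanity-check that the GraphAware NLP and APOC procedures are registered. A minimal sketch from Python, assuming the official neo4j driver and the example credentials shown further below:

from neo4j import GraphDatabase

# List registered procedures and keep the GraphAware NLP and APOC ones.
# The URI and credentials are assumptions; adjust to your installation.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'gdb'))
with driver.session() as session:
    names = session.run(
        "CALL dbms.procedures() YIELD name "
        "WHERE name STARTS WITH 'ga.nlp' OR name STARTS WITH 'apoc' "
        "RETURN name"
    )
    for record in names:
        print(record['name'])
driver.close()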
You will also need to allocate an appropriate heap size and page cache for Neo4j:
dbms.memory.heap.initial_size=3000m
dbms.memory.heap.max_size=5000m
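The page cache mentioned above is configured with dbms.memory.pagecache.size; the value below is only an assumption, so size it to your data store:

dbms.memory.pagecache.size=4g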
To connect Python to Neo4j, change the credentials in text_processor.py and query_pipeline.py to match your setup.
Example:
uri = 'bolt://localhost:7687'
username = 'neo4j'
password = 'gdb'
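These values are handed to the official neo4j Python driver. A minimal sketch of the connection step, which may differ from the exact wiring inside text_processor.py:

from neo4j import GraphDatabase

uri = 'bolt://localhost:7687'
username = 'neo4j'
password = 'gdb'

# The credentials above map directly onto the driver constructor;
# how text_processor.py wires this internally may differ.
driver = GraphDatabase.driver(uri, auth=(username, password))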
The BBC dataset used in these experiments was taken from the examples here. The articles come as archives and can be processed with text_processor.py, which feeds the news articles into the graph database and defines the schema of the knowledge graph. There are additional methods to call for enrichment, keyword extraction, and text summarization.
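Under the hood, the annotation step in GraphAware NLP is the ga.nlp.annotate procedure, which builds the annotated subgraph for each article. A hypothetical sketch of that call from Python, assuming a text-processing pipeline has already been registered; the actual method names and node labels in text_processor.py may differ:

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'gdb'))

def annotate_article(tx, article_id, text):
    # ga.nlp.annotate yields an AnnotatedText node that is linked back
    # to the source node; the Article label here is an assumption.
    tx.run(
        "MATCH (a:Article) WHERE id(a) = $id "
        "CALL ga.nlp.annotate({text: $text, id: id(a)}) "
        "YIELD result "
        "MERGE (a)-[:HAS_ANNOTATED_TEXT]->(result)",
        id=article_id, text=text)

with driver.session() as session:
    session.write_transaction(annotate_article, 0, 'Example article text.')
driver.close()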
After the text is processed in Neo4j, simply test out the demo_pipeline with the query_pipeline in the same folder: