Merge branch 'DEV' into custom_error_LLMGraphBuilderException
praveshkumar1988 authored Jan 14, 2025
2 parents 816aa33 + e8a4617 commit e29fae7
Showing 88 changed files with 993 additions and 472 deletions.
34 changes: 9 additions & 25 deletions README.md
@@ -35,27 +35,8 @@ According to the environment we are configuring the models which is indicated by VI
EX:
```env
VITE_LLM_MODELS_PROD="openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash"
```
According to the environment, we configure the models indicated by the VITE_LLM_MODELS_PROD variable; you can configure models based on your needs.
EX:
```env
VITE_LLM_MODELS_PROD="openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash"
```
If you only want OpenAI:
```env
VITE_LLM_MODELS_PROD="diffbot,openai-gpt-3.5,openai-gpt-4o"
OPENAI_API_KEY="your-openai-key"
```

If you only want Diffbot:
```env
VITE_LLM_MODELS_PROD="diffbot"
DIFFBOT_API_KEY="your-diffbot-key"
```

You can then run Docker Compose to build and start all components:
```bash
docker-compose up --build
```
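To keep the stack running in the background, the same command also works detached:
```bash
docker-compose up --build -d
```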
@@ -88,7 +69,6 @@ VITE_CHAT_MODES=""
If, however, you want only vector mode or only graph mode, you can do that by specifying the mode in the env:
```env
VITE_CHAT_MODES="vector,graph"
```
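For example, to expose only the vector mode:
```env
VITE_CHAT_MODES="vector"
```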

#### Running Backend and Frontend separately (dev environment)
@@ -105,7 +85,7 @@ Alternatively, you can run the backend and frontend separately:
```

- For the backend:
1. Create the backend/.env file by copy/pasting the backend/example.env. To streamline the initial setup and testing of the application, you can preconfigure user credentials directly within the .env file. This bypasses the login dialog and allows you to immediately connect with a predefined user.
1. Create the backend/.env file by copy/pasting the backend/example.env. To streamline the initial setup and testing of the application, you can preconfigure user credentials directly within the backend .env file. This bypasses the login dialog and allows you to immediately connect with a predefined user.
- **NEO4J_URI**:
- **NEO4J_USERNAME**:
- **NEO4J_PASSWORD**:
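A minimal sketch of such a backend/.env, with placeholder credentials for a local Neo4j instance (the remaining variables from example.env are omitted here):
```env
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your-neo4j-password"
```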
@@ -139,6 +119,8 @@ Allow unauthenticated request : Yes
## ENV
| Env Variable Name | Mandatory/Optional | Default Value | Description |
|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
| | | | |
| **BACKEND ENV** | | | |
| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2, openai, vertexai) |
| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
@@ -152,7 +134,13 @@ Allow unauthenticated request : Yes
| LANGCHAIN_API_KEY | Optional | | API key for Langchain |
| LANGCHAIN_PROJECT | Optional | | Project for Langchain |
| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable Langchain tracing |
| GCS_FILE_CACHE | Optional | False | If set to True, will save the files to process into GCS. If set to False, will save the files locally |
| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for Langchain API |
| ENTITY_EMBEDDING | Optional | False | If set to True, embeddings will be added for each entity in the database |
| LLM_MODEL_CONFIG_ollama_<model_name> | Optional | | Set the Ollama config as model_name,model_local_url for local deployments |
| RAGAS_EMBEDDING_MODEL | Optional | openai | Embedding model used by the Ragas evaluation framework |
| | | | |
| **FRONTEND ENV** | | | |
| VITE_BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API |
| VITE_BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
| VITE_REACT_APP_SOURCES | Mandatory | local,youtube,wiki,s3 | List of input sources that will be available |
@@ -163,10 +151,6 @@ Allow unauthenticated request : Yes
| VITE_GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
| VITE_LLM_MODELS_PROD | Optional | openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash | To distinguish models based on the environment (PROD or DEV) |
| VITE_LLM_MODELS | Optional | 'diffbot,openai_gpt_3.5,openai_gpt_4o,openai_gpt_4o_mini,gemini_1.5_pro,gemini_1.5_flash,azure_ai_gpt_35,azure_ai_gpt_4o,ollama_llama3,groq_llama3_70b,anthropic_claude_3_5_sonnet' | Supported Models For the application
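As an illustration of the LLM_MODEL_CONFIG_ollama_<model_name> format above, a local Ollama deployment might be configured as follows (the model name and URL are placeholders; 11434 is Ollama's default port):
```env
LLM_MODEL_CONFIG_ollama_llama3="llama3,http://localhost:11434"
```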

## LLMs Supported
1. OpenAI
Expand Down
3 changes: 2 additions & 1 deletion backend/example.env
@@ -43,4 +43,5 @@ LLM_MODEL_CONFIG_bedrock_claude_3_5_sonnet="model_name,aws_access_key_id,aws_sec
LLM_MODEL_CONFIG_ollama_llama3="model_name,model_local_url"
YOUTUBE_TRANSCRIPT_PROXY="https://user:pass@domain:port"
EFFECTIVE_SEARCH_RATIO=5

GRAPH_CLEANUP_MODEL="openai_gpt_4o"
CHUNKS_TO_BE_PROCESSED="50"
28 changes: 18 additions & 10 deletions backend/score.py
@@ -13,7 +13,7 @@
from src.graphDB_dataAccess import graphDBdataAccess
from src.graph_query import get_graph_results,get_chunktext_results
from src.chunkid_entities import get_entities_from_chunkids
from src.post_processing import create_vector_fulltext_indexes, create_entity_embedding
from src.post_processing import create_vector_fulltext_indexes, create_entity_embedding, graph_schema_consolidation
from sse_starlette.sse import EventSourceResponse
from src.communities import create_communities
from src.neighbours import get_neighbour_nodes
@@ -187,7 +187,8 @@ async def extract_knowledge_graph_from_file(
allowedRelationship=Form(None),
language=Form(None),
access_token=Form(None),
retry_condition=Form(None)
retry_condition=Form(None),
additional_instructions=Form(None)
):
"""
Calls 'extract_graph_from_file' in a new thread to create Neo4jGraph from a
@@ -210,22 +211,22 @@ async def extract_knowledge_graph_from_file(
if source_type == 'local file':
merged_file_path = os.path.join(MERGED_DIR,file_name)
logging.info(f'File path:{merged_file_path}')
uri_latency, result = await extract_graph_from_file_local_file(uri, userName, password, database, model, merged_file_path, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_local_file(uri, userName, password, database, model, merged_file_path, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 's3 bucket' and source_url:
uri_latency, result = await extract_graph_from_file_s3(uri, userName, password, database, model, source_url, aws_access_key_id, aws_secret_access_key, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_s3(uri, userName, password, database, model, source_url, aws_access_key_id, aws_secret_access_key, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'web-url':
uri_latency, result = await extract_graph_from_web_page(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_web_page(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'youtube' and source_url:
uri_latency, result = await extract_graph_from_file_youtube(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_youtube(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'Wikipedia' and wiki_query:
uri_latency, result = await extract_graph_from_file_Wikipedia(uri, userName, password, database, model, wiki_query, language, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_Wikipedia(uri, userName, password, database, model, wiki_query, language, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'gcs bucket' and gcs_bucket_name:
uri_latency, result = await extract_graph_from_file_gcs(uri, userName, password, database, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_gcs(uri, userName, password, database, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)
else:
return create_api_response('Failed',message='source_type is other than accepted source')
extract_api_time = time.time() - start_time
@@ -334,10 +335,15 @@ async def post_processing(uri=Form(), userName=Form(), password=Form(), database
await asyncio.to_thread(create_entity_embedding, graph)
api_name = 'post_processing/create_entity_embedding'
logging.info(f'Entity Embeddings created')

if "graph_schema_consolidation" in tasks :
await asyncio.to_thread(graph_schema_consolidation, graph)
api_name = 'post_processing/graph_schema_consolidation'
logging.info(f'Updated nodes and relationship labels')

if "enable_communities" in tasks:
api_name = 'create_communities'
await asyncio.to_thread(create_communities, uri, userName, password, database)

logging.info(f'created communities')
graph = create_graph_database_connection(uri, userName, password, database)
@@ -347,9 +353,11 @@ async def post_processing(uri=Form(), userName=Form(), password=Form(), database
if count_response:
count_response = [{"filename": filename, **counts} for filename, counts in count_response.items()]
logging.info(f'Updated source node with community related counts')


end = time.time()
elapsed_time = end - start
json_obj = {'api_name': api_name, 'db_url': uri, 'userName':userName, 'database':database, 'tasks':tasks, 'logging_time': formatted_time(datetime.now(timezone.utc)), 'elapsed_api_time':f'{elapsed_time:.2f}'}
json_obj = {'api_name': api_name, 'db_url': uri, 'userName':userName, 'database':database, 'logging_time': formatted_time(datetime.now(timezone.utc)), 'elapsed_api_time':f'{elapsed_time:.2f}'}
logger.log_struct(json_obj)
return create_api_response('Success', data=count_response, message='All tasks completed successfully')

10 changes: 6 additions & 4 deletions backend/src/llm.py
@@ -13,6 +13,7 @@
from langchain_community.chat_models import ChatOllama
import boto3
import google.auth
from src.shared.constants import ADDITIONAL_INSTRUCTIONS

def get_llm(model: str):
"""Retrieve the specified language model based on the model name."""
@@ -160,14 +161,14 @@ def get_chunk_id_as_doc_metadata(chunkId_chunkDoc_list):


async def get_graph_document_list(
llm, combined_chunk_document_list, allowedNodes, allowedRelationship
llm, combined_chunk_document_list, allowedNodes, allowedRelationship, additional_instructions=None
):
futures = []
graph_document_list = []
if "diffbot_api_key" in dir(llm):
llm_transformer = llm
else:
if "get_name" in dir(llm) and llm.get_name() != "ChatOenAI" or llm.get_name() != "ChatVertexAI" or llm.get_name() != "AzureChatOpenAI":
if "get_name" in dir(llm) and llm.get_name() not in ("ChatOpenAI", "ChatVertexAI", "AzureChatOpenAI"):
node_properties = False
relationship_properties = False
else:
@@ -180,6 +181,7 @@ async def get_graph_document_list(
allowed_nodes=allowedNodes,
allowed_relationships=allowedRelationship,
ignore_tool_usage=True,
additional_instructions=ADDITIONAL_INSTRUCTIONS+ (additional_instructions if additional_instructions else "")
)

if isinstance(llm,DiffbotGraphTransformer):
@@ -189,7 +191,7 @@
return graph_document_list


async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowedRelationship):
async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowedRelationship, additional_instructions=None):
try:
llm, model_name = get_llm(model)
combined_chunk_document_list = get_combined_chunks(chunkId_chunkDoc_list)
@@ -204,7 +206,7 @@ async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowed
allowedRelationship = allowedRelationship.split(',')

graph_document_list = await get_graph_document_list(
llm, combined_chunk_document_list, allowedNodes, allowedRelationship
llm, combined_chunk_document_list, allowedNodes, allowedRelationship, additional_instructions
)
return graph_document_list
except Exception as e:
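As a usage sketch of the additional_instructions parameter threaded through this diff, a hypothetical caller of get_graph_from_llm might look like this; it assumes `chunks` was produced by the existing chunk-creation step, and the signature follows the one shown above:
```python
import asyncio

# Hypothetical sketch: `chunks` is assumed to be the chunkId/chunk-document
# list produced earlier in the pipeline.
graph_documents = asyncio.run(
    get_graph_from_llm(
        model="openai_gpt_4o",
        chunkId_chunkDoc_list=chunks,
        allowedNodes="Person,Company",             # split on ',' inside the function
        allowedRelationship="WORKS_AT,LOCATED_IN",
        additional_instructions="Prefer canonical entity names over aliases.",
    )
)
```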
