Merge branch 'DEV' into custom_error_LLMGraphBuilderException
praveshkumar1988 authored Jan 14, 2025
2 parents 816aa33 + e8a4617 commit e29fae7
Showing 88 changed files with 993 additions and 472 deletions.
34 changes: 9 additions & 25 deletions README.md
@@ -35,27 +35,8 @@ According to the environment we are configuring the models which is indicated by VI
EX:
```env
VITE_LLM_MODELS_PROD="openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash"
```
According to the environment, we configure the models indicated by the VITE_LLM_MODELS_PROD variable; you can configure models based on your needs.
EX:
```env
VITE_LLM_MODELS_PROD="openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash"
```
If you only want OpenAI:
```env
VITE_LLM_MODELS_PROD="diffbot,openai-gpt-3.5,openai-gpt-4o"
OPENAI_API_KEY="your-openai-key"
```

If you only want Diffbot:
```env
VITE_LLM_MODELS_PROD="diffbot"
DIFFBOT_API_KEY="your-diffbot-key"
```

You can then run Docker Compose to build and start all components:
```bash
docker-compose up --build
```
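To keep the stack running in the background, the same command also works detached:
```bash
docker-compose up --build -d
```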
@@ -88,7 +69,6 @@ VITE_CHAT_MODES=""
If, however, you want only vector mode or only graph mode, you can do that by specifying the mode in the env:
```env
VITE_CHAT_MODES="vector,graph"
```
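For example, to expose only the vector mode:
```env
VITE_CHAT_MODES="vector"
```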

#### Running Backend and Frontend separately (dev environment)
@@ -105,7 +85,7 @@ Alternatively, you can run the backend and frontend separately:
```

- For the backend:
1. Create the backend/.env file by copy/pasting the backend/example.env. To streamline the initial setup and testing of the application, you can preconfigure user credentials directly within the .env file. This bypasses the login dialog and allows you to immediately connect with a predefined user.
1. Create the backend/.env file by copy/pasting the backend/example.env. To streamline the initial setup and testing of the application, you can preconfigure user credentials directly within the backend .env file. This bypasses the login dialog and allows you to immediately connect with a predefined user.
- **NEO4J_URI**:
- **NEO4J_USERNAME**:
- **NEO4J_PASSWORD**:
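A minimal sketch of such a backend/.env, with placeholder credentials for a local Neo4j instance (the remaining variables from example.env are omitted here):
```env
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your-neo4j-password"
```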
@@ -139,6 +119,8 @@ Allow unauthenticated request : Yes
## ENV
| Env Variable Name | Mandatory/Optional | Default Value | Description |
|-------------------------|--------------------|---------------|--------------------------------------------------------------------------------------------------|
| | | | |
| **BACKEND ENV** | | | |
| EMBEDDING_MODEL | Optional | all-MiniLM-L6-v2 | Model for generating the text embedding (all-MiniLM-L6-v2, openai, vertexai) |
| IS_EMBEDDING | Optional | true | Flag to enable text embedding |
| KNN_MIN_SCORE | Optional | 0.94 | Minimum score for KNN algorithm |
@@ -152,7 +134,13 @@ Allow unauthenticated request : Yes
| LANGCHAIN_API_KEY | Optional | | API key for Langchain |
| LANGCHAIN_PROJECT | Optional | | Project for Langchain |
| LANGCHAIN_TRACING_V2 | Optional | true | Flag to enable Langchain tracing |
| GCS_FILE_CACHE | Optional | False | If set to True, will save the files to process into GCS. If set to False, will save the files locally |
| LANGCHAIN_ENDPOINT | Optional | https://api.smith.langchain.com | Endpoint for Langchain API |
| ENTITY_EMBEDDING | Optional | False | If set to True, embeddings will be added for each entity in the database |
| LLM_MODEL_CONFIG_ollama_<model_name> | Optional | | Set the Ollama config as model_name,model_local_url for local deployments |
| RAGAS_EMBEDDING_MODEL | Optional | openai | Embedding model used by the Ragas evaluation framework |
| | | | |
| **FRONTEND ENV** | | | |
| VITE_BACKEND_API_URL | Optional | http://localhost:8000 | URL for backend API |
| VITE_BLOOM_URL | Optional | https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true | URL for Bloom visualization |
| VITE_REACT_APP_SOURCES | Mandatory | local,youtube,wiki,s3 | List of input sources that will be available |
@@ -163,10 +151,6 @@ Allow unauthenticated request : Yes
| VITE_GOOGLE_CLIENT_ID | Optional | | Client ID for Google authentication |
| VITE_LLM_MODELS_PROD | Optional | openai_gpt_4o,openai_gpt_4o_mini,diffbot,gemini_1.5_flash | To distinguish models based on the environment (PROD or DEV) |
| VITE_LLM_MODELS | Optional | 'diffbot,openai_gpt_3.5,openai_gpt_4o,openai_gpt_4o_mini,gemini_1.5_pro,gemini_1.5_flash,azure_ai_gpt_35,azure_ai_gpt_4o,ollama_llama3,groq_llama3_70b,anthropic_claude_3_5_sonnet' | Supported Models For the application
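As an illustration of the LLM_MODEL_CONFIG_ollama_<model_name> format above, a local Ollama deployment might be configured as follows (the model name and URL are placeholders; 11434 is Ollama's default port):
```env
LLM_MODEL_CONFIG_ollama_llama3="llama3,http://localhost:11434"
```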

## LLMs Supported
1. OpenAI
Expand Down
3 changes: 2 additions & 1 deletion backend/example.env
@@ -43,4 +43,5 @@ LLM_MODEL_CONFIG_bedrock_claude_3_5_sonnet="model_name,aws_access_key_id,aws_sec
LLM_MODEL_CONFIG_ollama_llama3="model_name,model_local_url"
YOUTUBE_TRANSCRIPT_PROXY="https://user:pass@domain:port"
EFFECTIVE_SEARCH_RATIO=5

GRAPH_CLEANUP_MODEL="openai_gpt_4o"
CHUNKS_TO_BE_PROCESSED="50"
28 changes: 18 additions & 10 deletions backend/score.py
@@ -13,7 +13,7 @@
from src.graphDB_dataAccess import graphDBdataAccess
from src.graph_query import get_graph_results,get_chunktext_results
from src.chunkid_entities import get_entities_from_chunkids
from src.post_processing import create_vector_fulltext_indexes, create_entity_embedding
from src.post_processing import create_vector_fulltext_indexes, create_entity_embedding, graph_schema_consolidation
from sse_starlette.sse import EventSourceResponse
from src.communities import create_communities
from src.neighbours import get_neighbour_nodes
@@ -187,7 +187,8 @@ async def extract_knowledge_graph_from_file(
allowedRelationship=Form(None),
language=Form(None),
access_token=Form(None),
retry_condition=Form(None)
retry_condition=Form(None),
additional_instructions=Form(None)
):
"""
Calls 'extract_graph_from_file' in a new thread to create Neo4jGraph from a
@@ -210,22 +211,22 @@ async def extract_knowledge_graph_from_file(
if source_type == 'local file':
merged_file_path = os.path.join(MERGED_DIR,file_name)
logging.info(f'File path:{merged_file_path}')
uri_latency, result = await extract_graph_from_file_local_file(uri, userName, password, database, model, merged_file_path, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_local_file(uri, userName, password, database, model, merged_file_path, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 's3 bucket' and source_url:
uri_latency, result = await extract_graph_from_file_s3(uri, userName, password, database, model, source_url, aws_access_key_id, aws_secret_access_key, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_s3(uri, userName, password, database, model, source_url, aws_access_key_id, aws_secret_access_key, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'web-url':
uri_latency, result = await extract_graph_from_web_page(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_web_page(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'youtube' and source_url:
uri_latency, result = await extract_graph_from_file_youtube(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_youtube(uri, userName, password, database, model, source_url, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'Wikipedia' and wiki_query:
uri_latency, result = await extract_graph_from_file_Wikipedia(uri, userName, password, database, model, wiki_query, language, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_Wikipedia(uri, userName, password, database, model, wiki_query, language, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)

elif source_type == 'gcs bucket' and gcs_bucket_name:
uri_latency, result = await extract_graph_from_file_gcs(uri, userName, password, database, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, file_name, allowedNodes, allowedRelationship, retry_condition)
uri_latency, result = await extract_graph_from_file_gcs(uri, userName, password, database, model, gcs_project_id, gcs_bucket_name, gcs_bucket_folder, gcs_blob_filename, access_token, file_name, allowedNodes, allowedRelationship, retry_condition, additional_instructions)
else:
return create_api_response('Failed',message='source_type is other than accepted source')
extract_api_time = time.time() - start_time
@@ -334,10 +335,15 @@ async def post_processing(uri=Form(), userName=Form(), password=Form(), database
await asyncio.to_thread(create_entity_embedding, graph)
api_name = 'post_processing/create_entity_embedding'
logging.info(f'Entity Embeddings created')

if "graph_schema_consolidation" in tasks :
await asyncio.to_thread(graph_schema_consolidation, graph)
api_name = 'post_processing/graph_schema_consolidation'
logging.info(f'Updated nodes and relationship labels')

if "enable_communities" in tasks:
api_name = 'create_communities'
await asyncio.to_thread(create_communities, uri, userName, password, database)

logging.info(f'created communities')
graph = create_graph_database_connection(uri, userName, password, database)
@@ -347,9 +353,11 @@ async def post_processing(uri=Form(), userName=Form(), password=Form(), database
if count_response:
count_response = [{"filename": filename, **counts} for filename, counts in count_response.items()]
logging.info(f'Updated source node with community related counts')


end = time.time()
elapsed_time = end - start
json_obj = {'api_name': api_name, 'db_url': uri, 'userName':userName, 'database':database, 'tasks':tasks, 'logging_time': formatted_time(datetime.now(timezone.utc)), 'elapsed_api_time':f'{elapsed_time:.2f}'}
json_obj = {'api_name': api_name, 'db_url': uri, 'userName':userName, 'database':database, 'logging_time': formatted_time(datetime.now(timezone.utc)), 'elapsed_api_time':f'{elapsed_time:.2f}'}
logger.log_struct(json_obj)
return create_api_response('Success', data=count_response, message='All tasks completed successfully')

10 changes: 6 additions & 4 deletions backend/src/llm.py
@@ -13,6 +13,7 @@
from langchain_community.chat_models import ChatOllama
import boto3
import google.auth
from src.shared.constants import ADDITIONAL_INSTRUCTIONS

def get_llm(model: str):
"""Retrieve the specified language model based on the model name."""
@@ -160,14 +161,14 @@ def get_chunk_id_as_doc_metadata(chunkId_chunkDoc_list):


async def get_graph_document_list(
llm, combined_chunk_document_list, allowedNodes, allowedRelationship
llm, combined_chunk_document_list, allowedNodes, allowedRelationship, additional_instructions=None
):
futures = []
graph_document_list = []
if "diffbot_api_key" in dir(llm):
llm_transformer = llm
else:
if "get_name" in dir(llm) and llm.get_name() != "ChatOenAI" or llm.get_name() != "ChatVertexAI" or llm.get_name() != "AzureChatOpenAI":
if "get_name" in dir(llm) and llm.get_name() not in ("ChatOpenAI", "ChatVertexAI", "AzureChatOpenAI"):
node_properties = False
relationship_properties = False
else:
@@ -180,6 +181,7 @@ async def get_graph_document_list(
allowed_nodes=allowedNodes,
allowed_relationships=allowedRelationship,
ignore_tool_usage=True,
additional_instructions=ADDITIONAL_INSTRUCTIONS+ (additional_instructions if additional_instructions else "")
)

if isinstance(llm,DiffbotGraphTransformer):
@@ -189,7 +191,7 @@
return graph_document_list


async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowedRelationship):
async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowedRelationship, additional_instructions=None):
try:
llm, model_name = get_llm(model)
combined_chunk_document_list = get_combined_chunks(chunkId_chunkDoc_list)
@@ -204,7 +206,7 @@ async def get_graph_from_llm(model, chunkId_chunkDoc_list, allowedNodes, allowed
allowedRelationship = allowedRelationship.split(',')

graph_document_list = await get_graph_document_list(
llm, combined_chunk_document_list, allowedNodes, allowedRelationship
llm, combined_chunk_document_list, allowedNodes, allowedRelationship, additional_instructions
)
return graph_document_list
except Exception as e:
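As a usage sketch of the additional_instructions parameter threaded through this diff, a hypothetical caller of get_graph_from_llm might look like this; it assumes `chunks` was produced by the existing chunk-creation step, and the signature follows the one shown above:
```python
import asyncio

# Hypothetical sketch: `chunks` is assumed to be the chunkId/chunk-document
# list produced earlier in the pipeline.
graph_documents = asyncio.run(
    get_graph_from_llm(
        model="openai_gpt_4o",
        chunkId_chunkDoc_list=chunks,
        allowedNodes="Person,Company",             # split on ',' inside the function
        allowedRelationship="WORKS_AT,LOCATED_IN",
        additional_instructions="Prefer canonical entity names over aliases.",
    )
)
```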
