This application implements an advanced company search and ranking system using a hybrid search approach that combines vector similarity search (via the pgvector extension) with traditional full-text search in PostgreSQL. This dual approach captures both semantic relevance and keyword accuracy, making it particularly effective for company discovery and ranking.
The system leverages a sophisticated hybrid search architecture that:
- Uses OpenAI embeddings to convert company descriptions into vector representations
- Implements PostgreSQL's full-text search capabilities for keyword matching
- Combines both approaches with a weighted scoring system for optimal ranking
- Utilizes GPT-4o for intelligent search result processing and summarization
- Utilizes Redis for caching frequently accessed data to improve performance
This hybrid approach provides more accurate and contextually relevant results compared to traditional keyword-only search systems.
- Backend: FastAPI
- Database: PostgreSQL with pgvector extension, Redis for caching
- Vector Embeddings: OpenAI API
- LLM Processing: GPT-4o
- Frontend: React
- Containerization: Docker
- ORM: SQLAlchemy
- Hybrid search combining vector similarity and full-text search
- Real-time company ranking based on search relevance
- Company information management (add/search) and retrieval
- LLM-powered tool calling
- Docker-based application deployment
- Docker and Docker Compose
- OpenAI API key (for embeddings generation and LLM processing)
- Create a `.env` file in the Backend directory with the following variables:

      DATABASE_NAME=your_database_name
      DATABASE_USER=your_database_user
      DATABASE_PASSWORD=your_database_password
      DATABASE_URL=localhost
      DATABASE_PORT=5432
      OPENAI_API_KEY=your_openai_api_key
- Clone the repository:

      git clone https://github.com/Syed007Hassan/Investment-Search.git
      cd Backend
- Build and start the containers:

      docker-compose up --build
- Access the application:
  - Frontend: http://localhost:3000
  - Backend API: http://localhost:8000
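As a minimal sketch of how the backend might assemble a connection string from the `.env` variables above (the helper name is an assumption, not taken from the repository; note that this `.env` uses `DATABASE_URL` for the host):

```python
import os


def build_database_url() -> str:
    """Hypothetical helper: builds a PostgreSQL connection URL from the
    environment variables defined in the .env file above."""
    user = os.environ["DATABASE_USER"]
    password = os.environ["DATABASE_PASSWORD"]
    host = os.environ.get("DATABASE_URL", "localhost")  # .env uses DATABASE_URL as the host
    port = os.environ.get("DATABASE_PORT", "5432")
    name = os.environ["DATABASE_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```
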
Note: The default docker-compose configuration does not include initial data loading. If you want to load initial sample data, you can modify the command in docker-compose.yml to:
command: >
bash -c "
python scripts/load_data.py &&
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
"
If you don't want to load initial sample data, you can modify the command in docker-compose.yml to:
command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
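For context, a sketch of where this command sits in docker-compose.yml (the service name and other fields are assumptions, not taken from the repository):

```yaml
services:
  backend:            # assumed service name
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```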
The system combines two complementary search methodologies:
- Vector Similarity Search (Semantic Search)
  - Uses OpenAI's text-embedding-3-small model to convert company descriptions into 1536-dimensional vectors
  - Stores these vectors in PostgreSQL using the pgvector extension
  - Enables semantic understanding of search queries
- Full-Text Search (Keyword Search)
  - Utilizes PostgreSQL's built-in full-text search capabilities
  - Performs exact and partial keyword matching
Let's break down this hybrid search query step by step:
- First CTE (Common Table Expression) - Vector Search:

      WITH vector_search AS (
          SELECT id, RANK() OVER (ORDER BY embedding <=> :embedding) AS rank
          FROM "Company"
          ORDER BY embedding <=> :embedding
          LIMIT 20
      )

  - Creates a temporary result set named `vector_search`
  - `embedding <=> :embedding`: calculates the cosine distance between each stored embedding and the query embedding
  - `RANK() OVER`: assigns ranks by similarity (lower distance = better rank)
  - `LIMIT 20`: keeps the 20 most similar vectors
  - Cosine distance ranges from 0 to 2, where 0 means the vectors are identical and 2 means they point in opposite directions
- Second CTE - Full-Text Search:

      fulltext_search AS (
          SELECT id, RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
          FROM "Company", plainto_tsquery('english', :query) query
          WHERE to_tsvector('english', content) @@ query
          ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
          LIMIT 20
      )

  - Creates another temporary result set named `fulltext_search` (note the `AS rank` alias, which the final query relies on)
  - `to_tsvector('english', content)`: converts the content into searchable tokens
  - `plainto_tsquery('english', :query)`: converts the search query into search terms
  - `@@`: the text search match operator
  - `ts_rank_cd`: calculates a text search relevancy score (higher score = better match)
  - `LIMIT 20`: keeps the 20 best text matches
- Final Combined Query:

      SELECT
          COALESCE(vector_search.id, fulltext_search.id) AS id,
          COALESCE(1.0 / (:k + vector_search.rank), 0.0) +
          COALESCE(1.0 / (:k + fulltext_search.rank), 0.0) AS score
      FROM vector_search
      FULL OUTER JOIN fulltext_search ON vector_search.id = fulltext_search.id
      ORDER BY score DESC
      LIMIT 20

  - `FULL OUTER JOIN`: combines results from both searches, keeping every match from either one
  - `COALESCE` for IDs: ensures matches from either search method are captured
  - `COALESCE` for scoring: handles items that match only one search type (the missing score defaults to 0)
  - The score follows the Reciprocal Rank Fusion formula 1 / (k + rank), with k = 60 as a normalization factor that:
    - Prevents division by zero
    - Normalizes scores to a comparable range
    - Reduces the impact of small rank differences
  - `ORDER BY score DESC`: ranks the final results by combined score
  - `LIMIT 20`: returns the top 20 combined results
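The fused scoring in the final query can be reproduced in plain Python. This is a sketch (function and parameter names are assumptions) of the same rank-fusion logic, given two rank mappings keyed by company id:

```python
def rrf_fuse(vector_ranks: dict, text_ranks: dict, k: int = 60, limit: int = 20):
    """Combine two rank lists with Reciprocal Rank Fusion: 1 / (k + rank).

    Mirrors the SQL above: ids from either input are kept (FULL OUTER JOIN),
    and a missing rank contributes 0 to the score (COALESCE).
    """
    scores = {}
    for id_ in set(vector_ranks) | set(text_ranks):
        score = 0.0
        if id_ in vector_ranks:
            score += 1.0 / (k + vector_ranks[id_])
        if id_ in text_ranks:
            score += 1.0 / (k + text_ranks[id_])
        scores[id_] = score
    # Highest combined score first, capped at `limit` results
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```
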
Ranking Process:

- Vector ranking:
  - Lower cosine distance = better rank
  - Score = 1 / (60 + rank)
  - Example: rank 1 gives 1/61 ≈ 0.0164
  - Lower distance is better because:
    - Cosine distance measures how far apart two vectors point in high-dimensional space
    - Distance of 0: vectors are identical (perfect semantic match)
    - Distance of 1: vectors are perpendicular (unrelated content)
    - Distance of 2: vectors point in opposite directions (opposite meaning)
    - Therefore, smaller distances indicate closer semantic similarity
- Text ranking:
  - Higher ts_rank_cd = better rank
  - Score = 1 / (60 + rank)
  - Example: rank 2 gives 1/62 ≈ 0.0161
  - Higher ts_rank_cd is better because it:
    - Counts the number of matching terms
    - Considers term frequency (how often terms appear)
    - Weighs term proximity (how close terms are to each other)
    - Accounts for term importance in the document
  - Therefore, more and higher-quality matches yield higher scores
- Final ranking:
  - Combined score = vector_score + text_score
  - Higher combined score = better overall match
  - Normalization ensures a fair combination despite the different scoring scales
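The cosine-distance ranges described above can be checked with a small, self-contained sketch of the quantity pgvector's `<=>` operator computes (function name is illustrative):

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity. Ranges from 0 (identical
    direction) through 1 (perpendicular) to 2 (opposite direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```
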
Consider the following example to illustrate the ranking process:

- Item A: vector_rank=1, text_rank=2
  - Score = 1/61 + 1/62 ≈ 0.0325
- Item B: vector_rank=5, text_rank=1
  - Score = 1/65 + 1/61 ≈ 0.0318
- Result: Item A ranks higher than Item B
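The arithmetic behind this example can be reproduced directly (Item A's sum rounds to 0.0325, Item B's to 0.0318):

```python
k = 60  # RRF normalization factor used throughout

# Item A: vector_rank=1, text_rank=2
score_a = 1 / (k + 1) + 1 / (k + 2)
# Item B: vector_rank=5, text_rank=1
score_b = 1 / (k + 5) + 1 / (k + 1)

print(round(score_a, 4))  # 0.0325
print(round(score_b, 4))  # 0.0318
print(score_a > score_b)  # True: Item A ranks higher
```
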
This hybrid approach ensures that results are ranked considering both semantic similarity (vectors) and keyword relevance (text), providing a more comprehensive search result.
- Prepare OpenShift Resources
  - Convert the Docker Compose configuration to OpenShift-compatible resources using Kompose
  - Ensure all required images are accessible to OpenShift
- Configure Storage
  - Set up persistent volumes for the PostgreSQL database
  - Set up persistent volumes for the Redis cache
  - Configure volume claims for both databases
- Configure Environment
  - Create secrets for sensitive data (OPENAI_API_KEY, database credentials)
  - Create ConfigMaps for application configuration
  - Set up network policies if required
- Deploy Components
  - Deploy the PostgreSQL database with the pgvector extension
  - Deploy the Redis cache service
  - Deploy the backend FastAPI application
  - Deploy the frontend React application
- Configure Access
  - Create routes for the frontend and backend services
  - Configure TLS/SSL if required
  - Set up any required network policies
Note: Ensure all components have appropriate resource limits and health checks configured.