
Data Pipeline with Microservices Architecture

English version    Portuguese version

This project implements a data pipeline using a microservices-oriented architecture. The pipeline processes PDF files uploaded via a Streamlit-based frontend, performs semantic chunking, stores processed data in a data lake, and loads vectorized data into a PostgreSQL database with pgvector support. Communication between services is facilitated using RabbitMQ.

Project Structure

```plaintext
├── docs/                           # Project documents (e.g., diagrams)
├── docker-compose.yml              # Docker Compose configuration
├── .env.development                # Environment variable template
├── .gitignore
├── handle_create_knowledge_base/   # Database initializer
├── handle_frontend/                # Frontend logic
├── handle_loading_knowledge/       # Loads knowledge into the database
├── handle_semantic_chunking/       # Processes and chunks knowledge
├── README.md                       # Main documentation
└── README.pt-br.md                 # Main documentation in Portuguese
```

Features

  • Frontend: A Streamlit-based web application for uploading PDF files.
  • Message Broker: RabbitMQ for handling inter-service communication.
  • Semantic Chunking: A microservice that performs semantic chunking of PDF documents and stores the processed data in a data lake.
  • Data Storage: A PostgreSQL database with pgvector extension for managing vectorized data.
  • Data Loading: A microservice that loads vectorized data into the PostgreSQL database.
  • Dockerized Deployment: All components are containerized and orchestrated using Docker Compose.

Architecture Overview

Architecture Diagram

1. Frontend Service

  • Framework: Streamlit.
  • Function: Uploads PDF files and sends them to RabbitMQ for processing.
  • Communication: Sends messages to RabbitMQ.
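
The exact publishing code lives in handle_frontend/; the sketch below is only an illustration of how an uploaded PDF could be handed to RabbitMQ with pika, assuming the RABBIT_* variables from the Environment Variables section and the knowledge_processing routing key used as queue name below.

```python
# Hypothetical sketch of the frontend's publish step (not the project's exact code).
import os
import pika

def publish_pdf(file_name: str, pdf_bytes: bytes) -> None:
    credentials = pika.PlainCredentials(os.environ["RABBIT_USER"], os.environ["RABBIT_PASS"])
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(
            host="rabbitmq",
            port=int(os.environ.get("RABBIT_PORT", 5672)),
            credentials=credentials,
        )
    )
    channel = connection.channel()
    # Routing key taken from the queue name listed in this README ("knowledge_processing").
    channel.basic_publish(
        exchange=os.environ["RABBIT_EXCHANGE"],
        routing_key="knowledge_processing",
        body=pdf_bytes,
        properties=pika.BasicProperties(headers={"file_name": file_name}),
    )
    connection.close()
```

In the Streamlit app, the bytes would typically come from something like st.file_uploader(...).getvalue().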

2. RabbitMQ Service

  • Role: Facilitates message passing between services.
  • Queues:
    • knowledge_processing: For semantic chunking tasks.
    • knowledge_loading: For data loading tasks.
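
As a sketch of how these queues might be wired to the exchange (queue names from this section; the exchange type, durability flags, and routing keys are assumptions):

```python
# Hypothetical topology setup with pika: one direct exchange bound to the two queues above.
import os
import pika

credentials = pika.PlainCredentials(os.environ["RABBIT_USER"], os.environ["RABBIT_PASS"])
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq", credentials=credentials)
)
channel = connection.channel()

exchange = os.environ["RABBIT_EXCHANGE"]
channel.exchange_declare(exchange=exchange, exchange_type="direct", durable=True)

for queue in ("knowledge_processing", "knowledge_loading"):
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=queue)

connection.close()
```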

3. Semantic Chunking Service

  • Task: Processes PDF documents into semantically meaningful chunks.
  • Storage: Saves trusted and refined data versions in the data lake.
  • Models:
    • Embedding Model: sentence-transformers/all-MiniLM-L6-v2.
    • Tokenizer: jinaai/jina-embeddings-v3.
  • Communication: Publishes a message to RabbitMQ for the loading service.
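
The actual logic lives in handle_semantic_chunking/; the sketch below only illustrates the general idea of semantic chunking with the embedding model named above: consecutive sentences are grouped until their similarity to the running chunk drops below a threshold. The threshold value and the sentence splitting are assumptions, and the jinaai/jina-embeddings-v3 tokenizer could additionally be used to cap chunk length in tokens.

```python
# Illustrative semantic chunking: group consecutive sentences while they stay
# semantically close to the current chunk. Threshold and splitting are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [0]
    for i in range(1, len(sentences)):
        # Cosine similarity between this sentence and the centroid of the current chunk.
        centroid = np.mean(embeddings[current], axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(np.dot(centroid, embeddings[i])) < threshold:
            chunks.append(" ".join(sentences[j] for j in current))
            current = [i]
        else:
            current.append(i)
    chunks.append(" ".join(sentences[j] for j in current))
    return chunks
```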

4. Loading Chunking Service

  • Task: Loads processed data into a PostgreSQL database with vector support (pgvector).
  • Input: Reads data from the data lake.
  • Output: Stores vectorized data in the database.
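
A rough sketch of what this loading step might look like with psycopg2 is shown below. The table name (knowledge_chunks), column names, and the JSON layout of the data-lake files are assumptions; the host matches the database service name from docker-compose.yml, and the ::vector cast relies on the pgvector extension.

```python
# Hypothetical loader: read refined chunks from the data lake and insert them
# into a pgvector-backed table. Table/column names and JSON layout are assumptions.
import json
import os
import psycopg2

def load_chunks(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # e.g. [{"content": "...", "embedding": [0.1, ...]}, ...]

    conn = psycopg2.connect(
        host="database",
        user=os.environ["DATABASE_USER"],
        password=os.environ["DATABASE_PASS"],
        dbname=os.environ["DATABASE_NAME"],
    )
    with conn, conn.cursor() as cur:
        for record in records:
            embedding = "[" + ",".join(str(x) for x in record["embedding"]) + "]"
            cur.execute(
                "INSERT INTO knowledge_chunks (content, embedding) VALUES (%s, %s::vector)",
                (record["content"], embedding),
            )
    conn.close()
```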

5. Database Service

  • Technology: PostgreSQL with pgvector extension.
  • Purpose: Stores vectorized data for querying and retrieval.
  • Configuration:
    • User: ${DATABASE_USER}
    • Password: ${DATABASE_PASS}
    • Database: ${DATABASE_NAME}
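
The handle_create_knowledge_base/ service initializes this database. A minimal sketch of what that initialization might involve is shown below; the table name and schema are assumptions, and the vector dimension of 384 matches the output size of sentence-transformers/all-MiniLM-L6-v2.

```python
# Hypothetical initialization: enable pgvector and create a table for the chunks.
# Table name and schema are assumptions; 384 is the all-MiniLM-L6-v2 embedding size.
import os
import psycopg2

conn = psycopg2.connect(
    host="database",
    user=os.environ["DATABASE_USER"],
    password=os.environ["DATABASE_PASS"],
    dbname=os.environ["DATABASE_NAME"],
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS knowledge_chunks (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            embedding VECTOR(384)
        )
        """
    )
conn.close()
```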

Docker Compose Configuration

The docker-compose.yml orchestrates the following services:

Services

  • database: PostgreSQL with pgvector support.
  • rabbitmq: RabbitMQ with a management interface.
  • data_chunking: Handles semantic chunking of documents.
  • loading_chunking: Loads chunked data into the database.
  • streamlit_app: Frontend for uploading PDF files.

Volumes

  • pgdata: Persistent storage for PostgreSQL.
  • knowledge_data: Shared storage for the data lake.
  • rabbitmq_logs: Persistent RabbitMQ logs.

Networks

  • knowledge_pipeline_net: Shared network for inter-service communication.

Environment Variables

The following environment variables are required for the pipeline:

  • DATABASE_USER: PostgreSQL username.
  • DATABASE_PASS: PostgreSQL password.
  • DATABASE_NAME: PostgreSQL database name.
  • RABBIT_USER: RabbitMQ username.
  • RABBIT_PASS: RabbitMQ password.
  • RABBIT_PORT: RabbitMQ port (default: 5672).
  • RABBIT_EXCHANGE: RabbitMQ exchange name.

Usage

Step 1: Set Up Environment Variables

Create a .env file in the root directory and define the required environment variables.
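
For reference, a .env file following the variable list above might look like this (all values are placeholders, not defaults shipped with the project):

```plaintext
DATABASE_USER=postgres
DATABASE_PASS=change-me
DATABASE_NAME=knowledge_base
RABBIT_USER=rabbit
RABBIT_PASS=change-me
RABBIT_PORT=5672
RABBIT_EXCHANGE=knowledge_exchange
```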

Step 2: Build and Run the Services

Run the following command to start the pipeline:

docker-compose --env-file ./.env up --build

Step 3: Access the Frontend

Open your browser and navigate to http://localhost:8501 to access the Streamlit application.


Workflow

  1. PDF Upload: The user uploads a PDF via the frontend.
  2. Message Passing: The file is sent to RabbitMQ (knowledge_processing queue).
  3. Semantic Chunking:
    • The chunking service processes the document.
    • Trusted and refined data versions are saved in the data lake.
    • A message is sent to the knowledge_loading queue.
  4. Data Loading: The loading service writes vectorized data to the PostgreSQL database.
  5. Database Querying: The stored data can be queried for downstream applications (see the retrieval sketch below).
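
For downstream retrieval, a similarity query against pgvector might look like the sketch below; the table and column names are the same assumptions used in the earlier sketches, and <=> is pgvector's cosine-distance operator.

```python
# Hypothetical similarity search: embed a query and fetch the closest chunks.
# Table/column names are assumptions; <=> is pgvector's cosine distance operator.
import os
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def search(query: str, top_k: int = 5) -> list[str]:
    embedding = "[" + ",".join(str(x) for x in model.encode(query)) + "]"
    conn = psycopg2.connect(
        host="localhost",
        user=os.environ["DATABASE_USER"],
        password=os.environ["DATABASE_PASS"],
        dbname=os.environ["DATABASE_NAME"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM knowledge_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (embedding, top_k),
        )
        rows = cur.fetchall()
    conn.close()
    return [row[0] for row in rows]
```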

Prerequisites

  • Docker: Ensure Docker is installed on your machine.
  • Docker Compose: Ensure Docker Compose is installed.

Contributing

Contributions are welcome! Feel free to submit issues and pull requests.
