
Data Pipeline with Microservices Architecture

English version    Portuguese version

This project implements a data pipeline using a microservices-oriented architecture. The pipeline processes PDF files uploaded via a Streamlit-based frontend, performs semantic chunking, stores processed data in a data lake, and loads vectorized data into a PostgreSQL database with pgvector support. Communication between services is facilitated using RabbitMQ.

Project Structure

```plaintext
├── docs/                           # Project documents (e.g., diagrams)
├── docker-compose.yml              # Docker Compose configuration
├── .env.development                # Environment variable template
├── .gitignore
├── handle_create_knowledge_base/   # Database initializer
├── handle_frontend/                # Frontend logic
├── handle_loading_knowledge/       # Loads knowledge into the database
├── handle_semantic_chunking/       # Processes and chunks knowledge
├── README.md                       # Main documentation
└── README.pt-br.md                 # Main documentation in Portuguese
```

Features

  • Frontend: A Streamlit-based web application for uploading PDF files.
  • Message Broker: RabbitMQ for handling inter-service communication.
  • Semantic Chunking: A microservice that performs semantic chunking of PDF documents and stores the processed data in a data lake.
  • Data Storage: A PostgreSQL database with pgvector extension for managing vectorized data.
  • Data Loading: A microservice that loads vectorized data into the PostgreSQL database.
  • Dockerized Deployment: All components are containerized and orchestrated using Docker Compose.

Architecture Overview

Architecture Diagram

1. Frontend Service

  • Framework: Streamlit.
  • Function: Uploads PDF files and sends them to RabbitMQ for processing.
  • Communication: Sends messages to RabbitMQ.
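
The exact publishing code lives in handle_frontend/; the sketch below is only an illustration of how an uploaded PDF could be handed to RabbitMQ with pika, assuming the RABBIT_* variables from the Environment Variables section and the knowledge_processing routing key used as queue name below.

```python
# Hypothetical sketch of the frontend's publish step (not the project's exact code).
import os
import pika

def publish_pdf(file_name: str, pdf_bytes: bytes) -> None:
    credentials = pika.PlainCredentials(os.environ["RABBIT_USER"], os.environ["RABBIT_PASS"])
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(
            host="rabbitmq",
            port=int(os.environ.get("RABBIT_PORT", 5672)),
            credentials=credentials,
        )
    )
    channel = connection.channel()
    # Routing key taken from the queue name listed in this README ("knowledge_processing").
    channel.basic_publish(
        exchange=os.environ["RABBIT_EXCHANGE"],
        routing_key="knowledge_processing",
        body=pdf_bytes,
        properties=pika.BasicProperties(headers={"file_name": file_name}),
    )
    connection.close()
```

In the Streamlit app, the bytes would typically come from something like st.file_uploader(...).getvalue().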

2. RabbitMQ Service

  • Role: Facilitates message passing between services.
  • Queues:
    • knowledge_processing: For semantic chunking tasks.
    • knowledge_loading: For data loading tasks.
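
As a sketch of how these queues might be wired to the exchange (queue names from this section; the exchange type, durability flags, and routing keys are assumptions):

```python
# Hypothetical topology setup with pika: one direct exchange bound to the two queues above.
import os
import pika

credentials = pika.PlainCredentials(os.environ["RABBIT_USER"], os.environ["RABBIT_PASS"])
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="rabbitmq", credentials=credentials)
)
channel = connection.channel()

exchange = os.environ["RABBIT_EXCHANGE"]
channel.exchange_declare(exchange=exchange, exchange_type="direct", durable=True)

for queue in ("knowledge_processing", "knowledge_loading"):
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=queue)

connection.close()
```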

3. Semantic Chunking Service

  • Task: Processes PDF documents into semantically meaningful chunks.
  • Storage: Saves trusted and refined data versions in the data lake.
  • Models:
    • Embedding Model: sentence-transformers/all-MiniLM-L6-v2.
    • Tokenizer: jinaai/jina-embeddings-v3.
  • Communication: Publishes a message to RabbitMQ for the loading service.
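
The actual logic lives in handle_semantic_chunking/; the sketch below only illustrates the general idea of semantic chunking with the embedding model named above: consecutive sentences are grouped until their similarity to the running chunk drops below a threshold. The threshold value and the sentence splitting are assumptions, and the jinaai/jina-embeddings-v3 tokenizer could additionally be used to cap chunk length in tokens.

```python
# Illustrative semantic chunking: group consecutive sentences while they stay
# semantically close to the current chunk. Threshold and splitting are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [0]
    for i in range(1, len(sentences)):
        # Cosine similarity between this sentence and the centroid of the current chunk.
        centroid = np.mean(embeddings[current], axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(np.dot(centroid, embeddings[i])) < threshold:
            chunks.append(" ".join(sentences[j] for j in current))
            current = [i]
        else:
            current.append(i)
    chunks.append(" ".join(sentences[j] for j in current))
    return chunks
```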

4. Loading Chunking Service

  • Task: Loads processed data into a PostgreSQL database with vector support (pgvector).
  • Input: Reads data from the data lake.
  • Output: Stores vectorized data in the database.
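
A rough sketch of what this loading step might look like with psycopg2 is shown below. The table name (knowledge_chunks), column names, and the JSON layout of the data-lake files are assumptions; the host matches the database service name from docker-compose.yml, and the ::vector cast relies on the pgvector extension.

```python
# Hypothetical loader: read refined chunks from the data lake and insert them
# into a pgvector-backed table. Table/column names and JSON layout are assumptions.
import json
import os
import psycopg2

def load_chunks(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # e.g. [{"content": "...", "embedding": [0.1, ...]}, ...]

    conn = psycopg2.connect(
        host="database",
        user=os.environ["DATABASE_USER"],
        password=os.environ["DATABASE_PASS"],
        dbname=os.environ["DATABASE_NAME"],
    )
    with conn, conn.cursor() as cur:
        for record in records:
            embedding = "[" + ",".join(str(x) for x in record["embedding"]) + "]"
            cur.execute(
                "INSERT INTO knowledge_chunks (content, embedding) VALUES (%s, %s::vector)",
                (record["content"], embedding),
            )
    conn.close()
```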

5. Database Service

  • Technology: PostgreSQL with pgvector extension.
  • Purpose: Stores vectorized data for querying and retrieval.
  • Configuration:
    • User: ${DATABASE_USER}
    • Password: ${DATABASE_PASS}
    • Database: ${DATABASE_NAME}
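
The handle_create_knowledge_base/ service initializes this database. A minimal sketch of what that initialization might involve is shown below; the table name and schema are assumptions, and the vector dimension of 384 matches the output size of sentence-transformers/all-MiniLM-L6-v2.

```python
# Hypothetical initialization: enable pgvector and create a table for the chunks.
# Table name and schema are assumptions; 384 is the all-MiniLM-L6-v2 embedding size.
import os
import psycopg2

conn = psycopg2.connect(
    host="database",
    user=os.environ["DATABASE_USER"],
    password=os.environ["DATABASE_PASS"],
    dbname=os.environ["DATABASE_NAME"],
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS knowledge_chunks (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            embedding VECTOR(384)
        )
        """
    )
conn.close()
```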

Docker Compose Configuration

The docker-compose.yml orchestrates the following services:

Services

  • database: PostgreSQL with pgvector support.
  • rabbitmq: RabbitMQ with a management interface.
  • data_chunking: Handles semantic chunking of documents.
  • loading_chunking: Loads chunked data into the database.
  • streamlit_app: Frontend for uploading PDF files.

Volumes

  • pgdata: Persistent storage for PostgreSQL.
  • knowledge_data: Shared storage for the data lake.
  • rabbitmq_logs: Persistent RabbitMQ logs.

Networks

  • knowledge_pipeline_net: Shared network for inter-service communication.

Environment Variables

The following environment variables are required for the pipeline:

  • DATABASE_USER: PostgreSQL username.
  • DATABASE_PASS: PostgreSQL password.
  • DATABASE_NAME: PostgreSQL database name.
  • RABBIT_USER: RabbitMQ username.
  • RABBIT_PASS: RabbitMQ password.
  • RABBIT_PORT: RabbitMQ port (default: 5672).
  • RABBIT_EXCHANGE: RabbitMQ exchange name.

Usage

Step 1: Set Up Environment Variables

Create a .env file in the root directory and define the required environment variables.
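
For reference, a .env file following the variable list above might look like this (all values are placeholders, not defaults shipped with the project):

```plaintext
DATABASE_USER=postgres
DATABASE_PASS=change-me
DATABASE_NAME=knowledge_base
RABBIT_USER=rabbit
RABBIT_PASS=change-me
RABBIT_PORT=5672
RABBIT_EXCHANGE=knowledge_exchange
```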

Step 2: Build and Run the Services

Run the following command to start the pipeline:

docker-compose --env-file ./.env up --build

Step 3: Access the Frontend

Open your browser and navigate to http://localhost:8501 to access the Streamlit application.


Workflow

  1. PDF Upload: The user uploads a PDF via the frontend.
  2. Message Passing: The file is sent to RabbitMQ (knowledge_processing queue).
  3. Semantic Chunking:
    • The chunking service processes the document.
    • Trusted and refined data versions are saved in the data lake.
    • A message is sent to the knowledge_loading queue.
  4. Data Loading: The loading service writes vectorized data to the PostgreSQL database.
  5. Database Querying: The stored data can be queried for downstream applications (see the retrieval sketch below).
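
For downstream retrieval, a similarity query against pgvector might look like the sketch below; the table and column names are the same assumptions used in the earlier sketches, and <=> is pgvector's cosine-distance operator.

```python
# Hypothetical similarity search: embed a query and fetch the closest chunks.
# Table/column names are assumptions; <=> is pgvector's cosine distance operator.
import os
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def search(query: str, top_k: int = 5) -> list[str]:
    embedding = "[" + ",".join(str(x) for x in model.encode(query)) + "]"
    conn = psycopg2.connect(
        host="localhost",
        user=os.environ["DATABASE_USER"],
        password=os.environ["DATABASE_PASS"],
        dbname=os.environ["DATABASE_NAME"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM knowledge_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (embedding, top_k),
        )
        rows = cur.fetchall()
    conn.close()
    return [row[0] for row in rows]
```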

Prerequisites

  • Docker: Ensure Docker is installed on your machine.
  • Docker Compose: Ensure Docker Compose is installed.

Contributing

Contributions are welcome! Feel free to submit issues and pull requests.
