This project implements a data pipeline with a microservices-oriented architecture. The pipeline processes PDF files uploaded via a Streamlit-based frontend, performs semantic chunking, stores processed data in a data lake, and loads vectorized data into a PostgreSQL database with pgvector support. Services communicate through RabbitMQ.
## Project Structure

```plaintext
├── docs/                         # Project documents (e.g., diagrams)
├── docker-compose.yml            # Docker Compose definition
├── .env.development              # Environment variable template
├── .gitignore
├── handle_create_knowledge_base/ # Database initializer
├── handle_frontend/              # Frontend logic
├── handle_loading_knowledge/     # Loads knowledge into the database
├── handle_semantic_chunking/     # Processes and chunks knowledge
├── README.md                     # Main documentation
└── README.pt-br.md               # Main documentation in Portuguese
```
## Architecture Overview

- Frontend: A Streamlit-based web application for uploading PDF files.
- Message Broker: RabbitMQ for handling inter-service communication.
- Semantic Chunking: A microservice that performs semantic chunking of PDF documents and stores the processed data in a data lake.
- Data Storage: A PostgreSQL database with the `pgvector` extension for managing vectorized data.
- Data Loading: A microservice that loads vectorized data into the PostgreSQL database.
- Dockerized Deployment: All components are containerized and orchestrated using Docker Compose.
### Frontend

- Framework: Streamlit.
- Function: Uploads PDF files and sends them to RabbitMQ for processing.
- Communication: Publishes messages to RabbitMQ (see the sketch below).
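A minimal sketch of the upload-and-publish step, assuming `pika` as the RabbitMQ client; the hostname, credentials, and direct-to-queue publishing are illustrative stand-ins for the service's actual configuration:

```python
import pika
import streamlit as st

st.title("PDF Upload")
uploaded = st.file_uploader("Choose a PDF", type="pdf")

if uploaded is not None:
    # Hostname and credentials are placeholders; the real service reads
    # RABBIT_USER, RABBIT_PASS, and RABBIT_PORT from the environment.
    params = pika.ConnectionParameters(
        host="rabbitmq",
        credentials=pika.PlainCredentials("guest", "guest"),
    )
    with pika.BlockingConnection(params) as conn:
        channel = conn.channel()
        channel.queue_declare(queue="knowledge_processing", durable=True)
        # Publish the raw PDF bytes for the chunking service to consume.
        channel.basic_publish(
            exchange="",
            routing_key="knowledge_processing",
            body=uploaded.getvalue(),
        )
    st.success("File queued for processing.")
```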
### Message Broker (RabbitMQ)

- Role: Facilitates message passing between services.
- Queues (declared as sketched below):
  - `knowledge_processing`: for semantic chunking tasks.
  - `knowledge_loading`: for data loading tasks.
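The topology might be declared along these lines; the exchange name `knowledge_exchange`, the `direct` exchange type, and the queue-name routing keys are all assumptions, since only the two queue names are given above:

```python
import pika

params = pika.ConnectionParameters(host="rabbitmq")
with pika.BlockingConnection(params) as conn:
    channel = conn.channel()
    # "knowledge_exchange" stands in for the configured RABBIT_EXCHANGE value.
    channel.exchange_declare(
        exchange="knowledge_exchange", exchange_type="direct", durable=True
    )
    for queue in ("knowledge_processing", "knowledge_loading"):
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(
            queue=queue, exchange="knowledge_exchange", routing_key=queue
        )
```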
### Semantic Chunking Service

- Task: Processes PDF documents into semantically meaningful chunks.
- Storage: Saves trusted and refined data versions in the data lake.
- Models:
  - Embedding model: `sentence-transformers/all-MiniLM-L6-v2`
  - Tokenizer: `jinaai/jina-embeddings-v3`
- Communication: Publishes a message to RabbitMQ for the loading service (a chunk-and-embed sketch follows).
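A sketch of the chunk-and-embed step using the two models named above; the fixed token-window heuristic in `chunk_text` is a stand-in for the service's actual semantic chunking logic:

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    # Fixed token windows stand in for the service's semantic boundaries.
    tokens = tokenizer.encode(text, add_special_tokens=False)
    windows = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return [tokenizer.decode(w) for w in windows]

document_text = "Text extracted from the uploaded PDF..."  # placeholder
chunks = chunk_text(document_text)
embeddings = model.encode(chunks)  # one 384-dimensional vector per chunk
```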
### Data Loading Service

- Task: Loads processed data into a PostgreSQL database with vector support (`pgvector`).
- Input: Reads data from the data lake.
- Output: Stores vectorized data in the database (see the sketch below).
### Database

- Technology: PostgreSQL with the `pgvector` extension.
- Purpose: Stores vectorized data for querying and retrieval (a schema sketch follows).
- Configuration:
  - User: `${DATABASE_USER}`
  - Password: `${DATABASE_PASS}`
  - Database: `${DATABASE_NAME}`
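A sketch of how the initializer (`handle_create_knowledge_base`) might set up the schema; the `knowledge` table and its column names are assumptions, while the 384 dimensions match the output size of `all-MiniLM-L6-v2`:

```python
import os
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    user=os.environ["DATABASE_USER"],
    password=os.environ["DATABASE_PASS"],
    dbname=os.environ["DATABASE_NAME"],
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS knowledge (
            id        SERIAL PRIMARY KEY,
            content   TEXT NOT NULL,
            embedding VECTOR(384)  -- all-MiniLM-L6-v2 output size
        );
        """
    )
conn.close()
```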
## Docker Compose

The `docker-compose.yml` orchestrates the following services:

- `database`: PostgreSQL with pgvector support.
- `rabbitmq`: RabbitMQ with a management interface.
- `data_chunking`: Handles semantic chunking of documents.
- `loading_chunking`: Loads chunked data into the database.
- `streamlit_app`: Frontend for uploading PDF files.

Volumes:

- `pgdata`: Persistent storage for PostgreSQL.
- `knowledge_data`: Shared storage for the data lake.
- `rabbitmq_logs`: Persistent RabbitMQ logs.

Networks:

- `knowledge_pipeline_net`: Shared network for inter-service communication.
## Environment Variables

The following environment variables are required for the pipeline:
| Variable Name     | Description                    |
|-------------------|--------------------------------|
| `DATABASE_USER`   | PostgreSQL username.           |
| `DATABASE_PASS`   | PostgreSQL password.           |
| `DATABASE_NAME`   | PostgreSQL database name.      |
| `RABBIT_USER`     | RabbitMQ username.             |
| `RABBIT_PASS`     | RabbitMQ password.             |
| `RABBIT_PORT`     | RabbitMQ port (default: 5672). |
| `RABBIT_EXCHANGE` | RabbitMQ exchange name.        |
## Running the Pipeline

Create a `.env` file in the root directory and define the required environment variables.
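An example with placeholder values (see also `.env.development` for the project's own template; the exchange name here is arbitrary):

```plaintext
DATABASE_USER=postgres
DATABASE_PASS=changeme
DATABASE_NAME=knowledge_base
RABBIT_USER=guest
RABBIT_PASS=guest
RABBIT_PORT=5672
RABBIT_EXCHANGE=knowledge_exchange
```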
Run the following command to start the pipeline:

```bash
docker-compose --env-file ./.env up --build
```
Open your browser and navigate to http://localhost:8501 to access the Streamlit application.
## Pipeline Workflow

1. PDF Upload: The user uploads a PDF via the frontend.
2. Message Passing: The file is sent to RabbitMQ (`knowledge_processing` queue).
3. Semantic Chunking:
   - The chunking service processes the document.
   - Trusted and refined data versions are saved in the data lake.
   - A message is sent to the `knowledge_loading` queue.
4. Data Loading: The loading service writes vectorized data to the PostgreSQL database.
5. Database Querying: The stored data can be queried for downstream applications (see the sketch below).
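A sketch of such a downstream query, reusing the assumed `knowledge` table from the database section and pgvector's cosine-distance operator `<=>`:

```python
import os
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vec = model.encode("What does the document say about pricing?").tolist()

conn = psycopg2.connect(
    host="localhost",
    user=os.environ["DATABASE_USER"],
    password=os.environ["DATABASE_PASS"],
    dbname=os.environ["DATABASE_NAME"],
)
with conn, conn.cursor() as cur:
    # `<=>` is pgvector's cosine-distance operator; smaller is more similar.
    cur.execute(
        "SELECT content FROM knowledge ORDER BY embedding <=> %s::vector LIMIT 5",
        (str(query_vec),),
    )
    for (content,) in cur.fetchall():
        print(content)
conn.close()
```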
## Prerequisites

- Docker: Ensure Docker is installed on your machine.
- Docker Compose: Ensure Docker Compose is installed.
## Contributing

Contributions are welcome! Feel free to submit issues and pull requests.