diff --git a/README.md b/README.md index 743a5c9ff8..254e3c6b5b 100644 --- a/README.md +++ b/README.md @@ -10,9 +10,6 @@ Github stars - - GitHub issues - Join Discord @@ -25,7 +22,7 @@ All-in-one solution for search, recommendations, and RAG -[![Trieve dashboard preivew](https://cdn.trieve.ai/dashboard.webp)](https://dashboard.trieve.ai) +[![Trieve dashboard preview](https://cdn.trieve.ai/landing-tabs/dark-mode-docsearch.webp)](https://dashboard.trieve.ai) ## Quick Links diff --git a/pdf2md/.env.dist b/pdf2md/.env.dist index 6fd9317487..982f95e615 100644 --- a/pdf2md/.env.dist +++ b/pdf2md/.env.dist @@ -26,6 +26,6 @@ LLM_MODEL=gpt-4o-mini # PDF2MD HTTP API server API_KEY=admin -# Chunkr - Get your API key from https://chunkr.ai +# OPTIONAL: Chunkr - Get your API key from https://chunkr.ai CHUNKR_API_URL=https://api.chunkr.ai CHUNKR_API_KEY=********************* \ No newline at end of file diff --git a/pdf2md/CONTRIBUTING.md b/pdf2md/CONTRIBUTING.md index 12501286a7..f74cb7bf87 100644 --- a/pdf2md/CONTRIBUTING.md +++ b/pdf2md/CONTRIBUTING.md @@ -4,10 +4,16 @@ ```bash cd server -cp .env.dist .env +cp .env.dist ./server/.env ``` -## Run dep processes +You will need to replace `LLM_API_KEY` with your key for OpenRouter, OpenAI, LiteLLM, or whichever OpenAI-compatible API you are using with the `LLM_BASE_URL`. + +If you want to support Chunkr, you can get an API key for their service from [chunkr.ai](https://chunkr.ai) and set it as the value for `CHUNKR_API_KEY`. + +## Run dependency services + +This will start MinIO S3, Clickhouse, and Redis. 
```bash docker compose --profile dev up -d diff --git a/pdf2md/LICENSE b/pdf2md/LICENSE new file mode 100644 index 0000000000..d13cc4b26a --- /dev/null +++ b/pdf2md/LICENSE @@ -0,0 +1,19 @@ +The MIT License (MIT) + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/pdf2md/README.md b/pdf2md/README.md index 1a3a507134..0dca7e99b9 100644 --- a/pdf2md/README.md +++ b/pdf2md/README.md @@ -1,153 +1,93 @@ -# Contributing to PDF2MD +

+ Trieve Logo +

+

+API reference | Meet a Maintainer | Discord | Matrix | humans@trieve.ai + +

-## Project Setup +

+ + Github stars + + + Join Discord + + + Join Matrix + +

-### Setup ENV's +

🦀 PDF2MD 🦀

-```bash -cd server -cp ../.env.dist .env -``` +

+ Self-hostable API server and pipeline for converting PDFs to markdown using thrifty vision language models like GPT-4o-mini and gemini-flash-1.5. +

-### Start docker dependency services +

Written in Rust. Try at pdf2md.trieve.ai.

-- redis -- s3 -- clickhouse-db +[![PDF2MD service preview](https://cdn.trieve.ai/pdf2md/pdf2md-preview.webp)](https://pdf2md.trieve.ai) -```bash -docker compose up -d -``` +## The Stack -### Run Server + Workers +There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with `docker-compose` or `kube` following the [SELF-HOSTING guide here](https://github.com/devflowinc/trieve/tree/main/pdf2md/SELF-HOSTING.md). -Strongly recommend using tmux or another multiplex system to handle the different proceses. +- [minijinja templates](https://github.com/mitsuhiko/minijinja) for the [UI](https://pdf2md.trieve.ai) + - there was no way I was going to write more JSX +- [PDFObject](https://github.com/pipwerks/pdfobject) to view PDFs in the [demo UI](https://pdf2md.trieve.ai). +- [actix/actix-web](https://github.com/actix/actix-web) for the HTTP server +- [fun redis queue macro system](https://github.com/devflowinc/trieve/blob/main/pdf2md/server/src/operators/redis.rs#L7-L62) for worker pattern async processing + - redis queues are a core part of our infra for Trieve, but we made our system a lot more repeatable with this macro + - there will be a future release of this macro in an isolated crate +- [Clickhouse](https://github.com/ClickHouse/ClickHouse) for task storage + - we have had a surprising amount of Postgres issues (especially write locks) building Trieve, so Clickhouse as the primary data store here is cool +- [MinIO S3](https://github.com/minio/minio) for file storage -```bash -cargo watch -x run #HTTP server -cargo run --bin supervisor-worker -cargo run --bin chunk-worker -``` +## How does PDF2MD work? -### Run tailwindcss server for demo UI +Workers horizontally scale on-demand to handle high-volume periods. Usually `chunk-worker` needs to scale before `supervisor-worker`. 
Pages for a given `Task` stream in as the `chunk-worker` calls out to the LLM to get markdown for them. -``` -npx tailwindcss -i ./static/in.css -o ./static/output.css --watch -``` +### 1. HTTP server -### Testing using the CLI +1. HTTP server receives a base64 encoded PDF and decodes it +2. Creates `FileTask` for document in ClickHouse +3. Adds `FileTask` along with the base64 encoded file to `files_to_process` queue in Redis -Make your changes then use the following to run: +### 2. Supervisor Worker -```bash -cd cli -cargo run -- help #or other command instead of help -``` +1. `supervisor-worker` continuously polls the `files_to_process` Redis queue until it grabs a `FileTask` and its base64 +2. Decodes the base64 into a PDF and puts the PDF into S3 +3. Splits the PDF into pages, converts them to JPEGs +4. Puts each JPEG page image into S3 +5. Pushes a `ChunkingTask` for each page into the `files_to_chunk` Redis queue -## Deploying +### 3. Chunk Worker -### Docker Compose +1. `chunk-worker` continuously polls the `files_to_chunk` Redis queue until it grabs a `ChunkingTask` +2. Gets its page image from S3 +3. Sends the image to the LLM provider at `LLM_BASE_URL` along with the `prompt` and `model` on the request to get markdown +4. Updates the task with the markdown for the page -Use the docker-compose-prod.yaml file to deploy the application. +## Why Make This? -```bash -docker compose up -f docker-compose-prod.yaml -d -``` +Trieve has used [Apache Tika](https://tika.apache.org/) to process various filetypes for the past year, which means that files with complex layouts and diagrams have been poorly ingested. -You can either chose to build locally or pull the pre-built images from the docker hub. +We saw [OmniAI](https://github.com/getomni-ai) launch [zerox](https://github.com/getomni-ai/zerox) and show that 4o-mini was a viable and cheap way to handle these filetypes and decided it was time to integrate something better than Tika into Trieve. 
-#### Build Options -##### Build On Machine: +We previously lightly contributed to [Chunkr](https://github.com/lumina-ai-inc/chunkr), which is a more advanced system that leverages layout detection and dedicated OCR models to process documents, but still felt the need to build something ourselves since it was a bit complex to work into Trieve's local dev and self-hosting setup. zerox's approach using just a VLM was ideal and the path we went with. -```bash -docker compose up -f docker-compose-prod.yaml -d --build -``` +We wrote our own API server and pipeline using Rust, Redis queues, and Clickhouse in the Trieve-style to achieve this. Try it using our demo UI hosted at [pdf2md.trieve.ai](https://pdf2md.trieve.ai). -##### Use Pre-built Images: -```bash -docker compose up -f docker-compose-prod.yaml -d --pull always -``` +## Roadmap -#### Setup Caddy reverse proxy (optional) +Please contribute if you can! We could use help 🙏. -Setup a Caddyfile with the following content: +1. Rename everything from `chunk` to `page` because we eventually decided that we would only deal with PDF --> Markdown conversion and not chunking. Consider using [chonkie](https://github.com/bhavnicksm/chonkie) with the markdown output for this. +2. Use [Clickhouse MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree) instead of updating `Task`s in Clickhouse as that's more correct. +3. `supervisor-worker` can get overwhelmed when it receives a large PDF as splitting into pages can take a while. There should be something better here. +4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier. +5. Users should be able to point `PDF2MD` at an S3 bucket and let it process all of the files automatically instead of having to send each file 1 by 1 🤮. 
-```bash -# Global options -{ - email developer@example.com -} +--- -# Define a site block for pdftomd.example.com -pdftomd.example.com { - reverse_proxy localhost:8081 -} -``` - -Start the caddy reverse proxy. This should also handle your ssl - -```bash -sudo systemctl reload caddy.service -``` - -### Kubernetes - -```bash -kubectl apply -f k8s/ -``` - -You can now access pdf2md within the kubernetes cluster at `http://pdf2md.default.svc.cluster.local` -To access it from outside the cluster: -- You can use a service of type `LoadBalancer` or `NodePort`. -- You can setup an Ingress (by default, the ingress is enabled in the k8s files). - -#### Setup Ingress (optional) - -```bash -kubectl get ingress -``` - -##### GKE Ingress - -For gke ingress, you need to set add `kubernetes.io/ingress.class` annotation to `gce` in the ingress yaml file. - -Here is an example of how it looks: - -```yaml -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: pdf2md-ingress - annotations: - kubernetes.io/ingress.class: "gce" -spec: - defaultBackend: - service: - name: pdf2md-api - port: - number: 80 -``` - -NAME CLASS HOSTS ADDRESS PORTS AGE -pdf2md-ingress * 34.107.134.128 80 4h33m -``` - -##### EKS Ingress - -For eks you need to set kubernetes.io/ingress.class to `alb` and set `spec.ingressClassName` to `alb` in the ingress yaml file. - -```yaml -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: pdf2md-ingress - annotations: - kubernetes.io/ingress.class: "alb" -spec: - ingressClassName: "alb" - defaultBackend: - service: - name: pdf2md-api - port: - number: 80 -``` +Made with ❤️ in San Francisco diff --git a/pdf2md/SELF-HOSTING.md b/pdf2md/SELF-HOSTING.md new file mode 100644 index 0000000000..74f5f3fdb8 --- /dev/null +++ b/pdf2md/SELF-HOSTING.md @@ -0,0 +1,107 @@ +# Deploying + +### Docker Compose + +Use the docker-compose-prod.yaml file to deploy the application. 
+ +```bash +docker compose -f docker-compose-prod.yaml up -d +``` + +You can either choose to build locally or pull the pre-built images from Docker Hub. + +#### Build Options +##### Build On Machine: + +```bash +docker compose -f docker-compose-prod.yaml up -d --build +``` + +##### Use Pre-built Images: +```bash +docker compose -f docker-compose-prod.yaml up -d --pull always +``` + +#### Setup Caddy reverse proxy (optional) + +Set up a Caddyfile with the following content: + +```bash +# Global options +{ + email developer@example.com +} + +# Define a site block for pdftomd.example.com +pdftomd.example.com { + reverse_proxy localhost:8081 +} +``` + +Start the Caddy reverse proxy. This should also handle your SSL. + +```bash +sudo systemctl reload caddy.service +``` + +### Kubernetes + +```bash +kubectl apply -f k8s/ +``` + +You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local` +To access it from outside the cluster: +- You can use a service of type `LoadBalancer` or `NodePort`. +- You can set up an Ingress (by default, the ingress is enabled in the k8s files). + +#### Setup Ingress (optional) + +```bash +kubectl get ingress +``` + +##### GKE Ingress + +For GKE ingress, you need to set the `kubernetes.io/ingress.class` annotation to `gce` in the ingress yaml file. + +Here is an example of how it looks: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: pdf2md-ingress + annotations: + kubernetes.io/ingress.class: "gce" +spec: + defaultBackend: + service: + name: pdf2md-api + port: + number: 80 +``` + +NAME CLASS HOSTS ADDRESS PORTS AGE +pdf2md-ingress * 34.107.134.128 80 4h33m +``` + +##### EKS Ingress + +For EKS you need to set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the ingress yaml file. 
+ +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: pdf2md-ingress + annotations: + kubernetes.io/ingress.class: "alb" +spec: + ingressClassName: "alb" + defaultBackend: + service: + name: pdf2md-api + port: + number: 80 +``` diff --git a/pdf2md/server/src/templates/skeleton.html b/pdf2md/server/src/templates/skeleton.html index c5b77bbb81..3943a003a1 100644 --- a/pdf2md/server/src/templates/skeleton.html +++ b/pdf2md/server/src/templates/skeleton.html @@ -91,7 +91,7 @@ Meet With Sales