content: update pdf2md README and add SELF-HOSTING.md guide
1 parent afeab01, commit e202855
Showing 7 changed files with 204 additions and 135 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
The MIT License (MIT) | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,153 +1,93 @@ | ||
# Contributing to PDF2MD | ||
<p align="center"> | ||
<img height="100" src="https://trieve.b-cdn.net/trieve-logo.png" alt="Trieve Logo"> | ||
</p> | ||
<p align="center"> | ||
<strong><a href="https://pdf2md.trieve.ai/redoc">API reference</a> | <a href="https://cal.com/nick.k/meet">Meet a Maintainer</a> | <a href="https://discord.gg/eBJXXZDB8z">Discord</a> | <a href="https://matrix.to/#/#trieve-general:trieve.ai">Matrix</a> | <a href="mailto:[email protected]">[email protected]</a> | ||
</strong> | ||
</p> | ||
|
||
## Project Setup | ||
<p align="center"> | ||
<a href="https://github.com/devflowinc/trieve/stargazers"> | ||
<img src="https://img.shields.io/github/stars/devflowinc/trieve.svg?style=flat&color=yellow" alt="Github stars"/> | ||
</a> | ||
<a href="https://discord.gg/CuJVfgZf54"> | ||
<img src="https://img.shields.io/discord/1130153053056684123.svg?label=Discord&logo=Discord&colorB=7289da&style=flat" alt="Join Discord"/> | ||
</a> | ||
<a href="https://matrix.to/#/#trieve-general:trieve.ai"> | ||
<img src="https://img.shields.io/badge/matrix-join-purple?style=flat&logo=matrix&logocolor=white" alt="Join Matrix"/> | ||
</a> | ||
</p> | ||
|
||
### Setup ENV's | ||
<h1 align="center">🦀 PDF2MD 🦀</h1> | ||
|
||
```bash | ||
cd server | ||
cp ../.env.dist .env | ||
``` | ||
<h2 align="center"> | ||
<b>Self-hostable API server and pipeline for converting PDFs to markdown using thrifty vision-capable large language models like GPT-4o-mini and gemini-flash-1.5.</b> | ||
</h2> | ||
|
||
### Start docker dependency services | ||
<h4 align="center">Written in Rust. Try at <a href="https://pdf2md.trieve.ai">pdf2md.trieve.ai</a>.</h4> | ||
|
||
- redis | ||
- s3 | ||
- clickhouse-db | ||
[![PDF2MD service preview](https://cdn.trieve.ai/pdf2md/pdf2md-preview.webp)](https://pdf2md.trieve.ai) | ||
|
||
```bash | ||
docker compose up -d | ||
``` | ||
## The Stack | ||
|
||
### Run Server + Workers | ||
There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with `docker-compose` or `kube` following the [SELF-HOSTING guide here](https://github.com/devflowinc/trieve/tree/main/pdf2md/SELF-HOSTING.md). | ||
|
||
We strongly recommend using tmux or another terminal multiplexer to handle the different processes. | ||
- [minijinja templates](https://github.com/mitsuhiko/minijinja) for the [UI](https://pdf2md.trieve.ai) | ||
- there was no way I was going to write more JSX | ||
- [PDFObject](https://github.com/pipwerks/pdfobject) to view PDFs in the [demo UI](https://pdf2md.trieve.ai). | ||
- [actix/actix-web](https://github.com/actix/actix-web) for the HTTP server | ||
- [fun redis queue macro system](https://github.com/devflowinc/trieve/blob/main/pdf2md/server/src/operators/redis.rs#L7-L62) for worker pattern async processing | ||
- redis queues are a core part of our infra for Trieve, but we made our system a lot more repeatable with this macro | ||
- there will be a future release of this macro in an isolated crate | ||
- [Clickhouse](https://github.com/ClickHouse/ClickHouse) for task storage | ||
- we have had a surprising amount of Postgres issues (especially write locks) building Trieve, so Clickhouse as the primary data store here is cool | ||
- [MinIO S3](https://github.com/minio/minio) for file storage | ||
|
||
```bash | ||
cargo watch -x run #HTTP server | ||
cargo run --bin supervisor-worker | ||
cargo run --bin chunk-worker | ||
``` | ||
## How does PDF2MD work? | ||
|
||
### Run tailwindcss server for demo UI | ||
Workers horizontally scale on-demand to handle high volume periods. Usually `chunk-worker` needs to scale before `supervisor-worker`. Pages for a given `Task` stream in as the `chunk-worker` calls out to the LLM to get markdown for them. | ||
|
||
``` | ||
npx tailwindcss -i ./static/in.css -o ./static/output.css --watch | ||
``` | ||
### 1. HTTP server | ||
|
||
### Testing using the CLI | ||
1. HTTP server receives a base64 encoded PDF and decodes it | ||
2. Creates `FileTask` for document in ClickHouse | ||
3. Adds `FileTask` along with the base64 encoded file to `files_to_process` queue in Redis | ||
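
The HTTP server stage described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's Rust code: the `FileTask` fields, the `handle_upload` name, and the in-memory `task_store` and `files_to_process` stand-ins for ClickHouse and Redis are all assumptions made for the example.

```python
import base64
import uuid
from collections import deque
from dataclasses import dataclass


@dataclass
class FileTask:
    # Hypothetical shape; the real FileTask fields are not shown in this README.
    id: str
    file_name: str


# In-memory stand-ins (assumptions) for ClickHouse and the Redis queue.
task_store = {}
files_to_process = deque()


def handle_upload(file_name: str, base64_file: str) -> FileTask:
    # Decode once up front so an invalid payload is rejected before queueing.
    base64.b64decode(base64_file, validate=True)
    # Record the task so its status can be looked up later.
    task = FileTask(id=str(uuid.uuid4()), file_name=file_name)
    task_store[task.id] = task
    # Enqueue the task together with the still-encoded file for the supervisor.
    files_to_process.append((task, base64_file))
    return task
```

Passing the encoded file along with the task keeps the queue payload a plain string, which is one plausible reason for not storing the decoded bytes at this stage.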
|
||
Make your changes then use the following to run: | ||
### 2. Supervisor Worker | ||
|
||
```bash | ||
cd cli | ||
cargo run -- help #or other command instead of help | ||
``` | ||
1. `supervisor-worker` continuously polls the `files_to_process` Redis queue until it grabs a `FileTask` and its base64 | ||
2. Decodes the base64 into a PDF and puts the PDF into S3 | ||
3. Splits the PDF into pages, converts them to JPEGs | ||
4. Puts each JPEG page image into S3 | ||
5. Pushes a `ChunkingTask` for each page into the `files_to_chunk` Redis queue | ||
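
As a rough sketch of the supervisor loop described above (in Python rather than the project's Rust, with in-memory stand-ins for Redis and S3), one iteration might look like this. The key layout, the dictionary shape of a `ChunkingTask`, and `split_into_page_images` are hypothetical; the placeholder splitter only separates bytes on form feeds and does not rasterize real PDF pages.

```python
import base64
from collections import deque

# In-memory stand-ins (assumptions) for the Redis queues and the S3 bucket.
files_to_process = deque()
files_to_chunk = deque()
object_store = {}


def split_into_page_images(pdf_bytes: bytes) -> list:
    # Placeholder: a real worker would render each PDF page to a JPEG.
    # Pages are separated by form feeds here purely for illustration.
    return [page for page in pdf_bytes.split(b"\f") if page]


def supervise_once() -> int:
    """Process one queued file and return how many page tasks were pushed."""
    if not files_to_process:
        return 0
    task_id, base64_file = files_to_process.popleft()
    pdf_bytes = base64.b64decode(base64_file)
    object_store[f"{task_id}/original.pdf"] = pdf_bytes
    pages = split_into_page_images(pdf_bytes)
    for page_num, image in enumerate(pages, start=1):
        key = f"{task_id}/page-{page_num}.jpeg"
        object_store[key] = image
        # One ChunkingTask per page, so chunk workers can run in parallel.
        files_to_chunk.append(
            {"task_id": task_id, "page_num": page_num, "image_key": key}
        )
    return len(pages)
```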
|
||
## Deploying | ||
### 3. Chunk Worker | ||
|
||
### Docker Compose | ||
1. `chunk-worker` continuously polls the `files_to_chunk` Redis queue until it grabs a `ChunkingTask` | ||
2. Gets its page image from S3 | ||
3. Sends the image to the LLM provider at `LLM_BASE_URL` along with the `prompt` and `model` on the request to get markdown | ||
4. Updates the task with the markdown for the page | ||
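
The chunk worker steps above could be sketched like this. It is an illustrative Python outline, not the actual Rust worker: the task dictionary keys, the `task_pages` store standing in for ClickHouse, and the injected `call_llm` function (which would wrap a request to the provider at `LLM_BASE_URL`) are assumptions.

```python
from collections import deque
from typing import Callable

# In-memory stand-ins (assumptions) for Redis, S3, and ClickHouse.
files_to_chunk = deque()
object_store = {}
task_pages = {}  # task_id -> {page_num: markdown}


def chunk_once(call_llm: Callable[[bytes, str, str], str]) -> bool:
    """Convert one queued page to markdown; return False if the queue is empty."""
    if not files_to_chunk:
        return False
    task = files_to_chunk.popleft()
    # Fetch the page image that the supervisor stored earlier.
    image = object_store[task["image_key"]]
    # The prompt and model travel with the task, as they came from the request.
    markdown = call_llm(image, task["prompt"], task["model"])
    # Store the page result so pages can stream in as they finish.
    task_pages.setdefault(task["task_id"], {})[task["page_num"]] = markdown
    return True
```

Injecting `call_llm` keeps the sketch testable without network access; a real worker would replace it with an HTTP call to the configured provider.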
|
||
Use the docker-compose-prod.yaml file to deploy the application. | ||
## Why Make This? | ||
|
||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d | ||
``` | ||
Trieve has used [Apache Tika](https://tika.apache.org/) to process various filetypes for the past year, which means that files with complex layouts and diagrams have been poorly ingested. | ||
|
||
You can either choose to build locally or pull the pre-built images from Docker Hub. | ||
We saw [OmniAI](https://github.com/getomni-ai) launch [zerox](https://github.com/getomni-ai/zerox) and show that 4o-mini was a viable and cheap way to handle these filetypes, and decided it was time to integrate something better than Tika into Trieve. | ||
|
||
#### Build Options | ||
##### Build On Machine: | ||
We previously lightly contributed to [Chunkr](https://github.com/lumina-ai-inc/chunkr), which is a more advanced system that leverages layout detection and dedicated OCR models to process documents, but we still felt the need to build something ourselves since it was a bit complex to work into Trieve's local dev and self-hosting setup. Zerox's approach of using just a vision LLM was ideal and is the path we went with. | ||
|
||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d --build | ||
``` | ||
We wrote our own API server and pipeline using Rust, Redis queues, and Clickhouse in the Trieve-style to achieve this. Try it using our demo UI hosted at [pdf2md.trieve.ai](https://pdf2md.trieve.ai). | ||
|
||
##### Use Pre-built Images: | ||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d --pull always | ||
``` | ||
## Roadmap | ||
|
||
#### Setup Caddy reverse proxy (optional) | ||
Please contribute if you can! We could use help 🙏. | ||
|
||
Set up a Caddyfile with the following content: | ||
1. Rename everything from `chunk` to `page` because we eventually decided that we would only deal with PDF --> Markdown conversion and not chunking. Consider using [chonkie](https://github.com/bhavnicksm/chonkie) with the markdown output for this. | ||
2. Use [Clickhouse MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree) instead of updating `Task`'s in Clickhouse as that's more correct. | ||
3. `supervisor-worker` can get overwhelmed when it receives a large PDF as splitting into pages can take a while. There should be something better here. | ||
4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier. | ||
5. Users should be able to point `PDF2MD` at an S3 bucket and let it process all of them automatically instead of having to send each file 1 by 1 🤮. | ||
|
||
```bash | ||
# Global options | ||
{ | ||
email [email protected] | ||
} | ||
--- | ||
|
||
# Define a site block for pdftomd.example.com | ||
pdftomd.example.com { | ||
reverse_proxy localhost:8081 | ||
} | ||
``` | ||
|
||
Start the Caddy reverse proxy. This should also handle your SSL. | ||
|
||
```bash | ||
sudo systemctl reload caddy.service | ||
``` | ||
|
||
### Kubernetes | ||
|
||
```bash | ||
kubectl apply -f k8s/ | ||
``` | ||
|
||
You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local`. | ||
To access it from outside the cluster: | ||
- You can use a service of type `LoadBalancer` or `NodePort`. | ||
- You can set up an Ingress (by default, the ingress is enabled in the k8s files). | ||
|
||
#### Setup Ingress (optional) | ||
|
||
```bash | ||
kubectl get ingress | ||
``` | ||
|
||
##### GKE Ingress | ||
|
||
For GKE ingress, you need to add the `kubernetes.io/ingress.class` annotation, set to `gce`, in the ingress YAML file. | ||
|
||
Here is an example of how it looks: | ||
|
||
```yaml | ||
apiVersion: networking.k8s.io/v1 | ||
kind: Ingress | ||
metadata: | ||
  name: pdf2md-ingress | ||
  annotations: | ||
    kubernetes.io/ingress.class: "gce" | ||
spec: | ||
  defaultBackend: | ||
    service: | ||
      name: pdf2md-api | ||
      port: | ||
        number: 80 | ||
``` | ||
Example `kubectl get ingress` output: | ||
``` | ||
NAME CLASS HOSTS ADDRESS PORTS AGE | ||
pdf2md-ingress <none> * 34.107.134.128 80 4h33m | ||
``` | ||
|
||
##### EKS Ingress | ||
|
||
For EKS, you need to set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the ingress YAML file. | ||
|
||
```yaml | ||
apiVersion: networking.k8s.io/v1 | ||
kind: Ingress | ||
metadata: | ||
  name: pdf2md-ingress | ||
  annotations: | ||
    kubernetes.io/ingress.class: "alb" | ||
spec: | ||
  ingressClassName: "alb" | ||
  defaultBackend: | ||
    service: | ||
      name: pdf2md-api | ||
      port: | ||
        number: 80 | ||
``` | ||
Made with ❤️ in San Francisco |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# Deploying | ||
|
||
### Docker Compose | ||
|
||
Use the docker-compose-prod.yaml file to deploy the application. | ||
|
||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d | ||
``` | ||
|
||
You can either choose to build locally or pull the pre-built images from Docker Hub. | ||
|
||
#### Build Options | ||
##### Build On Machine: | ||
|
||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d --build | ||
``` | ||
|
||
##### Use Pre-built Images: | ||
```bash | ||
docker compose -f docker-compose-prod.yaml up -d --pull always | ||
``` | ||
|
||
#### Setup Caddy reverse proxy (optional) | ||
|
||
Set up a Caddyfile with the following content: | ||
|
||
```bash | ||
# Global options | ||
{ | ||
email [email protected] | ||
} | ||
|
||
# Define a site block for pdftomd.example.com | ||
pdftomd.example.com { | ||
reverse_proxy localhost:8081 | ||
} | ||
``` | ||
|
||
Start the Caddy reverse proxy. This should also handle your SSL. | ||
|
||
```bash | ||
sudo systemctl reload caddy.service | ||
``` | ||
|
||
### Kubernetes | ||
|
||
```bash | ||
kubectl apply -f k8s/ | ||
``` | ||
|
||
You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local`. | ||
To access it from outside the cluster: | ||
- You can use a service of type `LoadBalancer` or `NodePort`. | ||
- You can set up an Ingress (by default, the ingress is enabled in the k8s files). | ||
|
||
#### Setup Ingress (optional) | ||
|
||
```bash | ||
kubectl get ingress | ||
``` | ||
|
||
##### GKE Ingress | ||
|
||
For GKE ingress, you need to add the `kubernetes.io/ingress.class` annotation, set to `gce`, in the ingress YAML file. | ||
|
||
Here is an example of how it looks: | ||
|
||
```yaml | ||
apiVersion: networking.k8s.io/v1 | ||
kind: Ingress | ||
metadata: | ||
  name: pdf2md-ingress | ||
  annotations: | ||
    kubernetes.io/ingress.class: "gce" | ||
spec: | ||
  defaultBackend: | ||
    service: | ||
      name: pdf2md-api | ||
      port: | ||
        number: 80 | ||
``` | ||
Example `kubectl get ingress` output: | ||
``` | ||
NAME CLASS HOSTS ADDRESS PORTS AGE | ||
pdf2md-ingress <none> * 34.107.134.128 80 4h33m | ||
``` | ||
|
||
##### EKS Ingress | ||
|
||
For EKS, you need to set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the ingress YAML file. | ||
|
||
```yaml | ||
apiVersion: networking.k8s.io/v1 | ||
kind: Ingress | ||
metadata: | ||
  name: pdf2md-ingress | ||
  annotations: | ||
    kubernetes.io/ingress.class: "alb" | ||
spec: | ||
  ingressClassName: "alb" | ||
  defaultBackend: | ||
    service: | ||
      name: pdf2md-api | ||
      port: | ||
        number: 80 | ||
``` |