content: update pdf2md README and add SELF-HOSTING.md guide
skeptrunedev authored and cdxker committed Nov 21, 2024
1 parent afeab01 commit e202855
Showing 7 changed files with 204 additions and 135 deletions.
5 changes: 1 addition & 4 deletions README.md
@@ -10,9 +10,6 @@
<a href="https://github.com/devflowinc/trieve/stargazers">
<img src="https://img.shields.io/github/stars/devflowinc/trieve.svg?style=flat&color=yellow" alt="Github stars"/>
</a>
<a href="https://github.com/devflowinc/trieve/issues">
<img src="https://img.shields.io/github/issues/devflowinc/trieve.svg?style=flat&color=success" alt="GitHub issues"/>
</a>
<a href="https://discord.gg/CuJVfgZf54">
<img src="https://img.shields.io/discord/1130153053056684123.svg?label=Discord&logo=Discord&colorB=7289da&style=flat" alt="Join Discord"/>
</a>
@@ -25,7 +22,7 @@
<b>All-in-one solution for search, recommendations, and RAG</b>
</h2>

[![Trieve dashboard preivew](https://cdn.trieve.ai/dashboard.webp)](https://dashboard.trieve.ai)
[![Trieve dashboard preivew](https://cdn.trieve.ai/landing-tabs/dark-mode-docsearch.webp)](https://dashboard.trieve.ai)

## Quick Links

2 changes: 1 addition & 1 deletion pdf2md/.env.dist
@@ -26,6 +26,6 @@ LLM_MODEL=gpt-4o-mini
# PDF2MD HTTP API server
API_KEY=admin

# Chunkr - Get your API key from https://chunkr.ai
# OPTIONAL: Chunkr - Get your API key from https://chunkr.ai
CHUNKR_API_URL=https://api.chunkr.ai
CHUNKR_API_KEY=*********************
10 changes: 8 additions & 2 deletions pdf2md/CONTRIBUTING.md
@@ -4,10 +4,16 @@

```bash
cd server
cp .env.dist .env
cp .env.dist ./server/.env
```

## Run dep processes
Replace `LLM_API_KEY` with your key for OpenRouter, OpenAI, LiteLLM, or whichever OpenAI-compatible API you have set as `LLM_BASE_URL`.

If you want to support Chunkr then you can get an API key for their service from [chunkr.ai](https://chunkr.ai) and set it as the value for `CHUNKR_API_KEY`.

## Run dependency services

This will start MinIO S3, Clickhouse, and Redis.

```bash
docker compose --profile dev up -d
19 changes: 19 additions & 0 deletions pdf2md/LICENSE
@@ -0,0 +1,19 @@
The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
194 changes: 67 additions & 127 deletions pdf2md/README.md
@@ -1,153 +1,93 @@
# Contributing to PDF2MD
<p align="center">
<img height="100" src="https://trieve.b-cdn.net/trieve-logo.png" alt="Trieve Logo">
</p>
<p align="center">
<strong><a href="https://pdf2md.trieve.ai/redoc">API reference</a> | <a href="https://cal.com/nick.k/meet">Meet a Maintainer</a> | <a href="https://discord.gg/eBJXXZDB8z">Discord</a> | <a href="https://matrix.to/#/#trieve-general:trieve.ai">Matrix</a> | <a href="mailto:[email protected]">[email protected]</a>
</strong>
</p>

## Project Setup
<p align="center">
<a href="https://github.com/devflowinc/trieve/stargazers">
<img src="https://img.shields.io/github/stars/devflowinc/trieve.svg?style=flat&color=yellow" alt="Github stars"/>
</a>
<a href="https://discord.gg/CuJVfgZf54">
<img src="https://img.shields.io/discord/1130153053056684123.svg?label=Discord&logo=Discord&colorB=7289da&style=flat" alt="Join Discord"/>
</a>
<a href="https://matrix.to/#/#trieve-general:trieve.ai">
<img src="https://img.shields.io/badge/matrix-join-purple?style=flat&logo=matrix&logocolor=white" alt="Join Matrix"/>
</a>
</p>

### Setup ENV's
<h1 align="center">🦀 PDF2MD 🦀</h1>

```bash
cd server
cp ../.env.dist .env
```
<h2 align="center">
<b>Self-hostable API server and pipeline for converting PDFs to Markdown using thrifty vision-capable large language models like GPT-4o-mini and gemini-flash-1.5.</b>
</h2>

### Start docker dependency services
<h4 align="center">Written in Rust. Try at <a href="https://pdf2md.trieve.ai">pdf2md.trieve.ai</a>.</h4>

- redis
- s3
- clickhouse-db
[![PDF2MD service preview](https://cdn.trieve.ai/pdf2md/pdf2md-preview.webp)](https://pdf2md.trieve.ai)

```bash
docker compose up -d
```
## The Stack

### Run Server + Workers
There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with `docker-compose` or `kube` following the [SELF-HOSTING guide here](https://github.com/devflowinc/trieve/tree/main/pdf2md/SELF-HOSTING.md).

We strongly recommend using tmux or another terminal multiplexer to handle the different processes.
- [minijinja templates](https://github.com/mitsuhiko/minijinja) for the [UI](https://pdf2md.trieve.ai)
- there was no way I was going to write more JSX
- [PDFObject](https://github.com/pipwerks/pdfobject) to view PDFs in the [demo UI](https://pdf2md.trieve.ai)
- [actix/actix-web](https://github.com/actix/actix-web) for the HTTP server
- [fun redis queue macro system](https://github.com/devflowinc/trieve/blob/main/pdf2md/server/src/operators/redis.rs#L7-L62) for worker pattern async processing
- redis queues are a core part of our infra for Trieve, but we made our system a lot more repeatable with this macro
- there will be a future release of this macro in an isolated crate
- [Clickhouse](https://github.com/ClickHouse/ClickHouse) for task storage
- we have had a surprising amount of Postgres issues (especially write locks) building Trieve, so Clickhouse as the primary data store here is cool
- [MinIO S3](https://github.com/minio/minio) for file storage

```bash
cargo watch -x run #HTTP server
cargo run --bin supervisor-worker
cargo run --bin chunk-worker
```
## How does PDF2MD work?

### Run tailwindcss server for demo UI
Workers horizontally scale on-demand to handle high volume periods. Usually `chunk-worker` needs to scale before `supervisor-worker`. Pages for a given `Task` stream in as the `chunk-worker` calls out to the LLM to get markdown for them.

```
npx tailwindcss -i ./static/in.css -o ./static/output.css --watch
```
### 1. HTTP server

### Testing using the CLI
1. HTTP server receives a base64 encoded PDF and decodes it
2. Creates `FileTask` for document in ClickHouse
3. Adds `FileTask` along with the base64 encoded file to `files_to_process` queue in Redis
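
For callers, step 1 means the PDF has to be base64 encoded before it is sent. Below is a minimal sketch of that encoding in dependency-free Rust. The function name `base64_encode` is hypothetical and not part of PDF2MD, and in a real client a maintained crate such as `base64` is probably the better choice.

```rust
// Standard base64 alphabet (RFC 4648, with `+` and `/`).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: encode raw bytes as padded base64 text.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit symbols, using '=' where the input ran out.
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}

fn main() {
    // Stand-in bytes; a real client would load the file with std::fs::read.
    let encoded = base64_encode(b"%PDF-1.7");
    println!("{encoded}");
}
```

The encoded string is what a client would place in the request body; the exact request shape is documented in the API reference linked at the top of this README.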

Make your changes then use the following to run:
### 2. Supervisor Worker

```bash
cd cli
cargo run -- help #or other command instead of help
```
1. `supervisor-worker` continuously polls the `files_to_process` Redis queue until it grabs a `FileTask` and its base64
2. Decodes the base64 into a PDF and puts the PDF into S3
3. Splits the PDF into pages, converts them to JPEGs
4. Puts each JPEG page image into S3
5. Pushes a `ChunkingTask` for each page into the `files_to_chunk` Redis queue

## Deploying
### 3. Chunk Worker

### Docker Compose
1. `chunk-worker` continuously polls the `files_to_chunk` Redis queue until it grabs a `ChunkingTask`
2. Gets its page image from S3
3. Sends the image to the LLM provider at `LLM_BASE_URL` along with the `prompt` and `model` on the request to get markdown
4. Updates the task with the markdown for the page
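
The three stages above can be sketched as a toy, single-process pipeline. This is an illustration under stated assumptions, not the actual implementation: the struct fields, `supervisor_step`, `chunk_step`, and `run_pipeline` are hypothetical names, in-memory `VecDeque`s stand in for the Redis queues, and a closure stands in for the LLM call. S3 and ClickHouse are left out entirely.

```rust
use std::collections::VecDeque;

// Hypothetical, simplified stand-ins for the task types named above.
#[derive(Debug, Clone, PartialEq)]
struct FileTask {
    id: u32,
    // Each string is a pretend page, so the sketch needs no PDF library.
    pages: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct ChunkingTask {
    task_id: u32,
    page_num: usize,
    page_image: String,
}

/// Supervisor stage: pop one FileTask and push one ChunkingTask per page.
/// Returns false when the `files_to_process` queue is empty.
fn supervisor_step(
    files_to_process: &mut VecDeque<FileTask>,
    files_to_chunk: &mut VecDeque<ChunkingTask>,
) -> bool {
    let Some(task) = files_to_process.pop_front() else {
        return false;
    };
    for (i, page) in task.pages.iter().enumerate() {
        files_to_chunk.push_back(ChunkingTask {
            task_id: task.id,
            page_num: i + 1,
            page_image: page.clone(),
        });
    }
    true
}

/// Chunk stage: pop one ChunkingTask and convert its page to markdown.
/// `page_to_markdown` is a placeholder for the request to the LLM provider.
fn chunk_step(
    files_to_chunk: &mut VecDeque<ChunkingTask>,
    page_to_markdown: impl Fn(&str) -> String,
) -> Option<(u32, usize, String)> {
    let task = files_to_chunk.pop_front()?;
    Some((task.task_id, task.page_num, page_to_markdown(&task.page_image)))
}

/// Run one file through all stages and collect (task id, page, markdown).
fn run_pipeline(task: FileTask) -> Vec<(u32, usize, String)> {
    let mut files_to_process = VecDeque::new();
    let mut files_to_chunk = VecDeque::new();

    // HTTP server stage: enqueue the incoming file.
    files_to_process.push_back(task);

    while supervisor_step(&mut files_to_process, &mut files_to_chunk) {}

    let mut results = Vec::new();
    while let Some(result) = chunk_step(&mut files_to_chunk, |img| format!("# {img}")) {
        results.push(result);
    }
    results
}

fn main() {
    let task = FileTask {
        id: 1,
        pages: vec!["page one".to_string(), "page two".to_string()],
    };
    for (task_id, page_num, markdown) in run_pipeline(task) {
        println!("task {task_id} page {page_num}: {markdown}");
    }
}
```

Because each page becomes its own queue entry, the chunk stage is the natural place to add workers, which matches the note above that `chunk-worker` usually needs to scale before `supervisor-worker`.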

Use the docker-compose-prod.yaml file to deploy the application.
## Why Make This?

```bash
docker compose -f docker-compose-prod.yaml up -d
```
Trieve has used [Apache Tika](https://tika.apache.org/) to process various filetypes for the past year, which has meant that files with complex layouts and diagrams were poorly ingested.

You can either choose to build locally or pull the pre-built images from Docker Hub.
We saw [OmniAI](https://github.com/getomni-ai) launch [zerox](https://github.com/getomni-ai/zerox) and show that GPT-4o-mini was a viable and cheap way to handle these filetypes, and decided it was time to integrate something better than Tika into Trieve.

#### Build Options
##### Build On Machine:
We previously contributed lightly to [Chunkr](https://github.com/lumina-ai-inc/chunkr), a more advanced system that leverages layout detection and dedicated OCR models to process documents, but still felt the need to build something ourselves since Chunkr was a bit complex to work into Trieve's local dev and self-hosting setup. Zerox's approach of using just a vision LLM was ideal and is the path we went with.

```bash
docker compose -f docker-compose-prod.yaml up -d --build
```
We wrote our own API server and pipeline using Rust, Redis queues, and Clickhouse in the Trieve-style to achieve this. Try it using our demo UI hosted at [pdf2md.trieve.ai](https://pdf2md.trieve.ai).

##### Use Pre-built Images:
```bash
docker compose -f docker-compose-prod.yaml up -d --pull always
```
## Roadmap

#### Setup Caddy reverse proxy (optional)
Please contribute if you can! We could use help 🙏.

Setup a Caddyfile with the following content:
1. Rename everything from `chunk` to `page`, because we eventually decided to handle only PDF --> Markdown conversion and not chunking. Consider using [chonkie](https://github.com/bhavnicksm/chonkie) with the markdown output for chunking.
2. Use [Clickhouse MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree) instead of updating `Task`'s in Clickhouse as that's more correct.
3. `supervisor-worker` can get overwhelmed when it receives a large PDF as splitting into pages can take a while. There should be something better here.
4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier.
5. Users should be able to point `PDF2MD` at an S3 bucket and let it process all of the files there automatically instead of having to send each file 1 by 1 🤮.

```bash
# Global options
{
email [email protected]
}
---

# Define a site block for pdftomd.example.com
pdftomd.example.com {
reverse_proxy localhost:8081
}
```

Reload Caddy to pick up the new Caddyfile. Caddy should also provision and renew your TLS certificates automatically.

```bash
sudo systemctl reload caddy.service
```

### Kubernetes

```bash
kubectl apply -f k8s/
```

You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local`.
To access it from outside the cluster:
- You can use a service of type `LoadBalancer` or `NodePort`.
- You can setup an Ingress (by default, the ingress is enabled in the k8s files).

#### Setup Ingress (optional)

```bash
kubectl get ingress
```

##### GKE Ingress

For GKE Ingress, set the `kubernetes.io/ingress.class` annotation to `gce` in the Ingress YAML file.

Here is an example of how it looks:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pdf2md-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: pdf2md-api
      port:
        number: 80
```

Example `kubectl get ingress` output:

```
NAME             CLASS    HOSTS   ADDRESS          PORTS   AGE
pdf2md-ingress   <none>   *       34.107.134.128   80      4h33m
```

##### EKS Ingress

For EKS, set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the Ingress YAML file.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pdf2md-ingress
  annotations:
    kubernetes.io/ingress.class: "alb"
spec:
  ingressClassName: "alb"
  defaultBackend:
    service:
      name: pdf2md-api
      port:
        number: 80
```
Made with ❤️ in San Francisco
107 changes: 107 additions & 0 deletions pdf2md/SELF-HOSTING.md
@@ -0,0 +1,107 @@
# Deploying

### Docker Compose

Use the docker-compose-prod.yaml file to deploy the application.

```bash
docker compose -f docker-compose-prod.yaml up -d
```

You can either choose to build locally or pull the pre-built images from Docker Hub.

#### Build Options
##### Build On Machine:

```bash
docker compose -f docker-compose-prod.yaml up -d --build
```

##### Use Pre-built Images:
```bash
docker compose -f docker-compose-prod.yaml up -d --pull always
```

#### Setup Caddy reverse proxy (optional)

Setup a Caddyfile with the following content:

```bash
# Global options
{
    email [email protected]
}

# Define a site block for pdftomd.example.com
pdftomd.example.com {
    reverse_proxy localhost:8081
}
```

Reload Caddy to pick up the new Caddyfile. Caddy should also provision and renew your TLS certificates automatically.

```bash
sudo systemctl reload caddy.service
```

### Kubernetes

```bash
kubectl apply -f k8s/
```

You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local`.
To access it from outside the cluster:
- You can use a service of type `LoadBalancer` or `NodePort`.
- You can setup an Ingress (by default, the ingress is enabled in the k8s files).

#### Setup Ingress (optional)

```bash
kubectl get ingress
```

##### GKE Ingress

For GKE Ingress, set the `kubernetes.io/ingress.class` annotation to `gce` in the Ingress YAML file.

Here is an example of how it looks:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pdf2md-ingress
  annotations:
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: pdf2md-api
      port:
        number: 80
```

Example `kubectl get ingress` output:

```
NAME             CLASS    HOSTS   ADDRESS          PORTS   AGE
pdf2md-ingress   <none>   *       34.107.134.128   80      4h33m
```

##### EKS Ingress

For EKS, set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the Ingress YAML file.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: pdf2md-ingress
  annotations:
    kubernetes.io/ingress.class: "alb"
spec:
  ingressClassName: "alb"
  defaultBackend:
    service:
      name: pdf2md-api
      port:
        number: 80
```