diff --git a/README.md b/README.md
index 743a5c9ff8..254e3c6b5b 100644
--- a/README.md
+++ b/README.md
@@ -10,9 +10,6 @@
-
-
-
@@ -25,7 +22,7 @@
All-in-one solution for search, recommendations, and RAG
-[![Trieve dashboard preivew](https://cdn.trieve.ai/dashboard.webp)](https://dashboard.trieve.ai)
+[![Trieve dashboard preview](https://cdn.trieve.ai/landing-tabs/dark-mode-docsearch.webp)](https://dashboard.trieve.ai)
## Quick Links
diff --git a/pdf2md/.env.dist b/pdf2md/.env.dist
index 6fd9317487..982f95e615 100644
--- a/pdf2md/.env.dist
+++ b/pdf2md/.env.dist
@@ -26,6 +26,6 @@ LLM_MODEL=gpt-4o-mini
# PDF2MD HTTP API server
API_KEY=admin
-# Chunkr - Get your API key from https://chunkr.ai
+# OPTIONAL: Chunkr - Get your API key from https://chunkr.ai
CHUNKR_API_URL=https://api.chunkr.ai
CHUNKR_API_KEY=*********************
\ No newline at end of file
diff --git a/pdf2md/CONTRIBUTING.md b/pdf2md/CONTRIBUTING.md
index 12501286a7..f74cb7bf87 100644
--- a/pdf2md/CONTRIBUTING.md
+++ b/pdf2md/CONTRIBUTING.md
@@ -4,10 +4,16 @@
```bash
cd server
-cp .env.dist .env
+cp ../.env.dist .env
```
-## Run dep processes
+Set `LLM_API_KEY` to your key for OpenRouter, OpenAI, LiteLLM, or whichever OpenAI-compatible API you point `LLM_BASE_URL` at.
+
+If you want to support Chunkr, get an API key for their service from [chunkr.ai](https://chunkr.ai) and set it as the value of `CHUNKR_API_KEY`.
+
+## Run dependency services
+
+This will start MinIO S3, Clickhouse, and Redis.
```bash
docker compose --profile dev up -d
diff --git a/pdf2md/LICENSE b/pdf2md/LICENSE
new file mode 100644
index 0000000000..d13cc4b26a
--- /dev/null
+++ b/pdf2md/LICENSE
@@ -0,0 +1,19 @@
+The MIT License (MIT)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/pdf2md/README.md b/pdf2md/README.md
index 1a3a507134..0dca7e99b9 100644
--- a/pdf2md/README.md
+++ b/pdf2md/README.md
@@ -1,153 +1,93 @@
-# Contributing to PDF2MD
+
+
+
+
+API reference | Meet a Maintainer | Discord | Matrix | humans@trieve.ai
+
+
-## Project Setup
+
+
+
+
+
+
+
+
+
+
+
-### Setup ENV's
+🦀 PDF2MD 🦀
-```bash
-cd server
-cp ../.env.dist .env
-```
+
+ Self-hostable API server and pipeline for converting PDFs to Markdown using inexpensive vision-capable large language models like GPT-4o-mini and gemini-flash-1.5.
+
-### Start docker dependency services
+
-- redis
-- s3
-- clickhouse-db
+[![PDF2MD service preview](https://cdn.trieve.ai/pdf2md/pdf2md-preview.webp)](https://pdf2md.trieve.ai)
-```bash
-docker compose up -d
-```
+## The Stack
-### Run Server + Workers
+There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with Docker Compose or Kubernetes by following the [SELF-HOSTING guide](https://github.com/devflowinc/trieve/tree/main/pdf2md/SELF-HOSTING.md).
-Strongly recommend using tmux or another multiplex system to handle the different proceses.
+- [minijinja templates](https://github.com/mitsuhiko/minijinja) for the [UI](https://pdf2md.trieve.ai)
+ - there was no way I was going to write more JSX
+- [PDFObject](https://github.com/pipwerks/pdfobject) to view PDFs in the [demo UI](https://pdf2md.trieve.ai)
+- [actix/actix-web](https://github.com/actix/actix-web) for the HTTP server
+- [fun redis queue macro system](https://github.com/devflowinc/trieve/blob/main/pdf2md/server/src/operators/redis.rs#L7-L62) for worker pattern async processing
+ - redis queues are a core part of our infra for Trieve, but we made our system a lot more repeatable with this macro
+ - there will be a future release of this macro in an isolated crate
+- [Clickhouse](https://github.com/ClickHouse/ClickHouse) for task storage
+ - we have had a surprising amount of Postgres issues (especially write locks) building Trieve, so Clickhouse as the primary data store here is cool
+- [MinIO S3](https://github.com/minio/minio) for file storage
-```bash
-cargo watch -x run #HTTP server
-cargo run --bin supervisor-worker
-cargo run --bin chunk-worker
-```
+## How does PDF2MD work?
-### Run tailwindcss server for demo UI
+Workers scale horizontally on demand to handle high-volume periods. Usually `chunk-worker` needs to scale before `supervisor-worker`. Pages for a given `Task` stream in as the `chunk-worker` calls out to the LLM to get markdown for them.
-```
-npx tailwindcss -i ./static/in.css -o ./static/output.css --watch
-```
+### 1. HTTP server
-### Testing using the CLI
+1. HTTP server receives a base64-encoded PDF and decodes it
+2. Creates `FileTask` for document in ClickHouse
+3. Adds `FileTask` along with the base64-encoded file to `files_to_process` queue in Redis
-Make your changes then use the following to run:
+### 2. Supervisor Worker
-```bash
-cd cli
-cargo run -- help #or other command instead of help
-```
+1. `supervisor-worker` continuously polls the `files_to_process` Redis queue until it grabs a `FileTask` and its base64-encoded file
+2. Decodes the base64 into a PDF and puts the PDF into S3
+3. Splits the PDF into pages, converts them to JPEGs
+4. Puts each JPEG page image into S3
+5. Pushes a `ChunkingTask` for each page into the `files_to_chunk` Redis queue
-## Deploying
+### 3. Chunk Worker
-### Docker Compose
+1. `chunk-worker` continuously polls the `files_to_chunk` Redis queue until it grabs a `ChunkingTask`
+2. Gets its page image from S3
+3. Sends the image to the LLM provider at `LLM_BASE_URL` along with the `prompt` and `model` on the request to get markdown
+4. Updates the task with the markdown for the page
-Use the docker-compose-prod.yaml file to deploy the application.
+## Why Make This?
-```bash
-docker compose up -f docker-compose-prod.yaml -d
-```
+Trieve has used [Apache Tika](https://tika.apache.org/) to process various filetypes for the past year, which means that files with complex layouts and diagrams have been poorly ingested.
-You can either chose to build locally or pull the pre-built images from the docker hub.
+We saw [OmniAI](https://github.com/getomni-ai) launch [zerox](https://github.com/getomni-ai/zerox) and show that 4o-mini was a viable and cheap way to handle these filetypes, and decided it was time to integrate something better than Tika into Trieve.
-#### Build Options
-##### Build On Machine:
+We previously contributed lightly to [Chunkr](https://github.com/lumina-ai-inc/chunkr), a more advanced system that leverages layout detection and dedicated OCR models to process documents, but still felt the need to build something ourselves since it was a bit complex to work into Trieve's local dev and self-hosting setup. Zerox's approach of using just a vision language model (VLM) was ideal and is the path we went with.
-```bash
-docker compose up -f docker-compose-prod.yaml -d --build
-```
+We wrote our own API server and pipeline using Rust, Redis queues, and Clickhouse in the Trieve-style to achieve this. Try it using our demo UI hosted at [pdf2md.trieve.ai](https://pdf2md.trieve.ai).
-##### Use Pre-built Images:
-```bash
-docker compose up -f docker-compose-prod.yaml -d --pull always
-```
+## Roadmap
-#### Setup Caddy reverse proxy (optional)
+Please contribute if you can! We could use help 🙏.
-Setup a Caddyfile with the following content:
+1. Rename everything from `chunk` to `page` because we eventually decided that we would only deal with PDF --> Markdown conversion and not chunking. Consider using [chonkie](https://github.com/bhavnicksm/chonkie) with the markdown output for chunking.
+2. Use [Clickhouse MergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree) instead of updating `Task`s in Clickhouse as that's more correct.
+3. `supervisor-worker` can get overwhelmed when it receives a large PDF as splitting into pages can take a while. There should be something better here.
+4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier.
+5. Users should be able to point `PDF2MD` at an S3 bucket and let it process all of the files automatically instead of having to send each file one by one 🤮.
-```bash
-# Global options
-{
- email developer@example.com
-}
+---
-# Define a site block for pdftomd.example.com
-pdftomd.example.com {
- reverse_proxy localhost:8081
-}
-```
-
-Start the caddy reverse proxy. This should also handle your ssl
-
-```bash
-sudo systemctl reload caddy.service
-```
-
-### Kubernetes
-
-```bash
-kubectl apply -f k8s/
-```
-
-You can now access pdf2md within the kubernetes cluster at `http://pdf2md.default.svc.cluster.local`
-To access it from outside the cluster:
-- You can use a service of type `LoadBalancer` or `NodePort`.
-- You can setup an Ingress (by default, the ingress is enabled in the k8s files).
-
-#### Setup Ingress (optional)
-
-```bash
-kubectl get ingress
-```
-
-##### GKE Ingress
-
-For gke ingress, you need to set add `kubernetes.io/ingress.class` annotation to `gce` in the ingress yaml file.
-
-Here is an example of how it looks:
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: Ingress
-metadata:
- name: pdf2md-ingress
- annotations:
- kubernetes.io/ingress.class: "gce"
-spec:
- defaultBackend:
- service:
- name: pdf2md-api
- port:
- number: 80
-```
-
-NAME CLASS HOSTS ADDRESS PORTS AGE
-pdf2md-ingress * 34.107.134.128 80 4h33m
-```
-
-##### EKS Ingress
-
-For eks you need to set kubernetes.io/ingress.class to `alb` and set `spec.ingressClassName` to `alb` in the ingress yaml file.
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: Ingress
-metadata:
- name: pdf2md-ingress
- annotations:
- kubernetes.io/ingress.class: "alb"
-spec:
- ingressClassName: "alb"
- defaultBackend:
- service:
- name: pdf2md-api
- port:
- number: 80
-```
+Made with ❤️ in San Francisco
diff --git a/pdf2md/SELF-HOSTING.md b/pdf2md/SELF-HOSTING.md
new file mode 100644
index 0000000000..74f5f3fdb8
--- /dev/null
+++ b/pdf2md/SELF-HOSTING.md
@@ -0,0 +1,107 @@
+# Deploying
+
+### Docker Compose
+
+Use the docker-compose-prod.yaml file to deploy the application.
+
+```bash
+docker compose -f docker-compose-prod.yaml up -d
+```
+
+You can either build locally or pull the pre-built images from Docker Hub.
+
+#### Build Options
+##### Build On Machine:
+
+```bash
+docker compose -f docker-compose-prod.yaml up -d --build
+```
+
+##### Use Pre-built Images:
+```bash
+docker compose -f docker-compose-prod.yaml up -d --pull always
+```
+
+#### Setup Caddy reverse proxy (optional)
+
+Set up a Caddyfile with the following content:
+
+```bash
+# Global options
+{
+ email developer@example.com
+}
+
+# Define a site block for pdftomd.example.com
+pdftomd.example.com {
+ reverse_proxy localhost:8081
+}
+```
+
+Start the Caddy reverse proxy. This should also handle your SSL certificates.
+
+```bash
+sudo systemctl reload caddy.service
+```
+
+### Kubernetes
+
+```bash
+kubectl apply -f k8s/
+```
+
+You can now access pdf2md within the Kubernetes cluster at `http://pdf2md.default.svc.cluster.local`.
+To access it from outside the cluster:
+- You can use a service of type `LoadBalancer` or `NodePort`.
+- You can set up an Ingress (by default, the ingress is enabled in the k8s files).
+
+#### Setup Ingress (optional)
+
+```bash
+kubectl get ingress
+```
+
+##### GKE Ingress
+
+For GKE ingress, set the `kubernetes.io/ingress.class` annotation to `gce` in the ingress YAML file.
+
+Here is an example of how it looks:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: pdf2md-ingress
+  annotations:
+    kubernetes.io/ingress.class: "gce"
+spec:
+  defaultBackend:
+    service:
+      name: pdf2md-api
+      port:
+        number: 80
+```
+
+```
+NAME             CLASS    HOSTS   ADDRESS          PORTS   AGE
+pdf2md-ingress   <none>   *       34.107.134.128   80      4h33m
+```
+
+##### EKS Ingress
+
+For EKS, set the `kubernetes.io/ingress.class` annotation to `alb` and set `spec.ingressClassName` to `alb` in the ingress YAML file.
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: pdf2md-ingress
+  annotations:
+    kubernetes.io/ingress.class: "alb"
+spec:
+  ingressClassName: "alb"
+  defaultBackend:
+    service:
+      name: pdf2md-api
+      port:
+        number: 80
+```
diff --git a/pdf2md/server/src/templates/skeleton.html b/pdf2md/server/src/templates/skeleton.html
index c5b77bbb81..3943a003a1 100644
--- a/pdf2md/server/src/templates/skeleton.html
+++ b/pdf2md/server/src/templates/skeleton.html
@@ -91,7 +91,7 @@
Meet With Sales