added docker instructions and fixed links on README
Ambika Sukla committed Jan 23, 2024
1 parent 22fb1fb commit e2ab5c3
Showing 2 changed files with 11 additions and 12 deletions.
16 changes: 11 additions & 5 deletions README.md
@@ -1,8 +1,9 @@
# About

-This repo provides the service code llmsherpa API to connect. This repo contains parsers for the following file formats:
+This repo provides the service code for llmsherpa API to connect.
+This repo contains custom RAG (retrieval augmented generation) friendly parsers for the following file formats:
### PDF
-The PDF parser is a rule based parser which uses text co-ordinates (boundary box) data from nlmatics modified version of tika found here https://github.com/nlmatics/nlm-tika.
+The PDF parser is a rule based parser which uses text co-ordinates (boundary box), graphics and font data from nlmatics modified version of tika found here https://github.com/nlmatics/nlm-tika.
The PDF parser works off text layer and also offers a OCR option (apply_ocr) to automatically use OCR if there are scanned pages in your PDFs. The OCR feature is based off a nlmatics modified version of tika which uses tesseract underneath.
Check out the notebook [pdf_visual_ingestor_step_by_step](notebooks/pdf_visual_ingestor_step_by_step.ipynb) to experiment directly with the PDF parser.
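
As a rough illustration of how a client might switch on the OCR option when calling a running ingestor, the sketch below builds a parse request URL. The endpoint path `/api/parseDocument`, the port `5010`, and the query parameter names `renderFormat` and `applyOcr` are assumptions that this commit does not state, so check them against the server you actually deploy. `build_parse_url` is a hypothetical helper, not part of this repo.

```python
from urllib.parse import urlencode


def build_parse_url(base_url: str, apply_ocr: bool = False, render_format: str = "all") -> str:
    """Build a parse request URL for an ingestor server.

    Assumption: the server exposes /api/parseDocument and reads the
    renderFormat and applyOcr query parameters. Verify both before use.
    """
    params = {"renderFormat": render_format}
    if apply_ocr:
        # Only ask for OCR when scanned pages are expected.
        params["applyOcr"] = "yes"
    return f"{base_url.rstrip('/')}/api/parseDocument?{urlencode(params)}"


print(build_parse_url("http://localhost:5010", apply_ocr=True))
```

The OCR flag is left off by default in this sketch, on the assumption that OCR is only worth requesting for PDFs known to contain scanned pages.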

@@ -18,15 +19,15 @@ The PDF Parser offers the following features:
10. OCR with boundary boxes

### HTML
-A special HTML parser that creates layout aware blocks that makes RAG performance better.
+A special HTML parser that creates layout aware blocks to make RAG performance better with higher quality chunks.
### Text
-A special text parser which tries to figure out lists, tables, headers etc. purely by looking at the text.
+A special text parser which tries to figure out lists, tables, headers etc. purely by looking at the text and no visual, font or bbox information.
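
To make the idea of text-only structure detection concrete, here is a minimal sketch that labels a single line from its characters alone. It is purely illustrative: the patterns, the eight-word header limit, and the name `classify_line` are assumptions made up for this example and are not the rules the repo's text parser uses.

```python
import re

# Hypothetical patterns for illustration only.
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")             # "- x", "* x", "1. x", "2) x"
TABLE_ROW = re.compile(r"\S(?: {2,}|\t)\S.*(?: {2,}|\t)\S")       # at least two column gaps


def classify_line(line: str) -> str:
    """Guess a block type for one line using only its text."""
    text = line.strip()
    if not text:
        return "blank"
    if BULLET.match(line):
        return "list_item"
    if TABLE_ROW.search(text):
        return "table_row"
    words = text.split()
    looks_like_title = text.isupper() or text.istitle()
    if len(words) <= 8 and looks_like_title and not text.endswith((".", ",", ";", ":")):
        return "header"
    return "paragraph"


for sample in ["Getting Started", "- first item", "Name    Qty    Price", "This parser reads plain text."]:
    print(f"{classify_line(sample):10} | {sample}")
```

A real parser would also need context across neighbouring lines (for example, consistent column positions over several rows), which this single-line sketch deliberately ignores.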
### DOCX, PPTX and any other format supported by Apache Tika
There are two ways to process these types of documents
- html output from tika for these file types is used and parsed by the html parser

## Nlm Modified Tika
-Nlm modified version of Tika can be found here https://github.com/nlmatics/nlm-tika
+Nlm modified version of Tika can be found in the 2.4.1-nlm branch here https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm
For convenience, a compiled jar file of the code is included in this repo in jars/ folder.
In some cases, your PDFs may result in errors in the Java server and you will need to modify the code there to resolve the issue and recompile the jar file.

@@ -46,7 +47,12 @@ In some cases, your PDFs may result in errors in the Java server and you will ne
python -m nlm_ingestor.ingestion_daemon
```
### Run the docker file
+A docker image is available via github container registry. Before running the following code, you may need to authenticate with docker first
+cat ~/TOKEN.txt | docker login https://ghcr.io -u USERNAME --password-stdin
+where TOKEN.txt is the token you create as described here: https://docs.github.com/en/[email protected]/packages/working-with-a-github-packages-registry/working-with-the-docker-registry
+
```
+docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run nlm-ingestor-<version>
```
### Test the ingestor server
7 changes: 0 additions & 7 deletions kustomization.yaml

This file was deleted.
