CORE: Sophisticated PDFReader with Image and Table extraction #127

janaka · 2023-10-16T10:48:00Z

Current

The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently only does simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to grab the text for the document.
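For reference, the naive approach described above looks roughly like the sketch below. This is not the LlamaIndex source; `extract_page_texts` and `load_pdf_text` are hypothetical helper names used only for illustration. The only pypdf calls assumed are `PdfReader(path)`, `reader.pages` and `page.extract_text()`.

```python
from typing import Iterable, List


def extract_page_texts(pages: Iterable) -> List[str]:
    """Return the plain text of each page, in order.

    Anything that is not part of the text layer (images, table layout,
    metadata, heading structure) is lost at this point.
    """
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    return [page.extract_text() or "" for page in pages]


def load_pdf_text(path: str) -> List[str]:
    """Hypothetical wrapper: open a PDF with pypdf and extract text page by page."""
    # Imported here so the helper above stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_page_texts(reader.pages)
```

Calling `load_pdf_text("some.pdf")` gives one string per page, which is essentially all the downstream index gets to see today.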

The following are ignored:

Images
Tables
PDF Metadata
Document structure such as headings

We should be able to improve retrieval by extracting the information present in these components.
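As a starting point, two of the ignored items (PDF metadata and images) can be reached through pypdf itself. The sketch below is only an illustration under assumptions: `describe_pdf` is a hypothetical helper, and it assumes a recent pypdf release where `reader.metadata` exposes `title` and `author` and each page has an `images` property whose items carry a `name`. Tables and heading structure are not covered here; pypdf has no table extraction, so those would need another tool.

```python
from typing import Any, Dict


def describe_pdf(reader: Any) -> Dict[str, Any]:
    """Collect text, document metadata and image names from a pypdf-style reader.

    `reader` is expected to behave like pypdf.PdfReader (assumption):
    it has `.metadata` (possibly None) and `.pages`.
    """
    meta = reader.metadata  # may be None when the PDF carries no info dictionary
    result: Dict[str, Any] = {
        "title": getattr(meta, "title", None),
        "author": getattr(meta, "author", None),
        "pages": [],
    }
    for number, page in enumerate(reader.pages, start=1):
        result["pages"].append(
            {
                "page": number,
                "text": page.extract_text() or "",
                # Only names are kept; the image bytes would be passed on
                # to an image reader or captioning step in a real pipeline.
                "image_names": [image.name for image in page.images],
            }
        )
    return result
```

With pypdf installed this would be called as `describe_pdf(PdfReader("some.pdf"))`. The per-page dictionaries could then be attached as metadata on the documents handed to the index.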

Solution

Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex image readers.

Alternatives

Use readers from Unstructured.io

janaka · 2023-10-23T13:11:47Z

Also look into LayoutPDFReader by LLMSherpa
