Overview
Implement a PDF parser that extracts document structure, metadata, and text while skipping binary image data.
Parent Epic
Part of #91 - Document & Office Format Awareness
Description
Parse PDF structure (objects, streams, cross-reference tables) and extract meaningful strings from metadata, annotations, bookmarks, and text streams.
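As a rough illustration of where cross-reference parsing starts: a PDF file ends with a `startxref` keyword followed by the byte offset of the last cross-reference section. The sketch below locates that offset using only the standard library. The function name `find_startxref` and the 1024-byte tail window are assumptions made for this example, not part of lopdf or the pdf crate, which handle this step internally.

```rust
/// Returns the byte offset that follows the last `startxref` keyword,
/// or `None` if the keyword or its number is missing.
/// Hypothetical helper: only the final 1024 bytes are searched (an assumed limit).
pub fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEY: &[u8] = b"startxref";

    // Restrict the search to the tail of the file, where the keyword normally lives.
    let tail_start = data.len().saturating_sub(1024);
    let tail = &data[tail_start..];

    // Find the last occurrence of the keyword in the tail.
    let pos = tail.windows(KEY.len()).rposition(|w| w == KEY)?;
    let rest = &tail[pos + KEY.len()..];

    // Skip whitespace, then collect the decimal digits of the offset.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();

    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

A real parser would then seek to that offset and read either a classic `xref` table or a cross-reference stream. Damaged files may need a fallback that scans for objects directly.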
Implementation Details
- Use the lopdf or pdf crate
- Parse PDF object structure
- Extract document info dictionary (Title, Author, Subject, Keywords, Creator, Producer)
- Parse catalog and page tree
- Extract text from content streams
- Identify and skip image streams
- Parse annotations and form fields
- Extract JavaScript from actions
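To make the "extract text from content streams" step more concrete, here is a minimal sketch that collects the literal strings `( ... )` passed to text-showing operators such as `Tj` and `TJ`. It assumes the stream has already been decompressed (for example by the chosen crate), and it ignores hex strings, comments, inline images, and font encodings. The function name `extract_literal_strings` is hypothetical and uses only the standard library.

```rust
/// Collects literal strings `( ... )` from an already-decoded content stream.
/// Handles balanced nested parentheses and backslash escapes.
/// Bytes are converted lossily as UTF-8; real text needs the font's encoding.
pub fn extract_literal_strings(content: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < content.len() {
        if content[i] != b'(' {
            i += 1;
            continue;
        }
        i += 1; // skip the opening parenthesis
        let mut depth = 1usize;
        let mut buf: Vec<u8> = Vec::new();
        while i < content.len() {
            let b = content[i];
            i += 1;
            match b {
                b'\\' => {
                    if i >= content.len() {
                        break;
                    }
                    let e = content[i];
                    i += 1;
                    match e {
                        b'n' => buf.push(b'\n'),
                        b'r' => buf.push(b'\r'),
                        b't' => buf.push(b'\t'),
                        b'b' => buf.push(0x08),
                        b'f' => buf.push(0x0C),
                        // Octal escape: one to three octal digits.
                        b'0'..=b'7' => {
                            let mut v = (e - b'0') as u32;
                            let mut n = 1;
                            while n < 3 && i < content.len() && (b'0'..=b'7').contains(&content[i]) {
                                v = v * 8 + (content[i] - b'0') as u32;
                                i += 1;
                                n += 1;
                            }
                            buf.push((v & 0xFF) as u8);
                        }
                        // Backslash followed by end-of-line is a line continuation.
                        b'\n' => {}
                        b'\r' => {
                            if i < content.len() && content[i] == b'\n' {
                                i += 1;
                            }
                        }
                        // Covers \( \) \\ and any unknown escape.
                        other => buf.push(other),
                    }
                }
                b'(' => {
                    depth += 1;
                    buf.push(b);
                }
                b')' => {
                    depth -= 1;
                    if depth == 0 {
                        break;
                    }
                    buf.push(b);
                }
                _ => buf.push(b),
            }
        }
        // An unterminated string at end of input is still emitted as-is.
        out.push(String::from_utf8_lossy(&buf).into_owned());
    }
    out
}
```

For the input `BT /F1 12 Tf (Hello \(PDF\)) Tj ET` this yields the single string `Hello (PDF)`. Image streams never reach this function if they are filtered out beforehand by checking the stream dictionary for `/Subtype /Image`.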
String Sources
- Document metadata (Title, Author, Subject, Keywords, Creator, Producer)
- Bookmark titles
- Annotation text
- Form field names and values
- Font names
- JavaScript code
- Hyperlink URLs
- Named destinations
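One possible way to carry these sources through the pipeline is to tag each extracted string with where it came from. The sketch below is only a suggestion: the names `StringSource`, `ExtractedString`, and `group_by_source` are hypothetical and do not come from an existing crate or from this project's code.

```rust
use std::collections::BTreeMap;

/// Where an extracted string came from (mirrors the list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StringSource {
    Metadata,
    Bookmark,
    Annotation,
    FormField,
    FontName,
    JavaScript,
    Url,
    NamedDestination,
}

/// A string together with its origin in the document.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtractedString {
    pub source: StringSource,
    pub value: String,
}

/// Groups trimmed values by source, dropping strings that are empty after trimming.
pub fn group_by_source(items: &[ExtractedString]) -> BTreeMap<StringSource, Vec<&str>> {
    let mut grouped: BTreeMap<StringSource, Vec<&str>> = BTreeMap::new();
    for item in items {
        let value = item.value.trim();
        if value.is_empty() {
            continue;
        }
        grouped.entry(item.source).or_default().push(value);
    }
    grouped
}
```

Keeping the source alongside the value lets downstream consumers treat, say, JavaScript code differently from bookmark titles without re-parsing the file.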
Acceptance Criteria
Test Cases
- Simple text PDFs
- PDFs with images
- PDFs with forms
- PDFs with JavaScript
- Encrypted PDFs
- Large PDFs (>100MB)
Related
Project: #76