Upgrade Apache PDFBox to 3.x #4449

Open
buchen opened this issue Jan 3, 2025 · 0 comments

buchen commented Jan 3, 2025

PP uses Apache PDFBox to extract a plain-text representation of a PDF document for further processing.

Apache PDFBox has since been released in version 3.0.3. We plan to use this issue to track the update of the library within PP.

Problem statement:

The process of importing works like this:

PDF ---(PDFBox)--> Plain Text ---(regex)--> Transactions
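
As a rough illustration of the second stage, here is a minimal sketch of a regex-based parser working on already extracted text. The class name `RegexStageSketch`, the pattern, and the sample wording are hypothetical and not taken from any PP importer; the point is only that the expressions match the exact text layout, so even a small change in whitespace makes a match fail.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for one importer; real importers use bank-specific expressions.
final class RegexStageSketch {

    // Matches a single line such as "Buy 10 shares of ACME Corp" (assumed sample wording).
    private static final Pattern BUY = Pattern.compile(
            "^Buy (?<shares>\\d+) shares of (?<name>.+)$", Pattern.MULTILINE);

    // Returns a simple description of the transaction, or empty if no line matches.
    static Optional<String> parse(String text) {
        Matcher m = BUY.matcher(text);
        return m.find()
                ? Optional.of(m.group("shares") + " x " + m.group("name"))
                : Optional.empty();
    }
}
```

If a different extractor version emitted two spaces where the pattern expects one, `parse` would return an empty result even though the PDF itself is unchanged.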

The challenge is that we have almost no test cases that take the PDF as input. Why? Because users want to share only anonymized content: we generate the text in the desktop application, let the user anonymize it, and then the user shares the plain text. That means a) we cannot test whether the new version of PDFBox creates the same text output, and b) we could break many importers, which would require new sample files to fix the code.

Options:

  • Create a distribution of PP that contains both PDFBox versions: the old and the new (latest) version
  • Attempt to import the PDF using the new PDFBox version
  • If that fails, attempt to import the PDF using the old PDFBox version
  • When creating a debug text document, use the new PDFBox version (--> collect examples in the new version's output format)
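
The fallback described in the list above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not PP's actual code: `TextExtractor` and `FallbackImporter` are hypothetical names, each `TextExtractor` is assumed to wrap one PDFBox version, and the `parser` function stands in for the regex-based importers.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical wrapper around one PDFBox version: PDF bytes in, plain text out.
interface TextExtractor {
    String extract(byte[] pdf) throws Exception;
}

// Tries each extractor in order and returns the first successful parse result.
final class FallbackImporter<T> {
    private final List<TextExtractor> extractors;        // ordered: new version first, old version last
    private final Function<String, Optional<T>> parser;  // stand-in for the regex-based importers

    FallbackImporter(List<TextExtractor> extractors, Function<String, Optional<T>> parser) {
        this.extractors = extractors;
        this.parser = parser;
    }

    Optional<T> importPdf(byte[] pdf) {
        for (TextExtractor extractor : extractors) {
            try {
                Optional<T> result = parser.apply(extractor.extract(pdf));
                if (result.isPresent()) {
                    return result;
                }
            } catch (Exception e) {
                // Extraction failed with this version; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Keeping the extractors behind a small interface means the importers never reference PDFBox classes directly, which fits the idea of isolating each PDFBox version in its own bundle.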

Technical considerations:

  • We must extract the PDFBox dependencies into separate bundles so that one bundle depends on the old PDFBox version and the other on the new one
Status: In Progress