PP uses Apache PDFBox to extract a text representation of the PDF document for further processing.
Apache PDFBox is now available in version 3.0.3. We plan to use this issue to track the update of the library within PP.
Problem statement:
The process of importing works like this:
PDF ---(PDFBox)--> Plain Text ---(regex)--> Transactions
The challenge is that we have almost no test cases that take the PDF as input. Why? Because users want to provide anonymous content, we generate the text in the desktop application, let the user anonymize it, and then have them share the plain text. That means a) we cannot test whether the new version of PDFBox creates the same text output, and b) we potentially break many importers, which would then require new sample files to fix the code.
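To illustrate point b), here is a minimal sketch of why a small change in the extracted text can break an importer. The pattern, the class name `RegexSensitivity`, and the sample lines are hypothetical and not taken from any actual PP importer; the only assumption is that importers match the extracted text line by line with regular expressions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSensitivity {
    // Hypothetical pattern for illustration only; real importers use their own expressions.
    static final Pattern BUY = Pattern.compile("^Buy (?<shares>\\d+) shares of (?<name>.+)$");

    /** Returns the number of shares if the line matches, or null if the pattern does not match. */
    static String extractShares(String line) {
        Matcher matcher = BUY.matcher(line);
        return matcher.matches() ? matcher.group("shares") : null;
    }

    public static void main(String[] args) {
        // Text as one extractor version might produce it
        System.out.println(extractShares("Buy 10 shares of ACME"));
        // Same PDF content, but with one additional space in the extracted text
        System.out.println(extractShares("Buy 10  shares of ACME"));
    }
}
```

A single extra space is enough for the second line to no longer match, and without the original PDF there is no way to tell from the existing text samples whether the new version introduces such differences.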
Options:
Create a distribution of PP that contains both PDFBox versions: the old and the new (latest) version
Attempt to import the PDF using the new PDFBox version
If that fails, attempt to import the PDF using the old PDFBox version
When creating a debug text document, use the new PDFBox version (--> collect examples in new version)
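The fallback described above could be sketched roughly as follows. This is only an outline under stated assumptions: `TextExtractor` and `Importer` are hypothetical interfaces standing in for the two PDFBox-backed bundles and the regex-based importers, and the sketch treats both an exception and an empty result as "import failed".

```java
import java.util.List;
import java.util.Optional;

public class FallbackImport {
    /** Hypothetical wrapper around one PDFBox version: PDF bytes in, plain text out. */
    public interface TextExtractor {
        String extract(byte[] pdf) throws Exception;
    }

    /** Hypothetical importer: returns the parsed transactions, or empty if nothing matched. */
    public interface Importer {
        Optional<List<String>> parse(String text);
    }

    /**
     * Tries the extractors in the given order (new version first, then old)
     * and returns the first successful import result.
     */
    public static Optional<List<String>> importPdf(byte[] pdf, Importer importer, TextExtractor... extractors) {
        for (TextExtractor extractor : extractors) {
            try {
                Optional<List<String>> result = importer.parse(extractor.extract(pdf));
                if (result.isPresent()) {
                    return result;
                }
            } catch (Exception e) {
                // Extraction failed with this version; continue with the next one.
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass the extractor backed by the new version first and the one backed by the old version second, for example `FallbackImport.importPdf(bytes, importer, newExtractor, oldExtractor)`, where both extractor names are placeholders.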
Technical considerations:
We must extract the PDFBox dependencies into separate bundles so that each bundle (old and new) can have a dependency on its own PDFBox version (old and new, respectively)