
Introduce tests for redaction to make changes easier #977

Draft
wants to merge 2 commits into
base: main

Blackoverflow (Contributor) commented Mar 23, 2025

Adding tests in preparation to fix #316

In preparation for fixing #316 ('PDFs get way too large when redacting'), I propose adding tests for the redaction logic.
This makes it possible to execute that logic without running the full application, which shortens the feedback time.

Step 1:
A first characterization test, testing the redaction behavior with empty instructions.
Step 2:
A second characterization test, testing the redaction of the right half of the PDF page.
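
To illustrate what these two steps pin down, here is a minimal sketch of such characterization tests. It does not use the project's real function: `redact_page`, the pixel-grid page model, and the rectangle format are hypothetical stand-ins chosen so the example runs without any PDF tooling.

```python
import unittest

WHITE, BLACK = 255, 0


def redact_page(pixels, rectangles):
    """Hypothetical stand-in for the real redaction function.

    `pixels` is a list of rows of grayscale values. Each rectangle is
    (left, top, right, bottom) in pixel coordinates, with right and
    bottom exclusive. Returns a new grid; the input is left untouched.
    """
    result = [row[:] for row in pixels]
    for left, top, right, bottom in rectangles:
        for y in range(top, bottom):
            for x in range(left, right):
                result[y][x] = BLACK
    return result


class RedactionCharacterizationTest(unittest.TestCase):
    def setUp(self):
        # A tiny "page", 4 pixels wide and 2 pixels high, entirely white.
        self.page = [[WHITE] * 4 for _ in range(2)]

    def test_empty_instructions_leave_page_unchanged(self):
        # Step 1: no instructions means the output equals the input.
        self.assertEqual(redact_page(self.page, []), self.page)

    def test_right_half_is_blacked_out(self):
        # Step 2: one rectangle covering columns 2 and 3 of both rows.
        redacted = redact_page(self.page, [(2, 0, 4, 2)])
        self.assertEqual(redacted, [[WHITE, WHITE, BLACK, BLACK]] * 2)
```

Saved as a module, these run with `python -m unittest` and need neither the full application nor external tools.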

Please tell me if you approve this approach.

I also made a small refactoring that helped me get the function under test.

The current approach

Right now, for every page with redactions:

  1. The page is converted into an image.
  2. Redactions are drawn onto the image.
  3. The image is drawn as a JPG onto a PDF page, which replaces the original page.
  4. Invisible text is added on top of any text that is still visible, because that text is now part of an image and no longer selectable.

This approach is quite resource-hungry, in both CPU time and file size.

Thoughts on how to fix the file size problem:

An alternative approach must reliably remove any redacted content from the PDF and place black rectangles where the content was.

Placing the rectangles is the easy part:

The documentation of Canvas.drawImage states this:

Unlike drawInlineImage, this creates 'external images' which
are only stored once in the PDF file but can be drawn many times.
If you give it the same filename twice, even at different locations
and sizes, it will reuse the first occurrence, resulting in a saving
in file size and generation time. If you use ImageReader objects,
it tests whether the image content has changed before deciding
whether to reuse it.

Since drawImage() can handle scaling, every rectangle could be the same 1x1 px black image, which would then be stored only once in the file.
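
A minimal sketch of that idea, assuming `Canvas` refers to ReportLab's `reportlab.pdfgen.canvas.Canvas`, whose `drawImage(image, x, y, width=None, height=None, ...)` stretches the image to the given size. To keep the example runnable without ReportLab, it draws onto a hypothetical `RecordingCanvas` that only records the calls. The helper `draw_redaction_boxes`, the file name `black_1x1.png`, and the rectangle values are assumptions for illustration as well.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingCanvas:
    """Hypothetical stand-in that records drawImage calls instead of writing a PDF."""
    calls: list = field(default_factory=list)

    def drawImage(self, image, x, y, width=None, height=None):
        self.calls.append((image, x, y, width, height))


def draw_redaction_boxes(canvas, black_pixel, rectangles):
    """Draw every rectangle by stretching the same 1x1 black image.

    `rectangles` holds (x, y, width, height) tuples in PDF points,
    where (x, y) is the lower-left corner of the box.
    """
    for x, y, width, height in rectangles:
        # Same image every time; only position and size differ.
        canvas.drawImage(black_pixel, x, y, width=width, height=height)


canvas = RecordingCanvas()
draw_redaction_boxes(
    canvas,
    "black_1x1.png",  # hypothetical path to a 1x1 px black image
    [(72, 700, 200, 14), (72, 650, 120, 14)],
)
```

With a real canvas in place of the recorder, both boxes would reference the same file name, so according to the quoted text the image data should be embedded only once.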

The hard part is removing the content.
It must not merely be covered, but completely removed from the PDF.
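
The difference between covering and removing can be sketched on a simplified model. This is not how a PDF content stream is actually edited: `Box`, `TextRun`, and `remove_redacted_runs` are hypothetical names, and a real page also contains images, vector graphics, and text runs that only partly overlap a redaction. The sketch only shows the intended outcome, namely that redacted text no longer exists in the data at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    bottom: float
    right: float
    top: float

    def intersects(self, other):
        # True when the two boxes share any area.
        return (self.left < other.right and other.left < self.right
                and self.bottom < other.top and other.bottom < self.top)


@dataclass(frozen=True)
class TextRun:
    text: str
    box: Box


def remove_redacted_runs(runs, redactions):
    """Keep only the text runs that do not touch any redaction box."""
    return [
        run for run in runs
        if not any(run.box.intersects(redaction) for redaction in redactions)
    ]


# Made-up example data: a label followed by a value that must be redacted.
runs = [
    TextRun("Name:", Box(72, 700, 110, 712)),
    TextRun("Jane Doe", Box(115, 700, 170, 712)),
]
redactions = [Box(112, 698, 175, 714)]
remaining = remove_redacted_runs(runs, redactions)
```

Here `remaining` contains only the "Name:" run. A black rectangle drawn afterwards is purely visual, since the redacted text is no longer in the data.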

PS: It seems the pipeline can't execute pdf_utils.get_image_from_pdf_page() because that function depends on pdftoppm. Well, not my focus right now. :)

Blackoverflow added 2 commits March 23, 2025 12:47
A first characterization test, testing the redaction behaviour with empty instructions.
A second characterization test, testing the redaction of the right halve of the pdf page.