Introduce tests for redaction to make changes easier #977

Blackoverflow · 2025-03-23T11:53:21Z

Adding tests in preparation for fixing #316

In preparation for fixing #316 'PDFs get way too large when redacting', I propose adding tests for the redaction logic.
This makes it possible to run that logic without starting the full application, which shortens the feedback loop.

Step 1:
A first characterization test that pins down the redaction behavior with empty instructions.
Step 2:
A second characterization test that pins down the redaction of the right half of the PDF page.

Please tell me if you approve this approach.

I made a small refactoring that helped me to get the function under test.
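
A minimal sketch of the shape such characterization tests could take. `redact_page` is a hypothetical stand-in that works on a grid of grayscale values, not the project's real function, and the `(left, top, right, bottom)` box format is an assumption. The point is only that each test pins down what the code currently does for one input.

```python
BLACK = 0
WHITE = 255


def redact_page(pixels, boxes):
    """Hypothetical stand-in for the function under test.

    pixels: list of rows of grayscale values.
    boxes: list of (left, top, right, bottom) in pixels, right/bottom exclusive.
    Returns a new grid with every box painted black; the input is not modified.
    """
    result = [row[:] for row in pixels]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                result[y][x] = BLACK
    return result


def make_page():
    # A tiny 4x2 all-white "page" keeps the expected values readable.
    return [[WHITE] * 4 for _ in range(2)]


def test_empty_instructions_leave_page_unchanged():
    page = make_page()
    assert redact_page(page, []) == page


def test_right_half_is_blacked_out():
    redacted = redact_page(make_page(), [(2, 0, 4, 2)])
    assert redacted == [[WHITE, WHITE, BLACK, BLACK]] * 2
```

Functions named `test_*` with plain `assert` statements can be collected by pytest without any further setup.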

The current approach

Right now, for every page with redactions:

The page is converted into an image
Redactions are drawn onto the image
The image is drawn as a JPEG onto a PDF page, which replaces the original page.
Invisible text is added on top of any text that is still visible, since that text is now part of the image and can no longer be selected.

This approach is quite resource-hungry, both in CPU time and in file size.
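
A rough back-of-the-envelope calculation illustrates the cost of rasterizing a page. The A4 page size and the 200 DPI resolution are assumptions chosen for illustration, not values taken from the code, and the final file size depends on how well the JPEG compresses.

```python
POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch


def raster_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return (round(width_pt / POINTS_PER_INCH * dpi),
            round(height_pt / POINTS_PER_INCH * dpi))


def raw_rgb_bytes(width_px, height_px):
    """Size of the uncompressed 8-bit RGB bitmap held in memory."""
    return width_px * height_px * 3


A4_PT = (595.28, 841.89)  # assumed page size in points
width_px, height_px = raster_size(*A4_PT, dpi=200)  # assumed resolution
print(f"{width_px} x {height_px} px, "
      f"{raw_rgb_bytes(width_px, height_px) / 1e6:.1f} MB uncompressed")
```

Every redacted page pays this cost in full, no matter how small the redacted area is.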

Thoughts on how to fix the file size problem:

An alternative approach must reliably remove any redacted content from the pdf and place black rectangles where the content was.

Placing the rectangles is the easy part:

Canvas.drawImage states this:

Unlike drawInlineImage, this creates 'external images' which
are only stored once in the PDF file but can be drawn many times.
If you give it the same filename twice, even at different locations
and sizes, it will reuse the first occurrence, resulting in a saving
in file size and generation time. If you use ImageReader objects,
it tests whether the image content has changed before deciding
whether to reuse it.

Since drawImage() can handle scaling, every rectangle could be the same 1x1px black image.
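
A minimal sketch of that idea, assuming the ReportLab-style call `drawImage(image, x, y, width=..., height=...)`. The names `draw_redaction_boxes` and `RecordingCanvas`, the file name `black_1x1.png`, and the `(x, y, width, height)` box format in PDF points are hypothetical. `RecordingCanvas` only stands in for a real canvas so the sketch runs without a PDF library.

```python
def draw_redaction_boxes(canvas, black_pixel, boxes):
    """Draw one stretched copy of the same 1x1 black image per box.

    canvas: any object with a ReportLab-style drawImage method.
    black_pixel: the shared image (file name or ImageReader), reused for all boxes.
    boxes: iterable of (x, y, width, height) in points, origin at bottom-left.
    """
    for x, y, width, height in boxes:
        # width/height stretch the single pixel to cover the whole box.
        canvas.drawImage(black_pixel, x, y, width=width, height=height)


class RecordingCanvas:
    """Stand-in that records calls instead of writing a PDF."""

    def __init__(self):
        self.calls = []

    def drawImage(self, image, x, y, width=None, height=None):
        self.calls.append((image, x, y, width, height))


canvas = RecordingCanvas()
draw_redaction_boxes(
    canvas,
    "black_1x1.png",  # hypothetical file name
    [(300, 0, 295, 842), (50, 700, 120, 14)],
)
print(len(canvas.calls), "rectangles drawn from one image")
```

With a real canvas, the quoted documentation suggests the image data would be stored once, so each further rectangle would add only a draw operation.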

The hard part is to remove the content.
It must not only be covered, but completely removed from the pdf.
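
One way a test could guard this requirement is to search the output file for text that should be gone. The sketch below is a heuristic under stated assumptions: it looks only at the raw bytes and at Flate-compressed streams, and it assumes the text is stored as a literal byte string, which often does not hold (hex strings, subset fonts, other filters). A hit shows that the content survived; the absence of a hit proves nothing. `contains_text` is a hypothetical helper, not part of the project.

```python
import re
import zlib

# Matches the body of every "stream ... endstream" section in the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def contains_text(pdf_bytes, needle):
    """Return True if needle occurs in the raw file or in any inflated stream."""
    if needle in pdf_bytes:
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the trailing end-of-line before "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (for example a JPEG image); skip it
        if needle in data:
            return True
    return False


# Tiny hand-made fragment, not a complete PDF, just enough to exercise the helper.
fragment = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"BT (top secret) Tj ET")
    + b"\nendstream\nendobj\n"
)
print(contains_text(fragment, b"top secret"))
```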

PS: It seems the pipeline can't execute pdf_utils.get_image_from_pdf_page() because it depends on pdftoppm. Well, that's not my focus right now. :)

Blackoverflow added 2 commits March 23, 2025 12:47

add redaction characterization test

6053b92

A first characterization test, testing the redaction behaviour with empty instructions.

add second redaction characterization test

a5e7728

A second characterization test, testing the redaction of the right half of the PDF page.
