[Feature]: Allow screenshot archival to be optionally disabled #1882

Open
HeliosLHC opened this issue Jun 24, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@HeliosLHC

What change would you like to see?

Screenshotting was enabled by default in #1518 for crawlers.

To reduce storage consumption, a configurable flag in the Helm chart or UI to disable writing screenshots to disk would be preferable.

Context

Disabling screenshots from being written will reduce disk usage for crawls that do not require screenshots.

@HeliosLHC HeliosLHC added the enhancement New feature or request label Jun 24, 2024
@tw4l
Member

tw4l commented Jun 26, 2024

Hi @HeliosLHC, we could look into making this optional in the Helm chart. However, screenshots and extracted text are both necessary for our QA features, so we'd have to find a way to make it clear to users that changing these settings would adversely affect Quality Assurance.

@ikreymer
Member

The screenshot PNGs are 100K-300K (if even that) in size, so their total is generally negligible relative to the overall size of a crawl. The time it takes to capture them is also very small. Disabling them will most likely not make much difference in storage or resource consumption, and it will affect the usability of other features. @HeliosLHC is there a particular issue you're trying to solve? How big are the screenshots compared to the rest of the crawl data?
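One way to answer that size question is to inspect the WACZ directly, since a WACZ is a ZIP file. The sketch below is a minimal example, not part of Browsertrix: `size_by_category` is a hypothetical helper, and the `screenshots` and `text` filename markers are assumptions about how the archive members are named, so adjust them to match your own output.

```python
import zipfile
from collections import defaultdict


def size_by_category(wacz_path, markers=("screenshots", "text")):
    """Sum compressed member sizes in a WACZ (a ZIP file), grouped by filename marker.

    `markers` are substrings ASSUMED to identify screenshot and text WARCs;
    any member matching none of them is counted as "other".
    """
    totals = defaultdict(int)
    with zipfile.ZipFile(wacz_path) as zf:
        for info in zf.infolist():
            name = info.filename.lower()
            # First matching marker wins; fall back to "other".
            category = next((m for m in markers if m in name), "other")
            totals[category] += info.compress_size
    return dict(totals)
```

Calling `size_by_category("crawl.wacz")` returns a dict of byte counts per category, which can be compared against the figures reported in this thread.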

@HeliosLHC
Author

Hi @tw4l and @ikreymer, my primary use case is to reduce the output size of crawls, especially for static, text-only sites.

In one of my crawls, the output WACZ contained 60 MB of crawl data, 30 MB of extracted text data, and 900 MB of screenshots (thumbnail + view), which is a 10x size increase. This was a barebones, text-heavy site with little to no images.
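As a quick check of those figures, here is a small sketch of the arithmetic. The function name `screenshot_overhead` is made up for illustration; the inputs are the sizes quoted above.

```python
def screenshot_overhead(crawl_mb, text_mb, screenshots_mb):
    """Return (overhead_ratio, share_of_total) for screenshots in a crawl.

    overhead_ratio: screenshot size relative to crawl + text data.
    share_of_total: fraction of the whole output taken up by screenshots.
    """
    base_mb = crawl_mb + text_mb
    total_mb = base_mb + screenshots_mb
    return screenshots_mb / base_mb, screenshots_mb / total_mb


overhead, share = screenshot_overhead(crawl_mb=60, text_mb=30, screenshots_mb=900)
print(f"screenshots are {overhead:.1f}x the other data, {share:.0%} of the output")
```

With these numbers the screenshots are 10x the size of the crawl and text data combined, so the full 990 MB output is 11x what it would be without them.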

For more image- or media-heavy sites, I would expect this ratio to be lower (< 5x), as the actual crawl data becomes a much larger proportion of the output. In those scenarios, the additional overhead of the screenshots is not as significant.

I don't mind generation of thumbnails/views as they are useful for monitoring crawls, but an option to disable writing them to WACZ files would be useful.

@Shrinks99
Member

Shrinks99 commented Jul 1, 2024

I am also hesitant to offer this option (especially on the user-accessible side, on a per-crawl basis). Screenshots will play an increasingly important role in our UI offering rich page previews as we continue to develop ReplayWeb.page and collections within Browsertrix.

Perhaps there is room to improve how we write the PNGs, using OxiPNG or similar?
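To gauge what OxiPNG might save, a rough experiment could run the `oxipng` command-line tool over a few extracted screenshots and compare sizes. This is only a sketch, assuming the `oxipng` binary is installed and on `PATH`. The helper names are hypothetical and this is not how Browsertrix writes screenshots today. OxiPNG is a lossless optimizer, so the gain depends heavily on the image content.

```python
import shutil
import subprocess
from pathlib import Path


def oxipng_command(png_path, level=4):
    """Build an oxipng invocation that optimizes `png_path` in place."""
    # -o sets the optimization level; --strip safe drops non-critical metadata.
    return ["oxipng", "-o", str(level), "--strip", "safe", str(png_path)]


def optimize_png(png_path, level=4):
    """Run oxipng on one file and return (bytes_before, bytes_after).

    Raises FileNotFoundError if the oxipng binary is not on PATH.
    """
    if shutil.which("oxipng") is None:
        raise FileNotFoundError("oxipng binary not found on PATH")
    path = Path(png_path)
    before = path.stat().st_size
    subprocess.run(oxipng_command(path, level), check=True)
    return before, path.stat().st_size
```

Running `optimize_png` on a copy of a real screenshot gives a before/after byte count, which is enough to judge whether the approach is worth pursuing.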

Projects
Status: Triage
Development

No branches or pull requests

4 participants