[Feature]: Allow screenshot archival to be optionally disabled #1882
Comments
Hi @HeliosLHC, we could look into making this optional in the Helm chart. However, screenshots and extracted text are both necessary for our QA features, so we'd have to find a way to make it clear to users that changing these settings would have an adverse effect on Quality Assurance.
The screenshot PNGs are 100K-300K each (if even that), so their total size is generally negligible relative to the overall size of a crawl. The time it takes to capture them is also very small. Disabling them will most likely not make much difference in storage or resource consumption, and it will affect the usability of other features. @HeliosLHC is there a particular issue you're trying to solve? How big are the screenshots compared to the rest of the crawl data?
Hi @tw4l and @ikreymer, my primary use case is to reduce the output size of crawls, especially for static, text-only sites. In one of my crawls, the output WACZ contained 60 MB of crawl data + 30 MB of extracted text data and 900 MB of screenshots (thumbnail + view), so the screenshots are roughly 10x the size of everything else. This was a barebones, text-heavy site with few to no images. For more image/media-heavy sites, I would expect this ratio to be lower (< 5x), as the actual crawl data becomes a much larger proportion of the output. In those scenarios, the additional overhead of the screenshots is not as significant. I don't mind the generation of thumbnails/views, as they are useful for monitoring crawls, but an option to disable writing them to WACZ files would be useful.
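For reference, the arithmetic behind the figures in the comment above can be sketched as follows. This is only an illustration using the three sizes quoted in that comment; the function name `size_breakdown` and its field names are hypothetical and not part of Browsertrix.

```python
def size_breakdown(crawl_mb, text_mb, screenshots_mb):
    """Summarize how much of a WACZ's size is due to screenshots.

    All inputs are sizes in MB. The names here are illustrative only.
    """
    base = crawl_mb + text_mb          # output without screenshots
    total = base + screenshots_mb      # output with screenshots
    return {
        "base_mb": base,
        "total_mb": total,
        # screenshots relative to the rest of the output
        "screenshots_to_base": screenshots_mb / base,
        # overall growth factor caused by including screenshots
        "total_to_base": total / base,
        # fraction of the final output taken up by screenshots
        "screenshot_share": screenshots_mb / total,
    }


# Figures quoted in the comment: 60 MB crawl data, 30 MB text, 900 MB screenshots
breakdown = size_breakdown(60, 30, 900)
print(breakdown)
```

With these inputs the screenshots are 10x the size of the remaining data, and the WACZ as a whole is 11x larger than it would be without them.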
I am also hesitant to offer this option (especially on the user-accessible side, on a per-crawl basis). Screenshots will play an increasingly important role in our UI, offering rich page previews as we continue to develop ReplayWeb.page and collections within Browsertrix. Perhaps there is some room for improvement in how we write the PNGs, using OxiPNG or similar?
What change would you like to see?
Screenshotting was enabled by default in #1518 for crawlers.
To reduce storage consumption, I would like a configurable flag in the Helm chart or UI that disables writing screenshots to disk.
Context
Preventing screenshots from being written would reduce disk usage for crawls that do not require them.