Detect duplicated pages by visually comparing them #3

stefnotch · 2022-11-30T22:06:25Z

Someone finally sent me some PDFs that have duplicated pages where the page metadata got lost.

Here, the best way of identifying duplicates would probably be:

Comparing the text (the easy one, I think)
Comparing the visual output, preferably by checking whether a lot of pixels have become either darker (usual slides: white background, dark foreground) or lighter (dark theme slides). This is rather slow. (A library like https://github.com/mapbox/pixelmatch could be used for this.)
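
A rough sketch of what the darker/lighter check could look like, assuming both pages have already been rendered to RGBA buffers of the same size. It does not use pixelmatch itself. The names `PageImage`, `countBrightnessChanges` and `isLikelyBuildUp`, plus both threshold values, are made up for illustration and would need tuning on real slides.

```typescript
// Hypothetical shape: a rendered page as raw RGBA bytes (4 bytes per pixel).
interface PageImage {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

interface BrightnessChanges {
  darker: number; // pixels that are clearly darker on the second page
  lighter: number; // pixels that are clearly lighter on the second page
  total: number; // total number of pixels compared
}

// Approximate brightness of the pixel starting at byte offset i (Rec. 601 luma weights).
function luma(data: Uint8ClampedArray, i: number): number {
  const r = data[i] ?? 0;
  const g = data[i + 1] ?? 0;
  const b = data[i + 2] ?? 0;
  return 0.299 * r + 0.587 * g + 0.114 * b;
}

// Count how many pixels got darker or lighter going from page `a` to page `b`.
// `threshold` (assumed value, 0-255 scale) ignores small differences such as anti-aliasing noise.
function countBrightnessChanges(a: PageImage, b: PageImage, threshold = 32): BrightnessChanges {
  if (a.width !== b.width || a.height !== b.height) {
    throw new Error("Pages must be rendered at the same size");
  }
  let darker = 0;
  let lighter = 0;
  const total = a.width * a.height;
  for (let p = 0; p < total; p++) {
    const delta = luma(b.data, p * 4) - luma(a.data, p * 4);
    if (delta < -threshold) darker++;
    else if (delta > threshold) lighter++;
  }
  return { darker, lighter, total };
}

// Heuristic: if nearly all changes go in one direction, page `b` probably just
// adds content on top of page `a` (or the pages are identical).
// `maxOppositeRatio` is an assumed tolerance, as a fraction of all pixels.
function isLikelyBuildUp(a: PageImage, b: PageImage, maxOppositeRatio = 0.001): boolean {
  const { darker, lighter, total } = countBrightnessChanges(a, b);
  const opposite = Math.min(darker, lighter);
  return opposite / total <= maxOppositeRatio;
}
```

The idea is that a completely different slide changes pixels in both directions (old text disappears, new text appears), while a slide that only adds content changes them in one direction. Checking every pixel is still slow, so it might make sense to run this only on pages whose text comparison already looks promising.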