IntraArchiveDeduplicator

Tool for managing data-deduplication within extant compressed archive files, with a heavy focus on Manga/Comic-book archive files.

This is a rather exotic tool that is intended to allow fairly fast duplicate detection for files within compressed archives.

It maintains a database of hashes for all files it scans, and actually recurses into compressed archives to scan the files within the archives, which should allow detection of archives with duplicate contents, even if the archives are compressed using different compression algorithms.

There are also facilities for searching by image similarity, using a custom tree system.

The image similarity system runs as a server on top of an existing PostgreSQL server, as it is implemented in python (actually Cython, but basically python). It's fairly memory hungry. Currently, ~12M hashes takes about 5 GB of RAM, or ~1Kbyte/hash. There is some room for optimization here.

Theoretically, each hash should take 64*8 + 8 + (8 * number of IDs at each node) (+ a few housekeeping) bytes. However, right now, a number of the node attributes are stored as hashtables (the child-links, for example), so they do not take as much space as they theoretically will if every child pointer pointed to a actual valid node.

Right now, the scanning and DB maintenance functionality is largely functional, but the logic to actually do deduplication is very preliminary. My MangaCMS project already has some support for detecting when a newly downloaded file has entirely duplicated content, and the automatic removal of the new file to prevent further introduction of duplicates.

Dependencies:

PostgreSQL >= 9.3 (Possibly any >9.0?)
Psycopg2
Cython
RPyC
Colorama
python-sql

For PHashing:

Numpy
Scipy
Pillow

For Unit testing:

Coverage.py
Bitstring

There are fairly extensive unit tests for the DB API, as well as the BK-tree and the phashing systems. However, the great majority of the tests (all the DB API tests, which are 80%+ of them) require a local postgres instance, so they're not suitable for CI integration.

BK-Tree

The BK-tree implementation has been broken out into an independent library installable via pip. It's hosted here. Many thanks to user @gpip for doing the legwork making it portable.

TODO: Moved counter in CPPBKTree delete operation doesn't work.

Long filenames break hasher? Bad filename: "00PGw6sr1r7fBlHub52mIRCQ4Nd5jTn0n31_2B3HvgCJTHDJcBK3qV0H7k7gdTwYTiaowENq0D8vK0hBDL_5d88vcInWqPRs4H8GZQYHRlzrHWUYNKiD0QRoeEOz2AztX4nF8v0=w1600"

Name		Name	Last commit message	Last commit date
Latest commit History 182 Commits
Tests		Tests
cross_link		cross_link
deduplicator		deduplicator
inspector		inspector
interactive_tests		interactive_tests
perf-testing		perf-testing
scanner		scanner
server		server
wat		wat
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE.txt		LICENSE.txt
README.md		README.md
Vagrantfile		Vagrantfile
bootstrap.sh		bootstrap.sh
dbApi.py		dbApi.py
dbPhashApi.py		dbPhashApi.py
graphApi.py		graphApi.py
invertDb.py		invertDb.py
loader.py		loader.py
main.py		main.py
mainWeb.py		mainWeb.py
pArch.py		pArch.py
profile_bk.py		profile_bk.py
run.sh		run.sh
runDedupServer.py		runDedupServer.py
runGrind.sh		runGrind.sh
runTests.sh		runTests.sh
settings.base.py		settings.base.py
startServer.py		startServer.py
test_settings.py		test_settings.py
valgrind-python.supp		valgrind-python.supp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

IntraArchiveDeduplicator

BK-Tree

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

fake-name/IntraArchiveDeduplicator

Folders and files

Latest commit

History

Repository files navigation

IntraArchiveDeduplicator

BK-Tree

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages