Easy way to identify files which share some content/blocks

Thomas Keusch fwd+usenet-spam2011q2 at spam2011q2.bsd-solutions-duesseldorf.de
Sun May 1 11:39:48 PDT 2011


Hello,

Now that DragonFly's HAMMER has deduplication, I wonder whether there
is a simple way to identify "pairs" or groups of files which share a lot
of data, i.e. are mostly identical.

I have a rather large repository of downloaded pictures, which contains
a lot of dupes in multiple locations. Finding those exact duplicates is
no problem given some time and a shell prompt.

I'm interested in identifying broken files. Broken in the sense that
A is an incomplete version of B (some bytes missing), or B a damaged
version of A (some additional bytes at the end).
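
To make the "incomplete version" case concrete, here is a minimal Python
sketch that runs purely in userland and does not touch HAMMER at all. The
function name is_truncated_copy and the chunk size are assumptions made up
for illustration. It only tests whether one file is a strict byte-for-byte
prefix of another.

```python
import os


def is_truncated_copy(short_path, long_path, chunk_size=64 * 1024):
    """Return True if short_path's bytes are a strict prefix of long_path's.

    Hypothetical helper: covers only the "bytes missing at the end" case.
    """
    # A strict prefix must be shorter than the file it was cut from.
    if os.path.getsize(short_path) >= os.path.getsize(long_path):
        return False
    with open(short_path, "rb") as short_f, open(long_path, "rb") as long_f:
        while True:
            a = short_f.read(chunk_size)
            if not a:
                # Short file exhausted without a mismatch: it is a prefix.
                return True
            b = long_f.read(len(a))
            if a != b:
                return False
```

Calling is_truncated_copy("A", "B") answers "is A an incomplete B?", and
swapping the arguments answers the "B has additional bytes at the end" case.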

Is there a way to get to something like this:

"File A shares 1234 (98.3%) data blocks with file B"
"File A shares xxxx (xx.x%) data blocks with file C"

Getting a step closer helps too.
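
As a rough illustration of the kind of report meant above, here is a
userland sketch in Python. It does not read any HAMMER dedup metadata. It
simply hashes fixed-size blocks of both files and counts how many blocks of
the first file also occur somewhere in the second. The 64 KiB block size,
the choice of SHA-256, and the function names are assumptions made for this
example; the filesystem's real dedup granularity may differ.

```python
import hashlib

# Assumption: 64 KiB blocks; not taken from HAMMER's actual dedup block size.
BLOCK_SIZE = 64 * 1024


def block_digests(path, block_size=BLOCK_SIZE):
    """Return the list of SHA-256 digests of consecutive fixed-size blocks."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).digest())
    return digests


def shared_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Count blocks of A that also appear in B, plus the percentage of A."""
    a = block_digests(path_a, block_size)
    b = set(block_digests(path_b, block_size))
    shared = sum(1 for digest in a if digest in b)
    pct = 100.0 * shared / len(a) if a else 0.0
    return shared, pct


def report(path_a, path_b, block_size=BLOCK_SIZE):
    """Format the result in the style of the example lines above."""
    shared, pct = shared_blocks(path_a, path_b, block_size)
    return "File %s shares %d (%.1f%%) data blocks with file %s" % (
        path_a, shared, pct, path_b)
```

Because blocks are cut at fixed offsets, this catches truncated files and
files with extra bytes appended, but not data that was shifted by an
insertion in the middle. Comparing every pair in a large repository is
quadratic, so indexing the digests of all files first would be the obvious
next step.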

Thanks for any insights.


Regards
Thomas




