Easy way to identify files which share some content/blocks
    Thomas Keusch 
    fwd+usenet-spam2011q2 at spam2011q2.bsd-solutions-duesseldorf.de
       
    Sun May  1 11:39:48 PDT 2011
    
    
  
Hello,
Now that DragonFly's HAMMER has deduplication, I wonder whether there
is a simple way to identify pairs or groups of files which share a lot
of data, i.e. are mostly identical.
I have a rather large repository of downloaded pictures, which contains
a lot of exact duplicates in multiple locations. I have no problem
finding those, given some time and a shell prompt.
What I'm interested in is identifying broken files. Broken in the sense
that A is an incomplete version of B (some bytes missing), or B is a
damaged version of A (some additional bytes at the end).
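For the "incomplete version" case specifically, a rough userland sketch
(independent of HAMMER, and assuming the missing or extra bytes are only
at the end of the file) could measure how long the common prefix of two
files is. The function names and the chunk size are made up for
illustration:

```python
import os


def common_prefix_len(path_a, path_b, chunk_size=65536):
    """Return the number of leading bytes that are identical in both files."""
    matched = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            n = min(len(a), len(b))
            if n == 0:
                # At least one file is exhausted.
                return matched
            if a[:n] != b[:n]:
                # Locate the first differing byte inside this chunk.
                for i in range(n):
                    if a[i] != b[i]:
                        return matched + i
            matched += n
            if len(a) != len(b):
                # One file ended inside this chunk.
                return matched


def is_truncated_copy(short_path, long_path):
    """True if short_path is a strict byte-for-byte prefix of long_path."""
    short_size = os.path.getsize(short_path)
    if short_size >= os.path.getsize(long_path):
        return False
    return common_prefix_len(short_path, long_path) == short_size
```

If is_truncated_copy(A, B) is true, A is B with bytes missing at the
end, or equivalently B is A with extra bytes appended.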
Is there a way to get output like this:
"File A shares 1234 (98.3%) data blocks with file B"
"File A shares xxxx (xx.x%) data blocks with file C"
Anything that gets me a step closer would help too.
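To make the kind of report meant here concrete, this is a minimal
sketch done at the file level rather than through HAMMER itself. It
hashes fixed-size, offset-aligned blocks of each file and counts how
many blocks of A also occur somewhere in B. The 64 KiB block size, the
choice of SHA-1 and all function names are assumptions for
illustration and do not reflect how HAMMER's dedup works internally:

```python
import hashlib

BLOCK_SIZE = 65536  # assumed block size, not taken from HAMMER


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return a list with one SHA-1 digest per fixed-size block of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.sha1(block).digest())
    return hashes


def shared_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Count blocks of A whose content also appears as a block of B.

    Returns (count, percentage of A's blocks).
    """
    hashes_a = block_hashes(path_a, block_size)
    hashes_b = set(block_hashes(path_b, block_size))
    shared = sum(1 for h in hashes_a if h in hashes_b)
    percent = 100.0 * shared / len(hashes_a) if hashes_a else 0.0
    return shared, percent


def report(path_a, path_b, block_size=BLOCK_SIZE):
    """Format the result in the style asked for above."""
    shared, percent = shared_blocks(path_a, path_b, block_size)
    return "File %s shares %d (%.1f%%) data blocks with file %s" % (
        path_a, shared, percent, path_b)
```

Because the blocks are aligned to fixed offsets, this only catches
files that are truncated or have bytes appended. A single byte inserted
or removed in the middle shifts every later block and the match is
lost. Comparing every pair of files in a large repository this way is
also quadratic, so an index from block hash to file list would be
needed in practice.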
Thanks for any insights.
Regards
Thomas
    
    