Easy way to identify files which share some content/blocks

Mon May 2 09:55:55 PDT 2011

You could dump out the B-tree information.  I don't know how clear a
picture would come from that, and it may require some massaging of
the data anyway, since files that are not duplicates may still share
some matching data, especially when dealing with larger image
files.

If you are sure that the corruption lies at the end of the files, you
could loop over the files, read the first x bytes of each, then MD5
that data.  Matching MD5 = matching first x bytes, so the files are
most likely the same picture, with one of them truncated or padded.
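
A rough sketch of that loop in Python, in case it helps.  The names
prefix_md5 and group_by_prefix and the 64 KiB default for x are my own
assumptions, not anything HAMMER provides.  Files shorter than x are
skipped, since their hash would cover the whole file and could never
match a longer file's prefix.

```python
import hashlib
import os
import sys
from collections import defaultdict


def prefix_md5(path, nbytes):
    """Return the MD5 hex digest of the first nbytes of the file at path."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def group_by_prefix(root, nbytes=65536):
    """Group files under root whose first nbytes have the same MD5.

    Returns a dict mapping digest -> sorted list of paths, keeping only
    digests shared by at least two files.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip non-regular files and files too short to compare.
            if not os.path.isfile(path) or os.path.getsize(path) < nbytes:
                continue
            groups[prefix_md5(path, nbytes)].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}


if __name__ == "__main__":
    # Usage: python prefixdups.py /path/to/pictures
    for digest, paths in group_by_prefix(sys.argv[1]).items():
        print(digest)
        for path in paths:
            print("   ", os.path.getsize(path), path)
```

Within each printed group, files with different sizes are the
candidates for "A is an incomplete version of B".  This only catches
damage at the tail; anything broken inside the first x bytes will not
group with its intact copy.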

On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch
<fwd+usenet-spam2011q2 at bsd-solutions-duesseldorf.de> wrote:
> Hello,
>
> now that Dragonfly's HAMMER has got deduplication I ask myself if there
> is a simple way to identify "pairs" or groups of files which share a lot
> of data, i.e. are mostly identical.
>
> I have a rather large repository of downloaded pictures, which contain
> a lot of dupes in multiple locations. I have no problems finding those
> given some time and a shell prompt.
>
> I'm interested in identifying broken files. Broken in the sense that
> A is an incomplete version of B (some bytes missing), or B a damaged
> version of A (some additional bytes at the end).
>
> Is there a way to get to something like this:
>
> "File A shares 1234 (98.3%) data blocks with file B"
> "File A shares xxxx (xx.x%) data blocks with file C"
>
> Getting a step closer helps too.
>
> Thanks for any insights.
>
>
> Regards
> Thomas
>