Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
Matthew Dillon
dillon at apollo.backplane.com
Fri Jul 22 09:57:40 PDT 2011
:...
:> >> take?
:> >>
:> >
:> > I ran them one by one, at my own pace, but the biggest two
:> > simultaneously did not take more than 2 hrs.
:> > So I guess 2-3 hrs would be a nice approximation :-)
:>
:> My experiences were different on a file system containing a lot of data
:> (>2TB).
:>
:> I didn't try dedup itself but a dedup-simulate already ran for more than
:> two days (consuming a lot of memory in the process) before I finally
:> cancelled it.
:
: Most odd - I just tried a dedup-simulate on a 2TB filesystem with
:about 840GB used; it finished in about 30 seconds and reported a ratio of
:1.01 (dedup has been running automatically every night on this FS).
:
:--
:Steve O'Hara-Smith | Directable Mirror Arrays
I think this could be a case where the more CRC collisions we have,
the more I/O dedup (or dedup-simulate) needs to issue to determine
whether a collision is an actual dup or just a CRC collision where
the data differs.
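As a rough illustration only (the in-memory "media" array, BLOCK_SIZE
and is_actual_dup() below are made up for this sketch, not HAMMER
code): a matching CRC only nominates a candidate pair, and the
candidate blocks still have to be read back and compared byte for
byte before they can be deduplicated - those reads are the extra I/O.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 8            /* tiny blocks keep the example small */

/*
 * Hypothetical in-memory "media": blocks 0 and 1 are true duplicates,
 * block 2 stands in for a CRC collision (same imaginary CRC, different
 * data).
 */
static const uint8_t media[3][BLOCK_SIZE] = {
    "AAAAAAA",
    "AAAAAAA",
    "AAAABBB",
};

/*
 * A matching CRC is only a hint.  The data has to be read back (two
 * extra reads on a real filesystem) and compared before the records
 * can be deduplicated.
 */
static int
is_actual_dup(int block_a, int block_b)
{
    uint8_t buf_a[BLOCK_SIZE], buf_b[BLOCK_SIZE];

    memcpy(buf_a, media[block_a], BLOCK_SIZE);  /* stand-in for read #1 */
    memcpy(buf_b, media[block_b], BLOCK_SIZE);  /* stand-in for read #2 */

    return (memcmp(buf_a, buf_b, BLOCK_SIZE) == 0);
}

int
main(void)
{
    printf("blocks 0,1: %s\n",
        is_actual_dup(0, 1) ? "actual dup" : "CRC collision only");
    printf("blocks 0,2: %s\n",
        is_actual_dup(0, 2) ? "actual dup" : "CRC collision only");
    return (0);
}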
The memory use can be bounded with some additional work on the software,
if someone wants to have a go at it. Basically the way you limit memory
use is by dynamically limiting the CRC range that you observe in a pass.
As you reach a self-imposed memory limit you reduce the CRC range and
throw away out-of-range records. Once the pass is done you start a new
pass with the remaining range. Rinse, repeat until the whole thing is
done.
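A minimal sketch of that multi-pass idea, assuming a made-up record
set, a tiny self-imposed memory limit and a simple "shrink to just
below the largest CRC held" policy (record_t, MEM_LIMIT and the data
below are all illustrative, not the real HAMMER dedup code):

#include <stdint.h>
#include <stdio.h>

#define MEM_LIMIT 4             /* self-imposed record limit per pass */
#define NRECORDS  10

typedef struct {
    uint32_t crc;               /* data CRC from the B-Tree record */
    int      block;             /* stand-in for the on-media block */
} record_t;

/* Fake data set standing in for a full scan of the data records. */
static const record_t records[NRECORDS] = {
    {0x10, 0}, {0x80, 1}, {0x10, 2}, {0xF0, 3}, {0x42, 4},
    {0x80, 5}, {0xA0, 6}, {0x42, 7}, {0x10, 8}, {0xF0, 9},
};

int
main(void)
{
    uint32_t lo = 0x00;                 /* start of the unfinished range */
    const uint32_t range_max = 0xFF;    /* 8-bit CRC space keeps it small */

    for (;;) {
        record_t table[MEM_LIMIT];
        uint32_t hi = range_max;        /* shrinks whenever memory fills */
        int count = 0;
        int i, j, k;

        for (i = 0; i < NRECORDS; ++i) {
            uint32_t crc = records[i].crc;

            if (crc < lo || crc > hi)
                continue;               /* outside this pass's CRC range */

            if (count == MEM_LIMIT) {
                /*
                 * Self-imposed memory limit reached: reduce the CRC
                 * range to just below the largest CRC we hold and
                 * throw away the now out-of-range records.  A later
                 * pass picks them up again.
                 */
                uint32_t max = table[0].crc;

                for (j = 1; j < count; ++j)
                    if (table[j].crc > max)
                        max = table[j].crc;
                hi = max - 1;

                for (j = k = 0; j < count; ++j)
                    if (table[j].crc <= hi)
                        table[k++] = table[j];
                count = k;

                if (crc > hi)
                    continue;
            }
            table[count++] = records[i];
        }

        /* Real code would compare and dedup the collected records here. */
        printf("pass [%#04x-%#04x]: %d records kept\n",
            (unsigned)lo, (unsigned)hi, count);

        if (hi == range_max)
            break;                      /* remaining range fit in memory */
        lo = hi + 1;                    /* next pass covers the deferral */
    }
    return (0);
}

One caveat with the simple shrink policy above: if more records share
a single CRC value than fit in memory, the range cannot shrink any
further, so a real implementation would need a fallback for that case.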
That would make it possible to run dedup with bounded memory. However,
the extra I/Os required to verify duplicate data cannot be avoided.
-Matt
Matthew Dillon
<dillon at backplane.com>