Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4 (30%) of Disk Space from an Almost Full Drive

Matthew Dillon dillon at apollo.backplane.com
Fri Jul 22 09:57:40 PDT 2011


:...
:> >> take?
:> >>
:> >
:> > I ran them one by one, at my own pace, but the biggest two run
:> > simultaneously did not take more than 2 hrs.
:> > So I guess 2-3 hrs would be a nice approximation :-)
:> 
:> My experiences were different on a file system containing a lot of data  
:> (>2TB).
:> 
:> I didn't try dedup itself but a dedup-simulate already ran for more than  
:> two days (consuming a lot of memory in the process) before I finally  
:> cancelled it.
:
:	Most odd - I just tried a dedup-simulate on a 2TB filesystem with
:about 840GB used, it finished in about 30 seconds and reported a ratio of
:1.01 (dedup has been running automatically every night on this FS).
:
:-- 
:Steve O'Hara-Smith                          |   Directable Mirror Arrays

    I think this could be a case where the more CRC collisions we have,
    the more I/O dedup (or dedup-simulate) needs to issue to determine
    whether each collision is an actual dup or just a CRC collision where
    the underlying data is different.
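
    A rough sketch in C of what that verification step amounts to (a
    hypothetical helper, not the actual HAMMER dedup code): a matching
    CRC is only a hint, so both blocks still have to be read back from
    the media and byte-compared, and those reads are the extra I/O.

/*
 * Hypothetical illustration only -- not HAMMER code.  A CRC match says
 * two blocks *might* be identical; confirming it requires reading both
 * blocks and comparing the data, which costs two extra reads per
 * collision whether or not it turns out to be a real duplicate.
 */
#include <sys/types.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
verify_duplicate(int fd, off_t off_a, off_t off_b, size_t len)
{
    char *a = malloc(len);
    char *b = malloc(len);
    int dup = 0;

    if (a != NULL && b != NULL &&
        pread(fd, a, len, off_a) == (ssize_t)len &&
        pread(fd, b, len, off_b) == (ssize_t)len)
        dup = (memcmp(a, b, len) == 0);   /* 1 = confirmed duplicate */
    free(a);
    free(b);
    return dup;                           /* 0 = mere collision or I/O error */
}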

    The memory use can be bounded with some additional work on the software,
    if someone wants to have a go at it.  Basically the way you limit memory
    use is by dynamically limiting the CRC range that you observe in a pass.
    As you reach a self-imposed memory limit you reduce the CRC range and
    throw away out-of-range records.  Once the pass is done you start a new
    pass with the remaining range.  Rinse, repeat until the whole thing is
    done.

    That would make it possible to run de-dup with bounded memory.  However,
    the extra I/Os required to verify duplicate data cannot be avoided.
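
    As a rough, self-contained sketch of that multi-pass idea (hypothetical
    code, not the actual HAMMER implementation, with the "memory limit"
    shrunk to a tiny cap on table entries so the behavior is visible), the
    control flow might look like this:

/*
 * Sketch of the bounded-memory, CRC-range-limited dedup scan described
 * above.  Each pass only collects records whose CRC falls inside the
 * current range; when the table hits its limit the upper bound of the
 * range is pulled in and out-of-range records are thrown away.  The
 * next pass starts where the previous one left off, until the whole
 * CRC space has been covered.
 */
#include <stdint.h>
#include <stdio.h>

#define TABLE_LIMIT 4               /* tiny "memory limit" for the demo */

static void
run_pass(const uint32_t *crcs, size_t ncrcs, uint32_t lo, uint32_t *hi_out)
{
    uint32_t table[TABLE_LIMIT];
    uint32_t hi = UINT32_MAX;
    size_t nused = 0;
    size_t i, j;

    for (i = 0; i < ncrcs; i++) {
        uint32_t crc = crcs[i];

        if (crc < lo || crc > hi)
            continue;               /* outside this pass's range */
        if (nused == TABLE_LIMIT) {
            /*
             * Memory limit reached: shrink the range so the largest
             * collected CRC falls outside it, then drop every record
             * that is now out of range.  The cut-off tail is handled
             * by a later pass.
             */
            uint32_t max = 0;

            for (j = 0; j < nused; j++)
                if (table[j] > max)
                    max = table[j];
            hi = max - 1;
            for (j = 0; j < nused; ) {
                if (table[j] > hi)
                    table[j] = table[--nused];
                else
                    j++;
            }
            if (crc > hi)
                continue;
        }
        table[nused++] = crc;       /* remember this record */
    }

    /* a real implementation would now verify and merge duplicates */
    printf("pass covered CRCs 0x%x..0x%x with %zu records\n",
        (unsigned)lo, (unsigned)hi, nused);
    *hi_out = hi;
}

int
main(void)
{
    /* pretend these CRCs were collected while scanning the filesystem */
    const uint32_t crcs[] = {
        0x10, 0x90, 0x20, 0x90, 0x70, 0x30, 0xf0, 0x20, 0x60, 0xa0
    };
    uint32_t lo = 0;
    uint32_t hi;

    do {
        run_pass(crcs, sizeof(crcs) / sizeof(crcs[0]), lo, &hi);
        if (hi == UINT32_MAX)
            break;                  /* whole CRC space covered */
        lo = hi + 1;                /* rinse, repeat on the remainder */
    } while (1);
    return 0;
}

    Each pass here re-walks the whole CRC list; in exchange the table
    never holds more than TABLE_LIMIT records at once.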

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




