[Tux3] Comparison to Hammer fs design

Fri Jul 25 12:01:38 PDT 2008

:Hi;
:
:The announcement of yet another filesystem:
:
:http://lkml.org/lkml/2008/7/23/257
:
:led to some comments about hammer fs:
:
:http://tux3.org/pipermail/tux3/2008-July/000006.html
:
:enjoy,
:
:     Pedro.

    Those are interesting comments.   I think I found Daniel's email address
    so I am adding him to the To:   Dan, feel free to post this on your Tux
    groups if you want.

    I did consider multiple-parentage...  that is the ability to have a
    writable snapshot that 'forks' the filesystem history.  It would be
    an ultra cool feature to have but I couldn't fit it into the B-Tree
    model I was using.  Explicit snapshotting would be needed to make it
    work, and the snapshot id would have to be made part of the B-Tree key,
    which is fine.  HAMMER is based around implicit snapshots (being able
    to do an as-of historical lookup without having explicitly snapshotted
    bits of the history).  I would caution against using B-Tree iterations
    in historical lookups, B-Trees are only fast if you can pretty much
    zero in on the exact element you want as part of the primary search
    mechanic.  Once you have to iterate or chain performance goes out the
    window.  I know this because there are still two places where HAMMER
    sometimes has to iterate due to the inexact nature of an as-of lookup.
    Multiple-parentage almost certainly require two inequality searches,
    or one inequality search and an iteration.  A single timeline only
    requires one inequality search.

    I couldn't get away from having a delete_tid (the 'death version
    numbers').  I really tried :-)  There are many cases where one is only
    deleting, rather then overwriting.   Both numbers are then needed to
    be able to properly present a historical view of the filesystem.
    For example, when one is deleting a directory entry a snapshot after
    that deletion must show the directory entry gone.  Both numbers are
    also needed to be able to properly prune out unwanted historical data
    from the filesystem.  HAMMER's pruning algorithm (cleaning out old 
    historical data which is no longer desired) creates holes in the
    sequence so once you start pruning out unwanted historical data
    the delete_tid of a prior record will not match the create_tid of the
    following one (historically speaking).

    At one point I did collapse the holes, rematching the delete_tid with
    the create_tid of the following historical record, but I had to remove
    that code because it seriously complicated the mirroring implementation.
    I wanted each mirror to be able to have its own historical retention
    policy independant of the master.  e.g. so you could mirror to a backup
    system which would retain a longer and more coarse-grained history then
    the production system.

    --

    I also considered increasing the B-Tree fan-out to 256 but decided
    against it because insertions and deletions really bloated up the
    UNDO FIFO.   Frankly I'm still undecided as to whether that was a good
    idea, I would have prefered 256.  I can actually compile in 256 by
    changing a #define, but I released with 64 because I hit up against
    a number of performance issues:  bcopy() overheads for insertions
    and deletions in certain tests became very apparent.  Locking
    conflicts became a much more serious issue because I am using whole-node
    locks rather then element locks.  And, finally, the UNDO records got
    really bloated.  I would need to adjust the node locking and UNDO
    generation code significantly to remove the bottlenecks before I could
    go to a 256-element B-Tree node.

    HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
    another limiting factor for the fan-out I can use.  My elements
    are 64 bytes each.  64x64 = 4K per B-Tree node.  I decided to go
    with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
    64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
    other auxillary info.  That's instead of using a radix-compressed key.
    Radix compressed keys would have doubled the complexity of the B-Tree
    code, particularly with the middle-of-tree pointer caching that
    HAMMER does.

    The historical access mechanic alone added major complexity to the
    B-Tree algorithms.  I will note here that B-Tree searches have a very
    high cpu overhead no matter how you twist it, and being able to cache
    cursors within the B-Tree is absolutely required if you want the
    filesystem to perform well.  If you always start searches at the root
    your cpu overhead will be horrendous... so plan on caching cursors
    from the get-go.

    If I were to do radix compression I would also want to go with a
    fully dynamic element size and fully dynamic fan-out in order to
    best-compress each B-Tree node.  Definitely food for thought.  I'd
    love to do something like that.  I think radix compression would
    remove much of the topological bloat the B-Tree creates verses using
    blockmaps, generally speaking.

    Space management is currently HAMMER's greatest weakness, but it only
    applies to small storage systems.  Several things in HAMMER's design
    are simply not condusive to a small storage.  The storage model is not
    fine-grained and (unless you are deleting a lot of stuff) needs
    reblocking to actually recover the freed space.  The flushing algorithms
    need around 100MB of UNDO FIFO space on-media to handle worst-case
    dependancies (mainly directory/file visibility issues on crash recovery),
    and the front-end caching of modifying operations, since they don't
    know exactly how much actual storage will be needed, need ~16MB of
    wiggle room to be able to estimate the on-media storage required to
    back the operations cached by the front-end.  Plus, on top of that,
    any sort of reblocking also needs some wiggle room to be able to clean
    out partially empty big-blocks and fill in new ones.

    On the flip side, the reblocker doesn't just de-fragment the filesystem,
    it is also intended to be used for expanding and contracting partitions,
    and adding removing volumes in the next release.  Definitely a
    multi-purpose utility.

    So I'm not actually being cavalier about it, its just that I had to
    make some major decisions on the algorithm design and I decided to
    weight the design more towards performance and large-storage, and
    small-storage suffered a bit.

    --

    In anycase, it sounds like Tux3 is using many similar ideas.  I think
    you are on the right track.  I will add one big note of caution, drawing
    from my experience implementing HAMMER, because I think you are going
    to hit a lot of the same issues.

    I spent 9 months designing HAMMER and 9 months implementing it.  During
    the course of implementing it I wound up throwing away probably 80% of
    the original design outright.  Amoung the major components I had to
    rewrite were the LOG mechanic (which ultimately became the meta-data
    UNDO FIFO), and the fine-grained storage mechanic (which ultimately
    became coarse-grained).  The UNDO FIFO was actually the saving grace,
    once that was implemented most of the writer-ordering dependancies went
    away (devolved into just flushing meta-data buffers after syncing the
    UNDO FIFO)... I suddenly had much, much more freedom in designing
    the other on-disk structures and algorithms.

    What I found while implementing HAMMER was that the on-disk topological
    design essentially dictated the extent of HAMMER's feature set AND
    most of its deficiencies (such as having to reblock to recover space),
    and the algorithms I chose often dictated other restrictions.  But the
    major restrictions came from the on-disk structures.

    Because of the necessarily tight integration between subsystems I found
    myself having to do major redesigns during the implementation phase.
    Fixing one subsystem created a cascade effect that required tweaking other
    subsystems.  Even adding new features, such as the mirroring, required
    significant changes in the B-Tree deadlock recovery code.  I couldn't get
    away from it.  Ultimately I chose to simplify some of the algorithms
    rather then have to go through another 4 months of rewriting.  All
    major designs are an exercise in making trade-offs in topology, feature
    set, algorithmic complexity, debuggability, robustness, etc.  The list
    goes on forever.

    Laters!

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>