[Tux3] Comparison to Hammer fs design

Matthew Dillon dillon at apollo.backplane.com
Tue Aug 5 10:37:58 PDT 2008

:The payoff for a clean frontend/backend separation is, as you say, a
:clean design.  The art in this is to accomplish the separation without
:any throughput compromise, which I am willing to believe you have done
:in Hammer, and which I now wish to achieve by a somewhat different
:means in Tux3.

    It's been a major advantage for HAMMER.   The major improvement I made
    to make sure throughput was retained was to create a space reservation
    subsystem based around a structure called hammer_reserve.  This allows
    the frontend to reserve space on the media without having to modify
    any meta-data or commit to the allocation.

    This in turn allows the frontend to reserve space for bulk data writes
    and actually perform the writes directly to the media, and then queue
    just the meta-data, along with the reservation to the backend.  The
    backend then finalizes the reservation (doing the actual allocation)
    at the same time it lays down the related meta-data.

    The result is that you get nice pipelining when writing lots of bulk
    data.  It is possible to literally write gigabytes of bulk data
    before the backend needs to flush any of the related meta-data.

    Separating the two out also made testing the filesystem a lot easier.
    Most of the frontend could be tested without a working backend.

:Note: when I write VFS I mean "virtual filesystem switch" and when you
:write VFS you mean "virtual filesystem", which in Linux is just called
:a filesystem.

    Yah :-)

:Now here is another way of looking at the difference between undo and
:redo.  Each of my redo log entries leaves the filesystem meta-metadata
:in a consistent state in the sense that all btree structure will be
:consistent.  But the logical metadata may be inconsistent, for example,
:a directory may have appear to have been deleted on disk while some
:deletions of files it contained were not persistently recorded.  I think
:one can take advantage of such partial, low level consistency in order
:to do a rapid scan and repair, preserving as much as possible of what
:the filesystem was doing before the crash of a busy system with lots of
:transactions in flight.

    Yah, I hit up against similar issues which I resolved with the
    'flush group' abstraction, which groups together dependant operations.
    If the group would become too large (e.g. you have an arbitrarily long
    chain of dependancies), it can break the chain by flushing out the
    inode at the break point with an adjusted nlinks count.  That way
    each flush group can be made logically consistent.

:>     phase of the B-Tree updates (on an inode-by-inode basis so actual
:>     blocking is virtually non-existant).  It is often possible to copy
:>     an entire directory tree without initiating a backend flush.
:I presume that if one userland process is blocked on fsync, the others
:are still able to continue dumping their operations into cache?

    Yah.  fsync() just queues the inode to the flusher and the flusher
    bangs away at it until its all out, then wakes up the waiting fsync

:>     Some blocking between the frontend and backend occurs but not a whole
:>     lot.  The backend flusher actually blocks against itself more often
:>     (it runs multiple threads), due to the locality of reference within
:>     the B-Tree for nearby inodes.
:There is no need for multiple flushing threads in Tux3, I will use bio
:submissions instead, which are completely asynchronous, returning
:completion status in interrupt mode as they do.

    The issue HAMMER had is that the flusher has to perform multiple media
    B-Tree operations, which can block on reads.  I wound up going with
    multiple threads to parallelize the blocked reads.

:It's a thought.  But anyway, I don't see a big downside to the PHTree
:approach.  It might even end up faster than HTree (which does not
:maintain dirent stability) by virtue of the slightly smaller (5-10%)
:cache footprint.  If classic HTree turns out to be faster, I could offer
:both as an option.  The code that differs between them should be about

    None of the standards require directory stability within the cached
    dirent buffer.  That is, if libc caches just the base seek point for
    a set of directory entries, there is no requirement that the positioning
    of elements beyond the first element at that base seek point be

    The standards only require stability at explicit seek points.
    In fact, the standards explicitly say that trying to seek around inside
    a directory after performing a create or delete has undefined operation.
    The only thing you can count on is a linear iteration of the

:>     You should be able to use 63 bits of the cookie, I don't know why
:>     you wouldn't use the high bit of the lsb 32 bit part.  There is no
:>     requirement that that bit be 0.  In fact, the RFC says the cookie is
:>     a 64 bit unsigned integer and you should be able to use all 64 bits.
:>     If linux is not allowing all 64 bits to be used then it's a serious
:>     bug in linux.  The cookies are supposed to be opaque, just like the
:>     file handle.
:You are right about the cookie 64 bitness, I don't know where I picked
:up that lore, mea culpa.

    I seem to recall linux having issues with those bits too, it did strike
    a note.  I don't remember from where but it was a long time ago. 
    Frankly, any linuxisms that don't allow the use of at least 63 bits
    need to be fixed in linux.

:>     I like :-).  I'll probably end up with multiple directory hash formats
:>     too before this is over.  Fortunately I have room in the inode to
:>     specify the format the directory will use.
:I overlooked the part where you mentioned giving some lexical ordering
:to your hash to get better locality between lexically related records. 
:I am not sure this is a win, it will require a lot of analysis, which
:starts to drift off the critical path for me, since crystallizing the
:transaction design is really a lot more important.
:An inverse idea to making the hash nonrandom is to make the goal inum
:depend somewhat on the name hash for a newly created file.  So
:relatively fewer PHTree index leaf nodes, which contain [hash, dirblock]
:pairs, will be have to be reloaded into cache during deletion of a
:massive directory in physical dirent order.  The goal inum has to depend
:mostly on the physical dirent location though, to avoid thrashing the
:inode table during mass creation, deletion or stat.  I am not sure
:whether there is a gain to be had vs keeping the random hash completely
:random, which does deliver a benefit in balancing the tree better.

    That's an interesting idea.

:>     BSD buffer cache buffers are backed by vm_page_array[], an array of
:>     VM pages, and the base of the buffer does not have to be page-aligned.
:>     So variable page sizes are not necessary.
:vm_page_array... is that like a Linux page cache?  (aka mapping, aka
:address_space)  Buffers on page cache pages do not have to be


:page-aligned in Linux, or rather, each page in Linux is either fully
:covered with buffers or has none at all.  What sucks about all this is
:having buffers _and_ page descriptors, where the two perform very
:similar functions and duplicate a lot of state, causing endless bugs
:due to skew in the cached state between the two kinds of objects.

    BSD buffers can overlap pages.  e.g. you can have 4K pages and 6K buffers,
    so every other page is overlapped by two buffers.  FreeBSD and DragonFly
    take care of keeping track what portions of the VM pages are actually
    dirty (and it gets more complicated if the pages are mapped into

:This post required a week to respond to because I thought I should also
:post some code in parallel.

    Don't wear yourself out!  This conversation has been fun and I've
    gotten a lot of good ideas out of it, plus a much better understanding
    of REDO.

					Matthew Dillon 
					<dillon at backplane.com>

More information about the Kernel mailing list