[Tux3] Comparison to Hammer fs design

Matthew Dillon dillon at apollo.backplane.com
Sun Jul 27 20:55:59 PDT 2008


:Tux3 never backs anything out, so there is some skew to clear up.

    Got it.  I'm still not clear on the physical vs logical log
    interactions required for crash recovery, but I think I understand
    everything else.

:>     One of the reasons why HAMMER uses an UNDO log is that it is a lot
:>     less expensive then a REDO log.  I'll show you why:
:> 
:>     Physical operations related to a single flush group in HAMMER:
:> 
:> 	modify element in B-Tree node A	(create UNDO w/ original contents of
:> 					 B-Tree node A)
:> 	modify element in B-Tree node A	(no new UNDO needs to be created)
:> 	modify element in B-Tree node A	(no new UNDO needs to be created)
:> 	modify element in B-Tree node A	(no new UNDO needs to be created)
:> 	modify element in B-Tree node A	(no new UNDO needs to be created)
:
:Nice, now I understand.  But do you not have to hold all filesystem
:transactions that depend on the modification until the btree node has
:completed writing to disk?  With logical logging you only have to wait
:on the logical commit to complete, into which may be packed a number of
:other changes for efficiency.

    The only major interlock between frontend searches and backend flushes
    occurs on an inode-by-inode basis and only for the duration of the
    modifying B-Tree operation, which does *not* include the writing of
    the modified meta-data buffer to disk.  Once the operation concludes
    we are left with dirty meta-data buffers which for all intents and
    purposes have transferred the in-memory cached B-Tree records (held
    as separate entities in a red-black tree in-memory) into the media
    view of the B-Tree (having performed the actual B-Tree operations
    necessary to insert or delete the record), and the inode lock is
    then released.  The flusher flushes each inode on the flush list in
    turn, with certain long-lived inodes (e.g. truncation of a 1TB
    file to 0) only getting partial goodness and then being moved to
    the next flush list.  The inode lock is held for each individual
    inode as it is being flushed, one inode at a time.

    The actual dirty meta-data buffers are flushed later on, without
    any inode lock held and without having any blocking effect on the
    frontend.  This media-write sequence occurs at the end, after all
    inodes have been processed.  UNDO finishes writing, we wait for
    I/O to complete, we update the volume header, we wait for I/O
    to complete, then we asynchronously flush the dirty media buffers
    (and don't wait for that I/O to complete so the writes can be merged
    with the beginning of the next flush group).
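    The ordering above can be sketched as a toy model (Python; the
    "disk", barrier(), and all names here are invented stand-ins for
    illustration, not actual HAMMER code):

```python
# Illustrative sketch of the flush-group commit ordering described above.
# The "disk" is just an append-only list; barrier() stands in for waiting
# on I/O completion.  All names are hypothetical.

disk = []

def write(kind, payload):
    disk.append((kind, payload))

def barrier():
    disk.append(("BARRIER", None))

def commit_flush_group(undo_records, new_header, dirty_buffers):
    for rec in undo_records:          # 1. write the UNDO records
        write("UNDO", rec)
    barrier()                         # 2. wait for UNDO I/O to complete
    write("HEADER", new_header)       # 3. update the volume header
    barrier()                         # 4. wait for header I/O to complete
    for buf in dirty_buffers:         # 5. flush media buffers async --
        write("MEDIA", buf)           #    no wait, so writes can merge
                                      #    with the next flush group

commit_flush_group(["u1", "u2"], "hdr", ["b1", "b2"])
kinds = [k for k, _ in disk]
print(kinds)
```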

    That is what I meant when I said the frontend was disconnected
    from the backend.  The worst the frontend ever does is issue reads
    on media meta-data as part of any merged memory-media B-Tree searches
    it must perform to locate information.  Any modifying operations made
    by the frontend will create records in the in-memory red-black tree,
    but will *NOT* do any modifying operations on meta-data buffers
    representing the on-media B-Tree.  Frontend lookups are resolved by
    doing a merged-search with two cursors (one on the in-memory red-black
    tree, and one on the on-media B-Tree).
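    The merged-search idea can be sketched as follows (a Python toy, with
    plain dicts standing in for the two cursors; DELETED plays the role of
    a cancel-out record -- not HAMMER's actual cursor code):

```python
# Illustrative merged lookup over two views: an in-memory record store
# and the on-media B-Tree.  In-memory records override the media view;
# a DELETED marker cancels out a directory entry still present on-media.

DELETED = object()

def merged_lookup(key, memory, media):
    # memory and media are dicts standing in for the two cursors
    if key in memory:
        rec = memory[key]
        return None if rec is DELETED else rec
    return media.get(key)

media = {"a": 1, "b": 2}
memory = {"b": 20, "c": 3, "a": DELETED}
print(merged_lookup("a", memory, media), merged_lookup("b", memory, media))
```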

    I considered doing the interlock at the record level but hit
    some hiccups in that one of the cursors may have already passed the
    memory record being committed but the other cursor (the media view)
    may not have yet passed the B-Tree index where that record would be
    written to.  So if a commit were to occur right then, without a
    governing inode lock, the merged search might see a duplicate record
    (or no record in either view in the case of a deletion).

:>     I just run the UNDOs backwards, which allows me to cache the in-flush
:>     undo areas with a fixed set of small throw-away tracking structures
:>     in system memory.  If a structure gets thrown away it just means an
:>     extraneous UNDO record will be appended to the UNDO FIFO.
:
:I think I see it, but I have my doubts because you have to block
:transactions waiting for the up to date copy of the btree data to land
:on disk.  Either that, or you may give userspace the impression that
:some transaction has gotten onto stable storage when that is not the
:case.

    Userland doesn't care whether the actual transactions have made it
    to disk or not unless it actually calls fsync().  If it does then
    it will block through the necessary flush cycles.  fsync() on
    HAMMER is pretty slow as it will likely be syncing a lot more
    in the flush group than just the requested inode.

    If userland does not call fsync() (or 'sync', etc) then it doesn't
    block at all, except as described above during the modify-buffer
    phase of the B-Tree updates (on an inode-by-inode basis so actual
    blocking is virtually non-existent).  It is often possible to copy
    an entire directory tree without initiating a backend flush.

    Media B-Tree elements, inclusive of any held in dirty meta-data buffers,
    are locked on a B-Tree node by B-Tree node basis.  These are very
    short lived locks and have nothing to do with the actual flushing
    of the dirty meta-data buffers to the physical media.  The frontend
    always gets shared locks, the backend always gets exclusive locks.
    Some blocking between the frontend and backend occurs but not a whole
    lot.  The backend flusher actually blocks against itself more often
    (it runs multiple threads), due to the locality of reference within
    the B-Tree for nearby inodes.
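    The throw-away UNDO tracking quoted earlier ("no new UNDO needs to be
    created") reduces to something like this sketch (Python; all names
    hypothetical):

```python
# Sketch of the UNDO dedup idea: only the first modification of a given
# block in a flush cycle appends an UNDO record with the block's original
# contents.  The tracking set is throw-away: evicting an entry is always
# safe, it merely means a redundant UNDO record gets appended.

undo_fifo = []
tracked = set()          # small, bounded, throw-away tracking structures

def modify_block(blockno, original_contents):
    if blockno not in tracked:
        undo_fifo.append((blockno, original_contents))
        tracked.add(blockno)
    # ... perform the actual modification of the block ...

for _ in range(5):
    modify_block(42, "original A")   # five edits, only one UNDO record
modify_block(7, "original B")
print(len(undo_fifo))
```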

:>     The cookies are 64 bits in DragonFly.  I'm not sure why Linux would
:>     still be using 32 bit cookies, file offsets are 64 bits so you
:>     should be able to use 64 bit cookies.
:
:It is not Linux that perpetrates this outrage, it is NFS v2.  We can't
:just tell everybody that their NFS v2 clients are now broken.

    Oh, we don't care about NFSv2 all that much any more.  NFSv3 is the
    bare minimum.  NFSv2 is extremely old, nobody should be using it any
    more.  Even NFSv3 is getting fairly long in the tooth now.

:>     For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
:>     and 32 bits is an iterator to deal with hash collisions.  Poof,
:>     problem solved.
:
:Which was my original proposal to solve the problem.  Then Ted told me
:about NFS v2 :-O
:
:Actually, NFS hands you a 62 bit cookie with the high bits of both s32
:parts unused.  NFS v2 gives you a 31 bit cookie.  Bleah.

    I'd recommend dropping support for NFSv2.  It is not really worth 
    supporting any more.   Does it even support 64 bit inodes? (I don't 
    remember), or 64 bit file offsets?  NFSv2 is garbage.

    You should be able to use 63 bits of the cookie, I don't know why
    you wouldn't use the high bit of the lsb 32 bit part.  There is no
    requirement that that bit be 0.  In fact, the RFC says the cookie is
    a 64 bit unsigned integer and you should be able to use all 64 bits.
    If Linux is not allowing all 64 bits to be used then it's a serious
    bug in Linux.  The cookies are supposed to be opaque, just like the
    file handle.
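    One possible packing of the 64-bit cookie described above -- a 32-bit
    directory hash in the high half, a 32-bit collision iterator in the
    low half (illustrative Python, not DragonFly's actual code):

```python
# Sketch of a 64-bit NFS readdir cookie: hash key in the high 32 bits,
# collision iterator in the low 32 bits.  The cookie is opaque to the
# client, so any stable packing works.

def make_cookie(hash32, iterator32):
    assert 0 <= hash32 < 2**32 and 0 <= iterator32 < 2**32
    return (hash32 << 32) | iterator32

def split_cookie(cookie):
    return cookie >> 32, cookie & 0xFFFFFFFF

c = make_cookie(0xDEADBEEF, 5)
print(hex(c), split_cookie(c))
```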

:>     which means that records in the B-Tree aren't even close to being
:>     in any sort of lexical order.  I will probably wind up changing
:>     the hash calculation a bit to get some lexical sorting information
:>     stuffed into it... should be doable since the hash keys are 64 bits.
:
:Yes, I noticed that.  Check out dx_hack_hash:
:
:   http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38
:..
:It distributed hashes of ascii strings very well for me, with few
:funnels.  It way outperformed some popular hashes like TEA.  Ted's cut
:down cryptographic hash is yet more uniform but costs much more CPU.

    I like :-).  I'll probably end up with multiple directory hash formats
    too before this is over.  Fortunately I have room in the inode to
    specify the format the directory will use.
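    Dispatching on a per-directory hash-format field might look like this
    sketch (Python; the hash functions themselves are placeholders, not
    the actual HAMMER or ext3 hashes):

```python
# Sketch of multiple directory hash formats selected by a format field
# stored in the inode, as described above.

def hash_v0(name):
    # placeholder: a trivial 32-bit polynomial rolling hash
    h = 0
    for ch in name.encode():
        h = (h * 31 + ch) & 0xFFFFFFFF
    return h

def hash_v1(name):
    # placeholder for a later, improved hash format
    h = 0x12A3FE2D
    for ch in name.encode():
        h = ((h << 5) ^ (h >> 2) ^ ch) & 0xFFFFFFFF
    return h

HASH_FORMATS = {0: hash_v0, 1: hash_v1}

def dir_hash(inode_hash_format, name):
    # the inode's format field picks the hash for that directory
    return HASH_FORMATS[inode_hash_format](name)
```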

:...
:>     Well, the B-Tree fanout isn't actually that big a deal.  Remember
:>     HAMMER reblocks B-Tree nodes along with everything else.  B-Tree
:>     nodes in a traversal are going to wind up in the same 16K filesystem
:>     block.
:
:It affects the number of probes you have to do for the lookup.
:Binsearching hits one cache line per test while each node lookup hits a
:lot more.
:...
:It is not just the cache footprint but the time it takes to get your
:key tables into L1 cache.  To be sure, we are talking about small
:details here, but these small details can add up to a difference of a
:factor of two in execution speed.  Which maybe you don't care much about
:now that you are orders of magnitude faster than what came before ;-)

    I'm more worried about extra disk seeks, bloated UNDO records, and
    blockages due to B-Tree locking.  The fully cached stuff already
    performs extremely well and I don't feel that I have to eke out
    much more performance out of it.  And, frankly, the L1/L2 cache
    characteristics are a wash and I'm already burning a lot of inefficient
    data accesses by doing the power-of-2 search on the elements in the
    first place.

    Finding a better way to home in on the B-Tree element, aka getting
    that power-of-2 search out of the critical path... that would be worth
    doing.  Worrying about an occasional node traversal is way below
    the noise floor.

:Yes, I have thought about it a little more, and I imagine something
:like a small array of dirty bits that climb up the btree with the help
:of logical logging, where each bit means that there is something
:interesting for some corresponding replication target to look at,
:somewhere down in the subtree.

    You can do it with one transaction id / mtime field in the B-Tree node
    header.  Strictly speaking you don't need a per-element field on
    internal elements, though I have one there too because I had the space.
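    An incremental replication scan driven by one transaction id per node
    header could be sketched like this (Python; the node layout is
    invented for illustration):

```python
# Sketch of skipping unchanged subtrees during replication: each node
# header carries the max transaction id (tid) found anywhere below it,
# so a whole subtree is skipped when its tid is not newer than what the
# replication target last saw.

class Node:
    def __init__(self, tid, elements=(), children=()):
        self.tid = tid                  # max tid in this subtree
        self.elements = list(elements)  # (tid, payload) leaf records
        self.children = list(children)

def changed_since(node, last_tid, out):
    if node.tid <= last_tid:
        return                          # nothing newer below: skip subtree
    out.extend(e for e in node.elements if e[0] > last_tid)
    for child in node.children:
        changed_since(child, last_tid, out)

leaf_old = Node(5, elements=[(5, "old")])
leaf_new = Node(9, elements=[(9, "new")])
root = Node(9, children=[leaf_old, leaf_new])
out = []
changed_since(root, 7, out)
print(out)
```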

:>     Not for directory entries.  I'm pretty sure Linux uses a throw-away
:>     namecache just like we do, so unless it has a mechanic for flush
:>     sequencing of dirty namecache entries you need in-VFS structural
:>     support for caching the operations.
:
:The Linux dentry cache actually implements proper namespace semantics
:all by itself without needing to flush anything, which is what Ramfs
:is.  Ramfs just lives entirely in cache without even being backed by
:a ramdisk, until the power goes off.

    That's cool.  You will want to flush it since you are backing the
    system caches with a real filesystem, and you would probably need
    a tracking structure of some sort anyway to deal with the 
    directory entry create/delete dependencies.

:>     In particular, caching a namespace deletion (rm, rename, etc)
:>     without touching the meta-data requires implementing a merged lookup
:>     so you can cancel-out the directory entry that still exists on-media.
:>     So it isn't entirely trivial.
:
:Linux implements "negative dentries" to say "this entry is not here".

    Yes, BSDs have those too (and always have), but like the positive
    entries they are a throw-away cache.

:>     blocks.  Lookups have to take into account the cached truncation offset
:>     so as not to read data from the media past the truncation point (e.g.
:>     if you seek-write-extend or truncate-extend the file immediately
:>     after truncating the file).
:
:I have not looked at that part for a while now, and I did not look at
:it just now, but Al Viro has been fiddling with it for years getting the
:bugs out one at a time.  The last major one I heard of was some time
:ago.  It works somehow, I should look more closely at it.

    I can believe that.  It took me a while to get cached truncation
    offsets working in HAMMER.  It's probably the most complex bit of
    code after the merged memory/media-B-Tree iterator.

:>     Jeeze, BSD's have been doing that forever.  That's what VOP_BMAP is
:>     used for.  I'm a little surprised that Linux doesn't do that yet.
:>     I'll expand on that down further.
:
:Either you forgot to expand or I missed it.  I am interested.

    Oops, I think I did forget to expand that.  Basically buffer cache
    buffers under kernel management are logically indexed.  Each vnode
    has its own tree of buffer cache buffers.

    When the kernel wants to flush dirty buffers or cluster adjacent
    buffers together for I/O it uses VOP_BMAP to map the logical offsets
    to physical offsets (so kernel-supported clustering is thus based
    on physical offsets).  The VFS is responsible for implementing the BMAP
    operation.  It is the BMAP operation which assigns actual physical
    media blocks to a high level logically mapped filesystem buffer.

    The mapping is persistent as long as the buffer cache buffer exists.
    Of course clean buffers can be thrown away by the kernel so at some
    point in the future the kernel might re-request the VOP_BMAP for
    that logical offset.

    The VFS can choose when to implement the physical mapping.  It can
    tell the kernel to take a hike and do it when the kernel actually
    writes the buffer out (being a logical buffer the flush goes through
    the VFS via VOP_STRATEGY), it can do it early on when the actual
    read() or write() operation occurs, etc.
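    The BMAP role described above -- translating a vnode's logical offset
    into a physical media offset -- reduces to something like this sketch
    (Python over a simple extent list; names are invented, not the VFS
    interface itself):

```python
# Sketch of a BMAP-style logical-to-physical mapping.  Extents are
# (logical_start, physical_start, length) triples in block units.

def bmap(extents, logical):
    for lstart, pstart, length in extents:
        if lstart <= logical < lstart + length:
            return pstart + (logical - lstart)
    # a hole, or a block not yet assigned: the VFS may defer the
    # physical mapping until the buffer is actually written out
    return None

extents = [(0, 1000, 16), (16, 5000, 16)]
print(bmap(extents, 3), bmap(extents, 20), bmap(extents, 40))
```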

:I remember it well.  You were the one who put Rik up to the reverse map
:design that caused the main fireworks between 2.3 and 2.4.9.  (I was
:the one who finally got it to work by social-engineering the merging of
:the reverse map part with Andrea's more robust LRU design.)
:
:I also remember the time when BSD buffers were far superior to Linux
:ones, yes.  In 2.0 the arrangement sucked enormously: sort-of coherency
:between the buffer cache and the page cache was achieved by physically
:copying changes from one to the other.
:
:Today, the buffer cache and page cache are nearly fully unified.  But
:the unification will never be complete until the page size is variable,
:so the remaining tasks done by buffers can be done by pages instead.
:
:Anyway, this part of your OS is more similar than different, which will
:help a lot with a port.

    Hey, I like the term fireworks.   The VM system is the most complex
    piece of the kernel, in both Linux and BSD.  I'm not surprised.

    BSDs have always had a unified buffer / VM cache, but until recently
    VFS's have had to have knowledge of the VM system to implement
    VOP_PUTPAGES and VOP_GETPAGES.  One of the things I did in DragonFly
    was finally finish that unification up so VFSs now only need to
    implement VOP_READ and VOP_WRITE, and do not have to implement 
    VOP_PUTPAGES and VOP_GETPAGES any more.

    BSD buffer cache buffers are backed by vm_page_array[], an array of
    VM pages, and the base of the buffer does not have to be page-aligned.
    So variable page sizes are not necessary.

    FreeBSD is about 80% of the way there.  They haven't taken the final
    steps yet.  I don't know about NetBSD and OpenBSD.

:Indeed, I need to handle that.  Since I do want to stick with a simple
:linear sequence of logical log commits for now, I do not want to leave
:any of them sitting around for a long time.  One easy thing to do is to
:put any orphans aside at rollup time, in an unversioned orphan file, say.

    I had a little time to think about it.  I think the easiest solution
    is, if you are forced to flush an open inode with 0 links, to actually
    record the open state in a separate B-Tree so upon remounting after a
    crash you can simply scan that B-Tree and finish off the inodes
    you find in that tree.  That way it would not have to be held in the
    log at all as anything other than a transitory entry.
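    In outline, the orphan-index idea is something like this (Python toy;
    every structure here is an invented stand-in for the separate B-Tree
    and the inode table):

```python
# Sketch of tracking flushed zero-link open inodes in a separate index:
# if such an inode must be flushed, record it; after a crash, scan the
# index and finish off the deletions, so nothing lingers in the log.

orphan_index = set()          # stands in for the separate on-media B-Tree
inodes = {}                   # ino -> {"links": n, "open": bool}

def flush_inode(ino):
    if inodes[ino]["links"] == 0 and inodes[ino]["open"]:
        orphan_index.add(ino)         # persists across a crash

def close_inode(ino):
    inodes[ino]["open"] = False
    if inodes[ino]["links"] == 0:
        inodes.pop(ino)               # normal last-close reclaim
        orphan_index.discard(ino)

def recover_after_crash():
    # after remount nothing can still hold these open: finish them off
    for ino in list(orphan_index):
        inodes.pop(ino, None)
        orphan_index.discard(ino)

inodes[10] = {"links": 0, "open": True}
flush_inode(10)
# -- crash and remount --
recover_after_crash()
print(10 in inodes, len(orphan_index))
```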

:Linux LVM is all about device mapper, and actually, device mapper does
:not really have a reason to be a separate subsystem, it can just be the
:way the block layer works.  If interested, check out my posts on bio
:stacking and rolling device mapper's "clone_and_map" idea up into the
:generic block layer.  I think all unixen should do this.

     I'll look it up!

:Daniel
:

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




