HAMMER2 update - Dec 10 2012
Matthew Dillon
dillon at apollo.backplane.com
Mon Dec 10 10:53:43 PST 2012
It's been a while since I last gave a status report for HAMMER2
development, so here it is.
Firstly, please remember that a huge amount of end-to-end work needs
to be done before any of the really big-ticket items become usable.
HAMMER2 will not be production-ready for a long time.
Most of what I have been doing the last few months has been on
the network cluster protocols. In order to test these protocols I
have also worked on a remote block device client and service which
utilize them (whereby one machine's raw disks can show up on another
machine and survive temporary network disconnects and such).
Very significant progress has been made on the clustering messaging
protocols:
* The spanning tree works and self heals.
* Connection filtering for spanning tree advertisements works.
(The filtering simplifies the work the kernel messaging code must
do, which is important because I want the kernel side of things
implemented as simply as possible to make it more portable. A
rough sketch of the filtering idea follows this list.)
* Virtual circuits are properly forged and (so far) appear to
properly disconnect on network disruption.
* Individual transactions can now span the cluster end-to-end over the
VCs and mostly appear to properly disconnect on network disruption.
* A lot of bug fixing has made the userland portion of the code
significantly more robust. Still lots to do here.
* With help from Alex Hornung the encryption has been significantly
beefed up.
* The API for the kernel support module for the cluster messaging
has been significantly beefed up and made more robust. It now
features a lot of rollup automation to make the device driver and
HAMMER2 VFS side easier.
* The kernel's disk subsystem is now able to export block devices
via the cluster messaging interface.
* And I have a mostly-working (but not production quality) client-side
to import block devices via the messaging interface.
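To illustrate the SPAN advertisement filtering mentioned above, here is
a rough sketch.  The structures and names are made up for illustration
and are not the actual dmsg/kdmsg code; the point is simply that only
the best (lowest-distance) advertisement per node gets re-advertised,
so redundant paths never have to be considered on the kernel side.

/*
 * Illustrative sketch only -- hypothetical structures, not the actual
 * dmsg/kdmsg code.  Of all the spanning tree advertisements received
 * for a given node, only the best (lowest-distance) one is kept for
 * re-advertisement.
 */
#include <stdint.h>
#include <stdlib.h>

struct span_adv {
        uint64_t         node_id;       /* node being advertised */
        uint32_t         distance;      /* path cost from the advertiser */
        struct span_adv *next;
};

/*
 * Build a filtered list holding one entry per node_id: the lowest
 * distance seen on the input list.  (O(n^2) for clarity only.)
 */
struct span_adv *
span_filter_best(const struct span_adv *in)
{
        struct span_adv *out = NULL;
        struct span_adv *scan;
        const struct span_adv *adv;

        for (adv = in; adv != NULL; adv = adv->next) {
                for (scan = out; scan != NULL; scan = scan->next) {
                        if (scan->node_id == adv->node_id)
                                break;
                }
                if (scan != NULL) {
                        /* already tracked; keep the shorter path */
                        if (adv->distance < scan->distance)
                                scan->distance = adv->distance;
                        continue;
                }
                /* first advertisement seen for this node, keep it */
                scan = malloc(sizeof(*scan));
                if (scan == NULL)
                        break;          /* out of memory, keep what we have */
                scan->node_id = adv->node_id;
                scan->distance = adv->distance;
                scan->next = out;
                out = scan;
        }
        return (out);
}
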
I've been able to do a lot of testing of the messaging infrastructure
using the block device service and client so I know that certain major
requirements such as concurrent transactions (command parallelism)
and positive acknowledgement of failure conditions & disconnects now
work relatively well.
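As a rough illustration of what positive acknowledgement of a
disconnect means here (the structures and names below are hypothetical,
not the actual messaging API): when a virtual circuit is lost, every
transaction still pending on it is handed a synthesized failure reply,
so no originator is left waiting forever.

/*
 * Illustrative sketch, hypothetical structures -- not the real kernel
 * messaging API.  Positive acknowledgement of a disconnect: when a
 * virtual circuit dies, every transaction still pending on it gets a
 * synthesized failure reply instead of being left hanging.
 */
#include <errno.h>
#include <stdint.h>

#define XACT_F_FAILED   0x0001

struct xact {
        uint64_t        msgid;          /* transaction identifier */
        uint32_t        flags;
        struct xact     *next;
        /* originator's completion callback, given an errno-style error */
        void            (*callback)(struct xact *xact, int error);
};

struct circuit {
        struct xact     *pending;       /* transactions in flight on this VC */
};

/*
 * Tear down a virtual circuit: every pending transaction is completed
 * with a failure, so no originator waits on a peer that is gone.
 */
void
circuit_disconnect(struct circuit *circ)
{
        struct xact *xact;

        while ((xact = circ->pending) != NULL) {
                circ->pending = xact->next;
                xact->flags |= XACT_F_FAILED;
                xact->callback(xact, EIO);
        }
}
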
The HAMMER2 VFS is still where it was before... in need of a real
block allocator. The media format is solid and all VOPs work. There
is one known issue with the hardlink support related to non-trivial
renames of directories which move them to other areas of the directory
tree (and of course the lack of a block allocator). Both issues
will be addressed in 2013. Apart from those two issues the filesystem
is very stable. Again, please note that the filesystem is not usable
for anything real until it at least gets a working block allocator
(unless you like the idea of write-once media).
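To make the write-once comment concrete, here is a tiny hypothetical
sketch (not the real media code): with nothing but a bump pointer
handing out fresh blocks and no freemap, every copy-on-write update
consumes new space and nothing is ever reclaimed, which is exactly
what a real block allocator has to fix.

/*
 * Hypothetical sketch of why the current state behaves like write-once
 * media -- not the real media code.  With only a bump pointer and no
 * freemap, every copy-on-write update consumes fresh space and freed
 * blocks are never recovered.
 */
#include <stdint.h>

#define H2_BLOCK_SIZE   65536ULL        /* illustrative block size */

struct bump_alloc {
        uint64_t        next_off;       /* next unused media offset */
        uint64_t        media_size;     /* total media size in bytes */
};

/* Allocate a block; returns (uint64_t)-1 once the media fills up. */
uint64_t
bump_alloc_block(struct bump_alloc *ba)
{
        uint64_t off;

        if (ba->next_off + H2_BLOCK_SIZE > ba->media_size)
                return ((uint64_t)-1);  /* 'disk full', permanently */
        off = ba->next_off;
        ba->next_off += H2_BLOCK_SIZE;
        return (off);
}

/* With no freemap there is nothing to do here -- hence write-once. */
void
bump_free_block(struct bump_alloc *ba, uint64_t off)
{
        (void)ba;
        (void)off;
}
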
Next Steps
There are two bits of low-hanging fruit (but still quite complex) that
I will be working on next:
(1) Implementing a block allocator for HAMMER2 would allow people to
begin using it more seriously, despite the current lack of recovery
features.
(2) Implementing one of the two HAMMER2 copies features. Specifically,
block redundancy implemented at the filesystem level (this is NOT
mirroring). This feature will allow HAMMER2 to record more than one
copy of each block and then select whichever of the available copies
is good when recursing through indirect blocks. If one candidate
fails or has a CRC problem, HAMMER2 can continue to operate as long
as at least one of the copies is good. (A rough read-side sketch of
this selection appears below.)
(The second copies-style feature is the full blown
HAMMER2<->HAMMER2 clustering/quorum protocol which is still high
up on the tree ... not low hanging until cluster cache coherency
is implemented).
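Here is the read-side sketch referenced in item (2).  The blockref
layout, helper functions, and names are hypothetical, not the actual
HAMMER2 media format; the point is just that each block reference
records several candidate copies and the first copy whose check code
verifies gets used.

/*
 * Hypothetical sketch of the copies read path -- not the actual
 * HAMMER2 blockref layout.  A block reference records several
 * candidate copies; the first copy that reads cleanly and passes its
 * check code is used, so a bad copy only costs a retry.
 */
#include <stddef.h>
#include <stdint.h>

#define COPIES_MAX      4

struct blockref_copies {
        uint64_t        data_off[COPIES_MAX];   /* media offset per copy */
        uint32_t        check_crc;              /* expected CRC of the data */
        int             ncopies;
};

/* Assumed helpers: raw media read (0 on success) and CRC32 of a buffer. */
int             media_read(uint64_t off, void *buf, size_t bytes);
uint32_t        crc32_buf(const void *buf, size_t bytes);

/*
 * Fill *buf from the first copy that validates.  Returns 0 on success
 * or -1 if every copy is unreadable or fails its CRC.
 */
int
copies_read(const struct blockref_copies *bref, void *buf, size_t bytes)
{
        int i;

        for (i = 0; i < bref->ncopies; ++i) {
                if (media_read(bref->data_off[i], buf, bytes) != 0)
                        continue;       /* I/O error, try the next copy */
                if (crc32_buf(buf, bytes) == bref->check_crc)
                        return (0);     /* good copy found */
                /* CRC mismatch: this copy is bad, try the next one */
        }
        return (-1);
}
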
Implementing copies
Block redundancy at the HAMMER2 VFS level (not the block device level)
is actually fairly complex because one major feature that we want is
to be able to maintain the best performance possible in the face
of ANY media failure. To maintain performance and not stall
indefinitely in the face of a media or connectivity failure,
writes must still be retired (they can't be held in memory forever
on a production system because, of course, you will run out
of memory). That is, it has to be a queueless implementation.
Simple CRC failures are disturbing but easy to heal without losing
synchronization.  Long-lasting network failures and/or local media
failures or media I/O errors are another situation entirely, and in
those cases synchronization must be allowed to be lost. That is,
one must be able to continue to retire data to the copies that are
still available, and then catch up the lost media when it comes back
online.
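A rough sketch of the queueless idea (hypothetical structures, not the
real flush code): a write is retired to every copy that is currently
good, a copy that fails is simply marked stale on the spot instead of
having data queued for it, and the later resynchronization pass
rebuilds the stale copy from the copies that stayed current.

/*
 * Hypothetical sketch of queueless write retirement -- not the real
 * HAMMER2 flush code.  A write is retired to every copy that is
 * currently good.  A copy that fails is marked stale on the spot;
 * nothing is queued in memory for it.
 */
#include <stddef.h>
#include <stdint.h>

#define COPY_F_STALE    0x0001

struct copy_state {
        int             flags;
        uint64_t        sync_tid;       /* last transaction fully on media */
};

/* Assumed helper: write one block to one copy, returns 0 on success. */
int             copy_write_block(int copy_index, uint64_t off,
                                 const void *buf, size_t bytes);

/*
 * Retire a block to all copies.  Returns how many copies accepted the
 * write; the caller only needs one good copy to keep running.
 */
int
copies_retire(struct copy_state *copies, int ncopies,
              uint64_t off, const void *buf, size_t bytes, uint64_t tid)
{
        int i;
        int ngood = 0;

        for (i = 0; i < ncopies; ++i) {
                if (copies[i].flags & COPY_F_STALE)
                        continue;       /* already waiting for a resync */
                if (copy_write_block(i, off, buf, bytes) == 0) {
                        copies[i].sync_tid = tid;
                        ++ngood;
                } else {
                        /* do not queue: mark stale and keep going */
                        copies[i].flags |= COPY_F_STALE;
                }
        }
        return (ngood);
}
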
* If the lost media is still being caught-up after it comes back online
and your other copy or copies (all the others) go bad before it can
complete the resynchronization, the filesystem can no longer continue
to operate with a full consistency guarantee for currently running
programs.
At this point either rule-based or operator intervention is required
to select one of the copies still working as the new master. This
necessitates a remount and killing/restarting any running programs that
expect consistency to be maintained, or rebooting the machine entirely.
(The key though is that you can get the system up and running again
and not just leave it stalled out all day or all week).
* Copies is not really multi-master. The root block for each copy
will have a reference to the root block of all the other copies,
and so on at each indirection level. When things are in sync
everything is fine, HAMMER2 would be able to use any of the root
blocks as its 'master'. When things get out of sync HAMMER2 must
choose one of the working copies' root blocks as the 'master'.
NOTE! In a normal crash/reboot situation where all the copies are
good, but not quite synchronized due to the crash, it would not matter
which copy HAMMER2 uses as its master for synchronization purposes
since any sync'd or fsync'd data will properly exist on all copies.
The complexity here is that HAMMER2 must always select one of the copies
as its master, because validation always starts at the root block.
As long as things are synchronized it can CHANGE this selection in
order to deal with failures. But once a failure occurs and things
become unsynchronized the fact that all changes must propagate to
the root means that continuing operations (which we do) will cause the
root block of the selected master to now desynchronize from the root
block of any failed copies, even if the actual differences between
the copies are deeper in the tree.
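A very rough sketch of the selection rule (purely illustrative
structures, not the on-media root block format): each copy's root
carries cross-references describing the sync state of the other
copies, and the copy chosen as master is a root that validates and
has the most recent synchronization point.

/*
 * Purely illustrative -- not the actual HAMMER2 root block format.
 * Each copy's root block carries cross-references describing the
 * synchronization point it believes every copy has reached.  Master
 * selection picks a root that validates and has the most recent sync
 * point; while all copies are synchronized any of them would do.
 */
#include <stdint.h>

#define COPIES_MAX      4

struct root_block {
        uint64_t        sync_tid;                  /* this copy's sync point */
        uint64_t        peer_sync_tid[COPIES_MAX]; /* believed state of peers */
        int             valid;                     /* root passed validation */
};

/*
 * Return the index of the copy to use as master, or -1 if no root
 * validates at all.
 */
int
copies_select_master(const struct root_block *roots, int ncopies)
{
        int i;
        int best = -1;

        for (i = 0; i < ncopies; ++i) {
                if (!roots[i].valid)
                        continue;
                if (best < 0 || roots[i].sync_tid > roots[best].sync_tid)
                        best = i;
        }
        return (best);
}
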
Thus the resynchronization code must revalidate the trees for all the
other copies, which it can do recursively, stopping the moment it hits
branches that are found to be fully synchronized. This is
optimal because it gives us a limited recursion that ultimately only
drills down to the branches that are actually desynchronized. It is
also how the resynchronization is able to operate 'queueless'. Continuing
write operations can only go to those copies that are validated up
to the point in the tree where the write occurs. The other copies
will catch up as they are resynchronized.
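The limited recursion might be sketched as follows (hypothetical node
structure and helper, not the real synchronization code, and it
assumes both trees have the same shape): the scan descends only where
the master and the stale copy disagree and stops the moment a subtree
compares as already synchronized.

/*
 * Hypothetical sketch of the limited-recursion resynchronization --
 * not the real code, and it assumes master and copy trees have the
 * same shape.  Subtrees whose modify_tid already matches are skipped,
 * so the scan only drills into branches that actually diverged.
 */
#include <stdint.h>

struct tree_node {
        uint64_t        modify_tid;     /* last transaction touching subtree */
        int             nchildren;
        struct tree_node **children;
};

/*
 * Assumed helper: copies one node's block from the master to the
 * stale copy and updates copy->modify_tid to match.
 */
void    resync_copy_node(const struct tree_node *master,
                         struct tree_node *copy);

void
resync_recurse(const struct tree_node *master, struct tree_node *copy)
{
        int i;

        /*
         * A matching modify_tid means this entire subtree is already
         * synchronized, so the recursion stops here.
         */
        if (copy->modify_tid == master->modify_tid)
                return;

        /* Fix the children first, then the node itself (bottom-up). */
        for (i = 0; i < master->nchildren; ++i)
                resync_recurse(master->children[i], copy->children[i]);

        resync_copy_node(master, copy);
}
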
--
Things are starting to get exciting now that the messaging is working.
-Matt
Matthew Dillon
<dillon at backplane.com>