HAMMER2 update - Dec 10 2012

Tue Dec 11 01:43:44 PST 2012

Eagerly looking forward to it in the year to come!

Thumbs up for your work, Matt!

Season's greetings to you and your team!

On 12/10/12, Matthew Dillon <dillon at apollo.backplane.com> wrote:
>     It's been a while since I last gave a status report for HAMMER2
>     development, so here it is.
>
>     Firstly, please remember that a huge amount of end-to-end work needs
>     to be done before any of the really big-ticket items become usable.
>     HAMMER2 will not be production-ready for a long time.
>
>     Most of what I have been doing the last few months has been on
>     the network cluster protocols.  In order to test these protocols I
>     have also worked on a remote block device client and service which
>     utilizes them (whereby one machine's raw disks can show up
>     on another machine and survive temporary network disconnects and such).
>
>     Very significant progress has been made on the clustering messaging
>     protocols:
>
>     * The spanning tree works and self heals.
>
>     * Connection filtering for spanning tree advertisements works.
>       (the filtering simplifies the work the kernel messaging code must
>       do, which is important because I want the kernel-side of things
>       to be as simply-implemented as possible to make things more
>       portable).
>
>     * Virtual circuits are properly forged and (so far) appear to
>       properly disconnect on network disruption.
>
>     * Individual transactions can now span the cluster end-to-end over the
>       VCs and mostly appear to properly disconnect on network disruption.
>
>     * A lot of bug fixing has made the userland portion of the code
>       significantly more robust.  Still lots to do here.
>
>     * With help from Alex Hornung the encryption has been significantly
>       beefed up.
>
>     * The API for the kernel support module for the cluster messaging
>       has been significantly beefed up and made more robust.  It now
>       features a lot of rollup automation to make the device driver and
>       HAMMER2 VFS side easier.
>
>     * The kernel's disk subsystem is now able to export block devices
>       via the cluster messaging interface.
>
>     * And I have a mostly-working (but not production quality) client-side
>       to import block devices via the messaging interface.
>
>     I've been able to do a lot of testing of the messaging infrastructure
>     using the block device service and client so I know that certain major
>     requirements such as concurrent transactions (command parallelism)
>     and positive acknowledgement of failure conditions & disconnects now
>     work relatively well.
>
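To illustrate what "positive acknowledgement of failure conditions & disconnects" can mean for concurrent transactions, here is a minimal Python sketch. It is not the HAMMER2 messaging code; the `Circuit` class, its method names, and the reply tuples are all hypothetical. The only idea it shows is that every open transaction receives an explicit error when the link drops, instead of waiting forever.

```python
class Circuit:
    """Hypothetical stand-in for one virtual circuit carrying many transactions."""

    def __init__(self):
        self.next_id = 1
        self.open = {}  # transaction id -> completion callback

    def begin(self, callback):
        # Several transactions may be open at once (command parallelism).
        txid = self.next_id
        self.next_id += 1
        self.open[txid] = callback
        return txid

    def reply(self, txid, payload):
        # Normal completion: the transaction is retired with its result.
        self.open.pop(txid)(("ok", payload))

    def disconnect(self):
        # On link loss, every still-open transaction is failed explicitly.
        pending, self.open = self.open, {}
        for txid in sorted(pending):
            pending[txid](("error", "disconnected"))


results = []
circuit = Circuit()
t1 = circuit.begin(lambda r: results.append(("t1", r)))
t2 = circuit.begin(lambda r: results.append(("t2", r)))
circuit.reply(t2, b"data")   # t2 completes while t1 is still outstanding
circuit.disconnect()         # t1 is told about the failure
```

After the disconnect no transaction is left open, so a caller such as a block device client can decide whether to retry or report the error.
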
>     The HAMMER2 VFS is still where it was before... in need of a real
>     block allocator.  The media format is solid and all VOPs work.  There
>     is one known issue with the hardlink support related to non-trivial
>     renames of directories which move them to other areas of the directory
>     tree (and of course the lack of a block allocator).  Both issues
>     will be addressed in 2013.  Apart from those two issues the filesystem
>     is very stable.  Again, please note that the filesystem is not usable
>     for anything real until it at least gets a working block allocator
>     (unless you like the idea of write-once-media).
>
> 				Next Steps
>
>     There are two bits of low-hanging (but still quite complex) fruit that I
>     will be working on next:
>
>     (1) Implementing a block allocator for HAMMER2 would allow people to
> 	begin using it more seriously, despite the lack of recovery features
> 	(yet).
>
>     (2) Implementing one of the two HAMMER2 copies features.  Specifically,
> 	block redundancy implemented at the filesystem level (this is NOT
> 	mirroring).  This feature will allow HAMMER2 to record more than
> 	one block and then select whichever of the available blocks is good
> 	when recursing through indirect blocks.  If one candidate fails
> 	or has a CRC problem, HAMMER2 can continue to operate as long as
> 	at least one of the copies is good.
>
> 	(The second copies-style feature is the full blown
> 	 HAMMER2<->HAMMER2 clustering/quorum protocol which is still high
> 	 up on the tree ... not low hanging until cluster cache coherency
> 	 is implemented).
>
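The block selection described in (2) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not HAMMER2's implementation: the `(device, offset, expected_crc)` candidate layout, the `read_block` name, and the use of `zlib.crc32` as the check code are all hypothetical.

```python
import zlib


def read_block(candidates, read_fn):
    """Return the data of the first candidate copy that reads and verifies.

    candidates: list of (device, offset, expected_crc) tuples (assumed layout).
    read_fn:    callable(device, offset) -> bytes; may raise OSError.
    """
    for device, offset, expected_crc in candidates:
        try:
            data = read_fn(device, offset)
        except OSError:
            continue  # media or connectivity failure: try the next copy
        if zlib.crc32(data) == expected_crc:
            return data  # this copy is good
        # CRC mismatch: fall through and try the next copy
    raise OSError("no good copy of this block is available")


# Tiny demonstration with three copies: one unreachable, one corrupt, one good.
good = b"hello"
store = {("b", 0): b"hexlo", ("c", 0): good}


def fake_read(device, offset):
    if device == "a":
        raise OSError("device unreachable")
    return store[(device, offset)]


crc = zlib.crc32(good)
result = read_block([("a", 0, crc), ("b", 0, crc), ("c", 0, crc)], fake_read)
```

The read succeeds as long as at least one candidate is both reachable and passes its check, which matches the behaviour described above.
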
> 			    Implementing copies
>
>     Block redundancy at the HAMMER2 VFS level (not the block device level)
>     is actually fairly complex because one major feature that we want is
>     to be able to maintain the best performance possible in the face
>     of ANY media failure.  To maintain performance and not stall
>     indefinitely in the face of a media or connectivity failure,
>     writes must still be retired (they can't be held in memory forever
>     on a production system because, of course, you will run out
>     of memory).  That is, it has to be a queueless implementation.
>
>     Simple CRC failures are disturbing but easy to heal without losing
>     synchronization.  Long-lasting network failures and/or local media
>     failures or media I/O errors are another situation entirely and in
>     those cases synchronization must be allowed to be lost.  That is,
>     one must be able to continue to retire data to the copies that are
>     still available, and then catch-up the lost media when it comes back
>     online.
>
>     * If the lost media is still being caught-up after it comes back online
>       and your other copy or copies (all the others) go bad before it can
>       complete the resynchronization, the filesystem can no longer continue
>       to operate with a full consistency guarantee for currently running
>       programs.
>
>       At this point either rule-based or operator intervention is required
>       to select one of the copies still working as the new master.  This
>       necessitates a remount and killing/restarting any running programs that
>       expect consistency to be maintained, or rebooting the machine entirely.
>       (The key though is that you can get the system up and running again
>       and not just leave it stalled out all day or all week).
>
>     * Copies is not really multi-master.  The root block for each copy
>       will have a reference to the root block of all the other copies,
>       and so on at each indirection level.  When things are in sync
>       everything is fine, HAMMER2 would be able to use any of the root
>       blocks as its 'master'.  When things get out of sync HAMMER2 must
>       choose one of the working copies' root blocks as the 'master'.
>
>       NOTE!  In a normal crash/reboot situation where all the copies are
>       good, but not quite synchronized due to the crash, it would not matter
>       which copy HAMMER2 uses as its master for synchronization purposes
>       since any sync'd or fsync'd data will properly exist on all copies.
>
>     The complexity here is that HAMMER2 must always select one of the copies
>     as its master, because validation always starts at the root block.
>     As long as things are synchronized it can CHANGE this selection in
>     order to deal with failures.  But once a failure occurs and things
>     become unsynchronized the fact that all changes must propagate to
>     the root means that continuing operations (which we do) will cause the
>     root block of the selected master to now desynchronize from the root
>     block of any failed copies, even if the actual differences between
>     the copies are deeper in the tree.
>
>     Thus the resynchronization code must revalidate the trees for all the
>     other copies, which it can do recursively, stopping the moment it hits
>     other branches which are found to be fully synchronized.  This is
>     optimal because it gives us a limited recursion that ultimately only
>     drills down to the branches that are actually desynchronized.  It is
>     also how the resynchronization is able to operate 'queueless'.  Continuing
>     write operations can only go to those copies that are validated up
>     to the point in the tree where the write occurs.  The other copies
>     will catch-up as they are resynchronized.
>
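The limited recursion described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Node` class, the SHA-256 check code, and the requirement that both trees have the same shape are hypothetical and not taken from HAMMER2. Each node's check covers its own data plus its children's checks, so any change below a node shows up at the root.

```python
import hashlib


class Node:
    """Hypothetical tree node whose check code covers its whole subtree."""

    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []
        self.refresh()

    def refresh(self):
        h = hashlib.sha256(self.data)
        for child in self.children:
            h.update(child.check)
        self.check = h.digest()


def resync(master, stale):
    """Bring `stale` in line with `master`; return the number of nodes compared."""
    if master.check == stale.check:
        return 1  # subtree already synchronized: stop here
    compared = 1
    stale.data = master.data
    for m, s in zip(master.children, stale.children):  # same shape assumed
        compared += resync(m, s)
    stale.refresh()
    return compared


def build(a2_data):
    # Seven nodes: a root, two interior nodes, four leaves.
    return Node(b"root", [
        Node(b"A", [Node(b"a1"), Node(a2_data)]),
        Node(b"B", [Node(b"b1"), Node(b"b2")]),
    ])


master = build(b"new")   # copy that kept taking writes
stale = build(b"old")    # copy that missed one leaf update
compared = resync(master, stale)
```

In this example only the path from the root to the changed leaf is descended; the leaves under `B` are never looked at because `B` itself already matches. The sketch does not model the rule that new writes may only go to copies validated up to the write's position in the tree.
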
>     --
>
>     Things are starting to get exciting now that the messaging is working.
>
> 					-Matt
> 					Matthew Dillon
> 					<dillon at backplane.com>
>