HAMMER2 update - Dec 10 2012
Zenny
garbytrash at gmail.com
Tue Dec 11 01:43:44 PST 2012
Eagerly looking forward to it in the year to come!
Thumbs up for your work, Matt!
Season's greetings to you and your team!
On 12/10/12, Matthew Dillon <dillon at apollo.backplane.com> wrote:
> It's been a while since I last gave a status report for HAMMER2
> development, so here it is.
>
> Firstly, please remember that a huge amount of end-to-end work needs
> to be done before any of the really big-ticket items become usable.
> HAMMER2 will not be production-ready for a long time.
>
> Most of what I have been doing the last few months has been on
> the network cluster protocols. In order to test these protocols I
> have also worked on a remote block device client and service which
> utilizes them (whereby one machine's raw disks can show up
> on another machine and survive temporary network disconnects and such).
>
> Very significant progress has been made on the clustering messaging
> protocols:
>
> * The spanning tree works and self heals.
>
> * Connection filtering for spanning tree advertisements works.
> (the filtering simplifies the work the kernel messaging code must
> do, which is important because I want the kernel side of things
> to be implemented as simply as possible to make it more portable).
>
> * Virtual circuits are properly forged and (so far) appear to
> properly disconnect on network disruption.
>
> * Individual transactions can now span the cluster end-to-end over the
> VCs and mostly appear to properly disconnect on network disruption.
>
> * A lot of bug fixing has made the userland portion of the code
> significantly more robust. Still lots to do here.
>
> * With help from Alex Hornung the encryption has been significantly
> beefed up.
>
> * The API for the kernel support module for the cluster messaging
> has been significantly beefed up and made more robust. It now
> features a lot of rollup automation to make the device driver and
> HAMMER2 VFS side easier.
>
> * The kernel's disk subsystem is now able to export block devices
> via the cluster messaging interface.
>
> * And I have a mostly-working (but not production-quality) client
> side for importing block devices via the messaging interface.
>
> I've been able to do a lot of testing of the messaging infrastructure
> using the block device service and client so I know that certain major
> requirements such as concurrent transactions (command parallelism)
> and positive acknowledgement of failure conditions & disconnects now
> work relatively well.
>
> The HAMMER2 VFS is still where it was before... in need of a real
> block allocator. The media format is solid and all VOPs work. There
> is one known issue with the hardlink support related to non-trivial
> renames of directories which move them to other areas of the directory
> tree (and of course the lack of a block allocator). Both issues
> will be addressed in 2013. Apart from those two issues the filesystem
> is very stable. Again, please note that the filesystem is not usable
> for anything real until it at least gets a working block allocator
> (unless you like the idea of write-once-media).
>
> Next Steps
>
> There are two bits of low hanging fruit (but still quite complex) that
> I will be working on next:
>
> (1) Implementing a block allocator for HAMMER2 would allow people to
> begin using it more seriously, despite the current lack of
> recovery features.
>
> (2) Implementing one of the two HAMMER2 copies features. Specifically,
> block redundancy implemented at the filesystem level (this is NOT
> mirroring). This feature will allow HAMMER2 to record more than
> one block and then select whichever of the available blocks is good
> when recursing through indirect blocks. If one candidate fails
> or has a CRC problem, HAMMER2 can continue to operate as long as
> at least one of the copies is good (see the sketch after this list).
>
> (The second copies-style feature is the full blown
> HAMMER2<->HAMMER2 clustering/quorum protocol which is still high
> up on the tree ... not low hanging until cluster cache coherency
> is implemented).
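>
> To make the read side of this concrete, here is a minimal sketch in
> C. None of these names come from the actual HAMMER2 source; the
> structure and helpers (h2_blockref, h2_bread(), h2_crc_ok(),
> h2_brelse()) are hypothetical stand-ins:
>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     /* hypothetical, simplified media reference */
>     struct h2_blockref {
>             uint64_t data_off;      /* media offset of the block */
>             uint32_t check;         /* expected CRC of the block */
>     };
>
>     struct h2_buf;                  /* opaque data buffer */
>     struct h2_buf *h2_bread(const struct h2_blockref *bref);
>     int h2_crc_ok(const struct h2_buf *bp, const struct h2_blockref *bref);
>     void h2_brelse(struct h2_buf *bp);
>
>     /*
>      * Try each recorded copy of a block in turn and return the
>      * first one that reads cleanly and passes its CRC check.
>      */
>     struct h2_buf *
>     h2_read_any_copy(const struct h2_blockref *copies, int ncopies)
>     {
>             int i;
>
>             for (i = 0; i < ncopies; ++i) {
>                     struct h2_buf *bp = h2_bread(&copies[i]);
>                     if (bp == NULL)         /* media I/O error */
>                             continue;       /* try the next copy */
>                     if (h2_crc_ok(bp, &copies[i]))
>                             return (bp);    /* good copy found */
>                     h2_brelse(bp);          /* CRC mismatch */
>             }
>             return (NULL);                  /* all copies failed */
>     }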
>
> Implementing copies
>
> Block redundancy at the HAMMER2 VFS level (not the block device level)
> is actually fairly complex because one major feature that we want is
> to be able to maintain the best performance possible in the face
> of ANY media failure. To maintain performance and not stall
> indefinitely in the face of a media or connectivity failure,
> writes must still be retired (they can't be held in memory forever
> on a production system because, of course, you will run out
> of memory). That is, it has to be a queueless implementation.
>
> Simple CRC failures are disturbing but easy to heal without losing
> synchronization. Long-lasting network failures, local media
> failures, and media I/O errors are another situation entirely, and in
> those cases synchronization must be allowed to be lost. That is,
> one must be able to continue to retire data to the copies that are
> still available, and then catch up the lost media when it comes back
> online.
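>
> As a rough sketch of what 'queueless' means here (hypothetical names
> again, reusing the h2_blockref sketch above): writes are retired to
> every copy that is currently good, and a copy that fails is simply
> marked desynchronized instead of having writes queued against it:
>
>     #define H2_COPY_GOOD     0      /* hypothetical copy states */
>     #define H2_COPY_DESYNCED 1
>
>     #define H2_MAXCOPIES     4      /* hypothetical limit */
>
>     struct h2_copyset {
>             int ncopies;
>             int state[H2_MAXCOPIES];                 /* per-copy state */
>             struct h2_blockref copies[H2_MAXCOPIES]; /* per-copy refs */
>     };
>
>     int h2_bwrite(const struct h2_blockref *bref, const void *data);
>
>     /*
>      * Retire a write to every copy that is still good.  A failing
>      * copy is marked desynchronized and skipped from then on, so
>      * the write path never stalls or queues behind a dead copy;
>      * the lagging copy is caught up later by resynchronization.
>      */
>     void
>     h2_retire_write(struct h2_copyset *cs, const void *data)
>     {
>             int i;
>
>             for (i = 0; i < cs->ncopies; ++i) {
>                     if (cs->state[i] != H2_COPY_GOOD)
>                             continue;       /* already lagging */
>                     if (h2_bwrite(&cs->copies[i], data) != 0)
>                             cs->state[i] = H2_COPY_DESYNCED;
>             }
>     }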
>
> * If the lost media is still being caught up after it comes back online
> and your other copy or copies (all the others) go bad before it can
> complete the resynchronization, the filesystem can no longer continue
> to operate with a full consistency guarantee for currently running
> programs.
>
> At this point either rule-based or operator intervention is required
> to select one of the copies still working as the new master. This
> necessitates a remount and killing/restarting any running programs that
> expect consistency to be maintained, or rebooting the machine
> entirely.
> (The key though is that you can get the system up and running again
> and not just leave it stalled out all day or all week).
>
> * Copies is not really multi-master. The root block for each copy
> will have a reference to the root block of all the other copies,
> and so on at each indirection level. When things are in sync
> everything is fine, HAMMER2 would be able to use any of the root
> blocks as its 'master'. When things get out of sync HAMMER2 must
> choose one of the working copies' root blocks as the 'master'.
>
> NOTE! In a normal crash/reboot situation where all the copies are
> good, but not quite synchronized due to the crash, it would not
> matter which copy HAMMER2 uses as its master for synchronization
> purposes, since any sync'd or fsync'd data will properly exist on
> all copies.
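>
> In data structure terms the cross-referencing might look roughly
> like this (a hypothetical sketch, not the real media format; it
> reuses h2_blockref and H2_MAXCOPIES from the sketches above):
>
>     /*
>      * Every root or indirect block records a blockref for each
>      * copy of each child, at every indirection level, so whichever
>      * copy is selected as 'master' can reach and validate the
>      * corresponding blocks of all the other copies.
>      */
>     struct h2_indblock {
>             struct {
>                     struct h2_blockref ref[H2_MAXCOPIES];
>             } child[64];    /* refs to the same logical child
>                              * on each of the copies */
>     };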
>
> The complexity here is that HAMMER2 must always select one of the
> copies as its master, because validation always starts at the root
> block. As long as things are synchronized it can CHANGE this
> selection in order to deal with failures. But once a failure occurs
> and things become unsynchronized, the fact that all changes must
> propagate to the root means that continuing operations (which we do)
> will cause the root block of the selected master to now desynchronize
> from the root block of any failed copies, even if the actual
> differences between the copies are deeper in the tree.
>
> Thus the resynchronization code must revalidate the trees for all the
> other copies, which it can do recursively, stopping the moment it hits
> branches which are found to be fully synchronized. This is optimal
> because it gives us a limited recursion that ultimately only drills
> down to the branches that are actually desynchronized. It is also
> how the resynchronization is able to operate 'queueless'. Continuing
> write operations can only go to those copies that are validated up
> to the point in the tree where the write occurs. The other copies
> will catch up as they are resynchronized.
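>
> The limited recursion could be sketched like this (hypothetical
> names once more; h2_bref_match() compares the check codes of two
> subtrees, and matching topology is assumed for brevity):
>
>     struct h2_chain {
>             int nchildren;
>             struct h2_chain **child;
>             /* ... media blockref, cached data, etc ... */
>     };
>
>     int  h2_bref_match(const struct h2_chain *m, const struct h2_chain *c);
>     void h2_update_block(struct h2_chain *c, const struct h2_chain *m);
>
>     /*
>      * Revalidate a lagging copy against the selected master.
>      * Because all changes propagate their check codes to the root,
>      * a matching blockref proves the whole subtree is identical,
>      * so the recursion is pruned there and only drills into the
>      * branches that are actually desynchronized.  Children are
>      * fixed before the parent so the copy never references an
>      * unvalidated child.
>      */
>     void
>     h2_resync(const struct h2_chain *master, struct h2_chain *copy)
>     {
>             int i;
>
>             if (h2_bref_match(master, copy))
>                     return;         /* subtree fully synchronized */
>             for (i = 0; i < master->nchildren; ++i)
>                     h2_resync(master->child[i], copy->child[i]);
>             h2_update_block(copy, master);  /* fix this level last */
>     }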
>
> --
>
> Things are starting to get exciting now that the messaging is working.
>
> -Matt
> Matthew Dillon
> <dillon at backplane.com>
>