HAMMER2 update - Dec 10 2012
Matthew Dillon
dillon at apollo.backplane.com
Mon Dec 10 10:53:43 PST 2012
It's been a while since I last gave a status report for HAMMER2
development, so here it is.
Firstly, please remember that a huge amount of end-to-end work needs
to be done before any of the really big-ticket items become usable.
HAMMER2 will not be production-ready for a long time.
Most of what I have been doing the last few months has been on
the network cluster protocols. In order to test these protocols I
have also worked on a remote block device client and service which
utilize them (whereby one machine's raw disks can show up on another
machine and survive temporary network disconnects and such).
Very significant progress has been made on the clustering messaging
protocols:
* The spanning tree works and self heals.
* Connection filtering for spanning tree advertisements works.
(The filtering simplifies the work the kernel messaging code must
do, which is important because I want the kernel side of things
implemented as simply as possible to make it more portable. A
rough sketch of the filtering idea follows this list.)
* Virtual circuits are properly forged and (so far) appear to
properly disconnect on network disruption.
* Individual transactions can now span the cluster end-to-end over the
VCs and mostly appear to properly disconnect on network disruption.
* A lot of bug fixing has made the userland portion of the code
significantly more robust. Still lots to do here.
* With help from Alex Hornung the encryption has been significantly
beefed up.
* The API for the kernel support module for the cluster messaging
has been significantly beefed up and made more robust. It now
features a lot of rollup automation to make the device driver and
HAMMER2 VFS side easier.
* The kernel's disk subsystem is now able to export block devices
via the cluster messaging interface.
* And I have a mostly-working (but not production quality) client-side
to import block devices via the messaging interface.
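To illustrate the SPAN advertisement filtering mentioned above, here is
a rough sketch.  The structures and names are made up for illustration
and are not the actual dmsg/kdmsg code; the point is simply that only
the best (lowest-distance) advertisement per node gets re-advertised,
so redundant paths never have to be considered on the kernel side.

/*
 * Illustrative sketch only -- hypothetical structures, not the actual
 * dmsg/kdmsg code.  Of all the spanning tree advertisements received
 * for a given node, only the best (lowest-distance) one is kept for
 * re-advertisement.
 */
#include <stdint.h>
#include <stdlib.h>

struct span_adv {
        uint64_t         node_id;       /* node being advertised */
        uint32_t         distance;      /* path cost from the advertiser */
        struct span_adv *next;
};

/*
 * Build a filtered list holding one entry per node_id: the lowest
 * distance seen on the input list.  (O(n^2) for clarity only.)
 */
struct span_adv *
span_filter_best(const struct span_adv *in)
{
        struct span_adv *out = NULL;
        struct span_adv *scan;
        const struct span_adv *adv;

        for (adv = in; adv != NULL; adv = adv->next) {
                for (scan = out; scan != NULL; scan = scan->next) {
                        if (scan->node_id == adv->node_id)
                                break;
                }
                if (scan != NULL) {
                        /* already tracked; keep the shorter path */
                        if (adv->distance < scan->distance)
                                scan->distance = adv->distance;
                        continue;
                }
                /* first advertisement seen for this node, keep it */
                scan = malloc(sizeof(*scan));
                if (scan == NULL)
                        break;          /* out of memory, keep what we have */
                scan->node_id = adv->node_id;
                scan->distance = adv->distance;
                scan->next = out;
                out = scan;
        }
        return (out);
}
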
I've been able to do a lot of testing of the messaging infrastructure
using the block device service and client so I know that certain major
requirements such as concurrent transactions (command parallelism)
and positive acknowledgement of failure conditions & disconnects now
work relatively well.
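As a rough illustration of what positive acknowledgement of a
disconnect means here (the structures and names below are hypothetical,
not the actual messaging API): when a virtual circuit is lost, every
transaction still pending on it is handed a synthesized failure reply,
so no originator is left waiting forever.

/*
 * Illustrative sketch, hypothetical structures -- not the real kernel
 * messaging API.  Positive acknowledgement of a disconnect: when a
 * virtual circuit dies, every transaction still pending on it gets a
 * synthesized failure reply instead of being left hanging.
 */
#include <errno.h>
#include <stdint.h>

#define XACT_F_FAILED   0x0001

struct xact {
        uint64_t        msgid;          /* transaction identifier */
        uint32_t        flags;
        struct xact     *next;
        /* originator's completion callback, given an errno-style error */
        void            (*callback)(struct xact *xact, int error);
};

struct circuit {
        struct xact     *pending;       /* transactions in flight on this VC */
};

/*
 * Tear down a virtual circuit: every pending transaction is completed
 * with a failure, so no originator waits on a peer that is gone.
 */
void
circuit_disconnect(struct circuit *circ)
{
        struct xact *xact;

        while ((xact = circ->pending) != NULL) {
                circ->pending = xact->next;
                xact->flags |= XACT_F_FAILED;
                xact->callback(xact, EIO);
        }
}
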
The HAMMER2 VFS is still where it was before... in need of a real
block allocator. The media format is solid and all VOPs work. There
is one known issue with the hardlink support related to non-trivial
renames of directories which move them to other areas of the directory
tree (and of course the lack of a block allocator). Both issues
will be addressed in 2013. Apart from those two issues the filesystem
is very stable. Again, please note that the filesystem is not usable
for anything real until it at least gets a working block allocator
(unless you like the idea of write-once media).
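To make the write-once comment concrete, here is a tiny hypothetical
sketch (not the real media code): with nothing but a bump pointer
handing out fresh blocks and no freemap, every copy-on-write update
consumes new space and nothing is ever reclaimed, which is exactly
what a real block allocator has to fix.

/*
 * Hypothetical sketch of why the current state behaves like write-once
 * media -- not the real media code.  With only a bump pointer and no
 * freemap, every copy-on-write update consumes fresh space and freed
 * blocks are never recovered.
 */
#include <stdint.h>

#define H2_BLOCK_SIZE   65536ULL        /* illustrative block size */

struct bump_alloc {
        uint64_t        next_off;       /* next unused media offset */
        uint64_t        media_size;     /* total media size in bytes */
};

/* Allocate a block; returns (uint64_t)-1 once the media fills up. */
uint64_t
bump_alloc_block(struct bump_alloc *ba)
{
        uint64_t off;

        if (ba->next_off + H2_BLOCK_SIZE > ba->media_size)
                return ((uint64_t)-1);  /* 'disk full', permanently */
        off = ba->next_off;
        ba->next_off += H2_BLOCK_SIZE;
        return (off);
}

/* With no freemap there is nothing to do here -- hence write-once. */
void
bump_free_block(struct bump_alloc *ba, uint64_t off)
{
        (void)ba;
        (void)off;
}
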
Next Steps
There are two bits of low-hanging fruit (but still quite complex) that
I will be working on next:
(1) Implementing a block allocator for HAMMER2 would allow people to
begin using it more seriously, despite the current lack of recovery
features.
(2) Implementing one of the two HAMMER2 copies features. Specifically,
block redundancy implemented at the filesystem level (this is NOT
mirroring). This feature will allow HAMMER2 to record more than one
copy of each block and then select whichever of the available copies
is good when recursing through indirect blocks. If one candidate
fails or has a CRC problem, HAMMER2 can continue to operate as long
as at least one of the copies is good. (A rough read-side sketch of
this selection appears below.)
(The second copies-style feature is the full blown
HAMMER2<->HAMMER2 clustering/quorum protocol which is still high
up on the tree ... not low hanging until cluster cache coherency
is implemented).
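Here is the read-side sketch referenced in item (2).  The blockref
layout, helper functions, and names are hypothetical, not the actual
HAMMER2 media format; the point is just that each block reference
records several candidate copies and the first copy whose check code
verifies gets used.

/*
 * Hypothetical sketch of the copies read path -- not the actual
 * HAMMER2 blockref layout.  A block reference records several
 * candidate copies; the first copy that reads cleanly and passes its
 * check code is used, so a bad copy only costs a retry.
 */
#include <stddef.h>
#include <stdint.h>

#define COPIES_MAX      4

struct blockref_copies {
        uint64_t        data_off[COPIES_MAX];   /* media offset per copy */
        uint32_t        check_crc;              /* expected CRC of the data */
        int             ncopies;
};

/* Assumed helpers: raw media read (0 on success) and CRC32 of a buffer. */
int             media_read(uint64_t off, void *buf, size_t bytes);
uint32_t        crc32_buf(const void *buf, size_t bytes);

/*
 * Fill *buf from the first copy that validates.  Returns 0 on success
 * or -1 if every copy is unreadable or fails its CRC.
 */
int
copies_read(const struct blockref_copies *bref, void *buf, size_t bytes)
{
        int i;

        for (i = 0; i < bref->ncopies; ++i) {
                if (media_read(bref->data_off[i], buf, bytes) != 0)
                        continue;       /* I/O error, try the next copy */
                if (crc32_buf(buf, bytes) == bref->check_crc)
                        return (0);     /* good copy found */
                /* CRC mismatch: this copy is bad, try the next one */
        }
        return (-1);
}
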
Implementing copies
Block redundancy at the HAMMER2 VFS level (not the block device level)
is actually fairly complex because one major feature that we want is
to be able to maintain the best performance possible in the face
of ANY media failure. To maintain performance and not stall
indefinitely in the face of a media or connectivity failure,
writes must still be retired (they can't be held in memory forever
on a production system because, of course, you will run out
of memory). That is, it has to be a queueless implementation.
Simple CRC failures are disturbing but easy to heal without losing
synchronization.  Long-lasting network failures and/or local media
failures or media I/O errors are another situation entirely, and in
those cases synchronization must be allowed to be lost. That is,
one must be able to continue to retire data to the copies that are
still available, and then catch up the lost media when it comes back
online.
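A rough sketch of the queueless idea (hypothetical structures, not the
real flush code): a write is retired to every copy that is currently
good, a copy that fails is simply marked stale on the spot instead of
having data queued for it, and the later resynchronization pass
rebuilds the stale copy from the copies that stayed current.

/*
 * Hypothetical sketch of queueless write retirement -- not the real
 * HAMMER2 flush code.  A write is retired to every copy that is
 * currently good.  A copy that fails is marked stale on the spot;
 * nothing is queued in memory for it.
 */
#include <stddef.h>
#include <stdint.h>

#define COPY_F_STALE    0x0001

struct copy_state {
        int             flags;
        uint64_t        sync_tid;       /* last transaction fully on media */
};

/* Assumed helper: write one block to one copy, returns 0 on success. */
int             copy_write_block(int copy_index, uint64_t off,
                                 const void *buf, size_t bytes);

/*
 * Retire a block to all copies.  Returns how many copies accepted the
 * write; the caller only needs one good copy to keep running.
 */
int
copies_retire(struct copy_state *copies, int ncopies,
              uint64_t off, const void *buf, size_t bytes, uint64_t tid)
{
        int i;
        int ngood = 0;

        for (i = 0; i < ncopies; ++i) {
                if (copies[i].flags & COPY_F_STALE)
                        continue;       /* already waiting for a resync */
                if (copy_write_block(i, off, buf, bytes) == 0) {
                        copies[i].sync_tid = tid;
                        ++ngood;
                } else {
                        /* do not queue: mark stale and keep going */
                        copies[i].flags |= COPY_F_STALE;
                }
        }
        return (ngood);
}
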
* If the lost media is still being caught-up after it comes back online
and your other copy or copies (all the others) go bad before it can
complete the resynchronization, the filesystem can no longer continue
to operate with a full consistency guarantee for currently running
programs.
At this point either rule-based or operator intervention is required
to select one of the copies still working as the new master. This
necessitates a remount and killing/restarting any running programs that
expect consistency to be maintained, or rebooting the machine entirely.
(The key though is that you can get the system up and running again
and not just leave it stalled out all day or all week).
* Copies is not really multi-master. The root block for each copy
will have a reference to the root block of all the other copies,
and so on at each indirection level. When things are in sync
everything is fine, HAMMER2 would be able to use any of the root
blocks as its 'master'. When things get out of sync HAMMER2 must
choose one of the working copies' root blocks as the 'master'.
NOTE! In a normal crash/reboot situation where all the copies are
good, but not quite synchronized due to the crash, it would not matter
which copy HAMMER2 uses as its master for synchronization purposes
since any sync'd or fsync'd data will properly exist on all copies.
The complexity here is that HAMMER2 must always select one of the copies
as its master, because validation always starts at the root block.
As long as things are synchronized it can CHANGE this selection in
order to deal with failures. But once a failure occurs and things
become unsynchronized the fact that all changes must propagate to
the root means that continuing operations (which we do) will cause the
root block of the selected master to now desynchronize from the root
block of any failed copies, even if the actual differences between
the copies are deeper in the tree.
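A very rough sketch of the selection rule (purely illustrative
structures, not the on-media root block format): each copy's root
carries cross-references describing the sync state of the other
copies, and the copy chosen as master is a root that validates and
has the most recent synchronization point.

/*
 * Purely illustrative -- not the actual HAMMER2 root block format.
 * Each copy's root block carries cross-references describing the
 * synchronization point it believes every copy has reached.  Master
 * selection picks a root that validates and has the most recent sync
 * point; while all copies are synchronized any of them would do.
 */
#include <stdint.h>

#define COPIES_MAX      4

struct root_block {
        uint64_t        sync_tid;                  /* this copy's sync point */
        uint64_t        peer_sync_tid[COPIES_MAX]; /* believed state of peers */
        int             valid;                     /* root passed validation */
};

/*
 * Return the index of the copy to use as master, or -1 if no root
 * validates at all.
 */
int
copies_select_master(const struct root_block *roots, int ncopies)
{
        int i;
        int best = -1;

        for (i = 0; i < ncopies; ++i) {
                if (!roots[i].valid)
                        continue;
                if (best < 0 || roots[i].sync_tid > roots[best].sync_tid)
                        best = i;
        }
        return (best);
}
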
Thus the resynchronization code must revalidate the trees for all the
other copies, which it can do recursively, stopping the moment it hits
branches that are found to be fully synchronized. This is
optimal because it gives us a limited recursion that ultimately only
drills down to the branches that are actually desynchronized. It is
also how the resynchronization is able to operate 'queueless'. Continuing
write operations can only go to those copies that are validated up
to the point in the tree where the write occurs. The other copies
will catch up as they are resynchronized.
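The limited recursion might be sketched as follows (hypothetical node
structure and helper, not the real synchronization code, and it
assumes both trees have the same shape): the scan descends only where
the master and the stale copy disagree and stops the moment a subtree
compares as already synchronized.

/*
 * Hypothetical sketch of the limited-recursion resynchronization --
 * not the real code, and it assumes master and copy trees have the
 * same shape.  Subtrees whose modify_tid already matches are skipped,
 * so the scan only drills into branches that actually diverged.
 */
#include <stdint.h>

struct tree_node {
        uint64_t        modify_tid;     /* last transaction touching subtree */
        int             nchildren;
        struct tree_node **children;
};

/*
 * Assumed helper: copies one node's block from the master to the
 * stale copy and updates copy->modify_tid to match.
 */
void    resync_copy_node(const struct tree_node *master,
                         struct tree_node *copy);

void
resync_recurse(const struct tree_node *master, struct tree_node *copy)
{
        int i;

        /*
         * A matching modify_tid means this entire subtree is already
         * synchronized, so the recursion stops here.
         */
        if (copy->modify_tid == master->modify_tid)
                return;

        /* Fix the children first, then the node itself (bottom-up). */
        for (i = 0; i < master->nchildren; ++i)
                resync_recurse(master->children[i], copy->children[i]);

        resync_copy_node(master, copy);
}
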
--
Things are starting to get exciting now that the messaging is working.
-Matt
Matthew Dillon
<dillon at backplane.com>