DESIGN document for HAMMER2 (08-Feb-2012 update)

Wed Feb 8 19:20:04 PST 2012

    This is the current design document for HAMMER2.  It lists every feature
    I intend to implement for HAMMER2.  Everything except the freemap and
    cluster protocols (which are both big ticket items) has been completely
    speced out.

    There are many additional features verses the original document,
    including hardlinks.

    HAMMER2 is all I am working on this year so I expect to make good
    progress, but it will probably still be July before we have anything
    usable, and well into 2013 before the whole mess is implemented and
    even later before the clustering is 100% stable.

    However, I expect to be able to stabilize all non-cluster related features
    in fairly short order.  Even though HAMMER2 has a lot more features then
    HAMMER1 the actual design is simpler than HAMMER1, with virtually no edge
    cases to worry about (I spent 12+ months working edge cases out in
    HAMMER1's B-Tree, for example... that won't be an issue for HAMMER2
    development).

    The work is being done in the 'hammer2' branch off the main dragonfly
    repo in appropriate subdirs.  Right now just vsrinivas and I but
    hopefully enough will get fleshed out in a few months that other people
    can help too.

    Ok, here's what I have got.

			    HAMMER2 DESIGN DOCUMENT

			        Matthew Dillon
				 08-Feb-2012
			     dillon at backplane.com

* These features have been speced in the media structures.

* Implementation work has begun.

* A working filesystem with some features implemented is expected by July 2012.

* A fully functional filesystem with most (but not all) features is expected
  by the end of 2012.

* All elements of the filesystem have been designed except for the freemap
  (which isn't needed for initial work).  8MB per 2GB of filesystem
  storage has been reserved for the freemap.  The design of the freemap
  is expected to be completely speced by mid-year.

* This is my only project this year.  I'm not going to be doing any major
  kernel bug hunting this year.

				Feature List

* Multiple roots (allowing snapshots to be mounted).  This is implemented
  via the super-root concept.  When mounting a HAMMER2 filesystem you specify
  a device path and a directory name in the super-root.

* HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  in the tree can be configured as a PFS, causing all elements recursively
  underneath that directory to become a part of that PFS.

* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  show up in the super-root.  It is possible to snapshot a subdirectory
  and then later snapshot a parent of that subdirectory... really there are
  no limitations here.

* Directory sub-hierarchy based quotas and space and inode usage tracking.
  Any directory sub-tree, whether at a mount point or not, tracks aggregate
  inode use and data space use.  This is stored in the directory inode all
  the way up the chain.

* Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  block-oriented and copy-on-write each blockref tracks both direct
  modifications to the referenced data via (modify_tid) and indirect
  modifications to the referenced data or any sub-tree via (mirror_tid).
  This makes it possible to do an incremental scan of meta-data that covers
  only changes made since the mirror_tid recorded in a prior-run.

  This feature is also intended to be used to locate recently allocated
  blocks and thus be able to fixup the freemap after a crash.

  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  that HAMMER2 does not keep track of 'deleted' records.  Instead any
  recursion by the mirroring code which finds that (modify_tid) has
  been updated must also send the direct block table or indirect block
  table state it winds up recursing through so the target can check
  similar key ranges and locate elements to be deleted.  This can be
  avoided if the mirroring stream is mostly caught up in that very recent
  deletions will be cached in memory and can be queried, allowing shorter
  record deletions to be passed in the stream instead.

* Will support multiple compression algorithms configured on subdirectory
  tree basis and on a file basis.  Up to 64K block compression will be used.
  Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1,
  4:1, 8:1, etc) will work in this scheme because physical block allocations
  in HAMMER2 are always power-of-2.

  Compression algorithm #0 will mean no compression and no zero-checking.
  Compression algorithm #1 will mean zero-checking but no other compression.
  Real compression will be supported starting with algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, will be supported as compression algorithm #1.
  This allows the writing of 0's to create holes and will be the default
  compression algorithm for HAMMER2.

* Copies support for redundancy.  The media blockref structure would
  have become too bloated but I found a clean way to do copies using the
  blockset structure (which is a set of 8 fully associative blockref's).

  The design is such that the filesystem should be able to function at
  full speed even if disks are pulled or inserted, as long as at least one
  good copy is present.  A background task will be needed to resynchronize
  missing copies (or remove excessive copies in the case where the copies
  value is reduced on a live filesystem).

* Intended to be clusterable, with a multi-master protocol under design
  but not expected to be fully operational until mid-2013.  The media
  format for HAMMER1 was less condusive to logical clustering than I had
  hoped so I was never able to get that aspect of my personal goals
  working with HAMMER1.  HAMMER2 effectively solves the issues that cropped
  up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
  file/directory hierarchy, making cache coherency very difficult).

* Hardlinks will be supported.  All other standard features will be supported
  too of course.  Hardlinks in this sort of filesystem require significant
  work.

* The media blockref structure is now large enough to support up to a 192-bit
  check value, which would typically be a cryptographic hash of some sort.
  Multiple check value algorithms will be supported with the default being
  a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Non-verified de-duplication will be supported as a configurable option on
  a file or subdirectory tree.  Non-verified deduplication would use the
  largest available check code (192 bits) and not bother to verify data
  matches during the dedup pass, which is necessary on extremely large
  filesystems with a great deal of deduplicable data (as otherwise a large
  chunk of the media would have to be read to implement the dedup).

  This feature is intended only for those files where occassional corruption
  is ok, such as in a large data store of farmed web content.

				GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by referencing an
existing block, and because the media structures logically match a standard
filesystem directory/file hierarchy snapshots and other similar operations
can be trivially performed on an entire subdirectory tree at any level in
the filesystem.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This forms
the basis for crash recovery.  All disk writes are to new blocks except for
the volume header, thus allowing all writes to run concurrently except for
the volume header update at the end.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the cache can be flushed at
any time WITHOUT having to allocate yet another new block when further
modifications are made as long as the volume header has not yet been flushed.
This means that buffer cache overhead is very well bounded and can handle
filesystem operations of any complexity even on boxes with very small amounts
of physical memory.

I intend to implement a shortcut to make fsync()'s run fast, and that is to
allow deep updates to blockrefs to shortcut to auxillary space in the
volume header to satisfy the fsync requirement.  The related blockref is
then recorded when the filesystem is mounted after a crash and the update
chain is reconstituted when a matching blockref is encountered again during
normal operation of the filesystem.

Basically this means that no real work needs to be done at mount-time
even after a crash.

Directories are hashed, and another major design element is that directory
entries ARE INODES.  They are one and the same.  In addition to directory
entries being inodes the data for very small files (512 bytes or smaller)
can be directly embedded in the inode (overloaded onto the same space that
the direct blockref array uses).  This should result in very high
performance.

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolutely necessary feature in order to support the
hugely flexible snapshots that we want to have in HAMMER2.

				  HARDLINKS

Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
a spatial reference to the inode number.  We do not want to have to have
an index of inode numbers for any basic HAMMER2 feature if we can help it.

Hardlinks are handled by placing the inode for a multiply-hardlinked file
in the closest common parent directory.  If "a/x" and "a/y" are hardlinked
the inode for the hardlinked file will be placed in directory "a", e.g.
"a/3239944", but it will be invisible and will be in an out-of-band namespace.
The directory entries "a/x" and "a/y" will be given the same inode number
but in fact just be placemarks that cause HAMMER2 to recurse upwards through
the directory tree to find the invisible inode number.

Because directories are hashed and a different namespace (hash key range)
is used for hardlinked inodes, standard directory scans are able to trivially
skip this invisible namespace and inode-specific lookups can restrict their
lookup to within this space.

The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
trivial.  Basically the inode media structure is copied as needed to break-up
or re-form the standard directory entry/inode.  There are no backpointers in
HAMMER2 and no reference counts on the blocks (see FREEMAP NOTES below), so
it is an utterly trivial operation.

				FREEMAP NOTES

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  The freemap which
is still under design just won't do that.  All the freemap does is
keep track of 100% free blocks.

This not only trivializes all the snapshot features it also trivializes
hardlink handling and solves the problem of keeping the freemap sychronized
in the event of a crash.  Now all we have to do after a crash is make
sure blocks allocated before the freemap was flushed are properly
marked as allocated in the allocmap.  This is a trivial exercise using the
same algorithm the mirror streaming code uses (which is very similar to
HAMMER1)... an incremental meta-data scan that covers only the blocks that
might have been allocated between the last allocation map sync and now.

Thus the freemap does not have to be synchronized during a fsync().

The complexity is in figuring out what can be freed... that is, when one
can mark blocks in the freemap as being free.  HAMMER2 implements this as
a background task which essentially must scan available meta-data to
determine which blocks are not being referenced.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, its an area that we
can continue to improve on as the filesystem matures.  Not only that, but
we can completely change the freemap algorithms without creating
incompatibilities (at worse simply having to require that a R+W mount do
a full meta-data scan when upgrading or downgrading the freemap algorithm).

				  CLUSTERING

Clustering, as always, is the most difficult bit but we have some advantages
with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
structures generally follow the kernel's filesystem hiearchy.  Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The general mechanics for most of the multi-master clustering implementations
will be as follows:

    (a) Use the copies mechanism to specify all elements of the cluster,
	both local and remote (networked).

    (b) The core synchronization state operates just as it does for copies,
	simply requiring a fully-flushed ack from the remote in order to
	mark the blocks as having been fully synchronized.

	The mirror_tid may be used to locate these blocks, allowing the
	synchronization state to be updated on the fly at a much later
	time without requiring the state to be maintained in-memory.
	(also for crash recovery resynchronization purposes).

    (c) Data/meta-data can be retrieved from those copies which are marked
	as being synchronized, with priority given to the local storage
	relative to any given physical machine.

	This means that e.g. even in a master-slave orientation the slave
	may be able to satisfy a request from a program when the slave
	happens to be the local storage.

    (d) Transaction id synchronization between all elements of the cluster,
	typically through masking (assigning a cluster number using the low
	3 bits of the transaction id).

    (e) General access (synchronized or otherwise) may require cache
	coherency mechanisms to run over the network.

	Implementing cache coherency is a major complexity issue.

    (f) General access (synchronized or otherwise) may require quorum
	agreement, using the synchronization flags in the blockrefs
	to determine whether agreement has been reached.

	Implementing quorum voting is a major complexity issue.

There are lots of ways to implement multi-master environments using the
above core features but the implementation is going to be fairly complex
even with HAMMER2's feature set.

Keep in mind that modifications propagate all the way to the super-root
and volume header, so in any clustered arrangement the use of (modify_tid)
and (mirror_tid) is critical in determining the synchronization state of
portion(s) of the filesystem.

Specifically, since any modification propagates to the root the (mirror_tid)
in higher level directories is going to be in a constant state of flux.  This
state of flux DOES NOT invalidate the cache state for these higher levels
of directories.  Instead, the (modify_tid) is used on a node-by-node basis
to determine cache state at any given level, and (mirror_tid) is used to
determine whether any recursively underlying state is desynchronized.

* Simple semi-synchronized multi-master environment.

    In this environment all nodes are considered masters and modifications
    can be made on any of them, and then propagate to the others
    asynchronously via HAMMER2 mirror streams.  One difference here is
    that kernel can activate these userland-managed streams automatically
    when the copies configuration is used to specify the cluster.

    The only type of conflict which isn't readily resolvable by comparing
    the (modify_tid) is when file data is updated.  In this case user
    intervention might be required but, theoretically, it should be
    possible to automate most merges using a multi-way patch and, if not,
    choosing one and creating backup copies if the others to allow the
    user or sysop to resolve the conflict later.

* Simple fully synchronized fail-over environment.

    In this environment there is one designated master and the remaining
    nodes are slaves.  If the master fails all remaining nodes agree on a
    new master, possibly with the requirement that a quorum be achieved
    (if you don't want to allow the cluster to split).

    If network splits are allowed the each sub-cluster operates in this
    mode but recombining the clusters reverts to the first algorithm.
    If not allowed whomever no longer has a quorum will be forced to stall.

    In this environment the current designated master is responsible for
    managing locks for modifying operations.  The designated master will
    proactively tell the other nodes to mark the blocks related to the
    modifying operation as no longer being synchronized while any local
    data at the node that acquired the lock (master or slave) remains
    marked as being synchronized.

    The node that succesfully gets the lock then issues the modifying
    operation to both its local copy and to the master, marking the
    master as being desynchronized until the master acknowledges receipt.

    In this environment any node can access data from local storage if
    the designated master copy is marked synchronized AND its (modify_tid)
    matches the slave copy's (modify_tid).

    However, if a slave disconnects from the master then reconnects the
    slave will have lost the master's desynchronization stream and must
    mark its root blockref for the master copy HAMMER2_BREF_DESYNCHLD as
    well as clear the SYNC1/SYNC2 bits.  Setting DESYNCCHLD forces on-demand
    recursive reverification that the master and slave are (or are not) in
    sync in order to reestablish on the slave the synchronization state of
    the master.

    That might be a bit confusing but the whole point here is to allow
    read accesses to the filesystem to be satisfied by any node in a
    multi-master cluster, not just by the current designated master.

* Fully cache coherent and synchronized multi-master environment.

    In this environment a quorum is required to perform any modifying
    action.  All nodes are masters (there is no 'designated' master)
    and all nodes connect to all other nodes in a cross-bar.

    The quorum is specified by copies setup in the root volume configuration.
    A quorum of nodes in the cluster must agree on the copies configuration.
    If they do not the cluster cannot proceed to mount.  Any other nodes
    not in the quorum which are in the cluster which disagree with the
    configuration will inherit the copies configuration from the quorum.

    Any modifying action will initiate a lock request locally to all nodes
    in the cluster.  The modifying action is allowed to proceed the instant
    a quorum of nodes respond in the affirmative (even if some have not
    yet responded or are down).  The modifying action is considered complete
    once the two-phase commit protocol succeeds.  The modifying action
    typically creates and commits a temporary snapshot on at least a quorum
    of masters as phase-1 and then ties the snapshot back into the main
    mount as phase-2.

    These locks are cache-coherency locks and may be passively maintained
    in order to aggregate multiple operations under the same lock and thus
    under the same transaction from the point of view of the rest of the
    quorum.

    A lock request which interferes with a passively maintained lock will
    force the two-phase commit protocol to complete and then transfer
    ownership to the requesting entity, thus avoiding having to deal with
    deadlock protocols at this point in the state machine.

    Since any node can initiate concurrent lock requests to many other nodes
    it is possible to deadlock.  When two nodes initiate conflicting lock
    requests to the cluster the one achieving the quorum basically wins and
    the other is forced to retry (go back one paragraph).  In this situation
    no deadlock will occur.

    If three are more nodes initiate conflicting lock requests to the
    cluster a deadlock can occur whereby none of the nodes achieve a quorum.
    In this case every node will know which of the other nodes was granted
    the lock(s).  Deadlock resolution then proceeds simultaniously on the
    three nodes (since they have the same information), whereby the lock
    holders on the losing end of the algorithm transfer their locks to one
    of the other nodes.  The lock state and knowledge of the lock state is
    updated in real time on all nodes until a quorum is achieved.

* Fully cache coherent and synchronized multi-master environment with
  passive read locking.

    This is a more complex form of clustering than the previous form.
    Take the previous form and add the ability to passively hold SHARED
    locks in addition to the EXCLUSIVE locks the previous form is able
    to hold.

    The advantage of being able to passively hold a shared lock on a sub-tree
    (locks can be held on single nodes or entire sub-trees) is that it is
    then possible for all nodes to validate a node (modify_tid) or entire
    sub-tree (mirror_tid) with a very short network transaction and then
    satisfy a large number of requests from local storage.

* Fully cache coherent and synchronized multi-master environment with
  passive read locking and slave-only nodes.

    This is the MOST complex form of clustering we intend to support.
    In a multi-master environment requiring a quorum of masters to operate
    we implement all of the above plus ALSO allow additional nodes to be
    added to the cluster as slave-only nodes.

    The difference between a slave-only node and setting up a manual
    mirror-stream from the cluster to a read-only snapshot on another
    HAMMER2 filesystem is that the slave-only node will be fully
    cache coherent with either the cluster proper (if connected to a quorum
    of masters), or to one or more other nodes in the cluster (if not
    connected to a quorum of masters), EVEN if the slave itself is not
    completely caught up.

    So if the slave-only cluster node is connected to the rest of the cluster
    over a slow connection you basically get a combination of local disk
    speeds for any data that is locally in sync and network-limited speeds
    for any data that is not locally in sync.

    slave-only cluster nodes run a standard mirror-stream in the background
    to pull in the data as quickly as possible.

    This is in constrast to a manual mirror-stream to a read-only
    snapshot (basically a simple slave), which has no ability to bypass
    the local storage to handle out-of-date requests (in fact has no ability
    to detect that the local storage is out-of-date anyway).

						-Matt