HAMMER update 06-Feb-2008

Wed Feb 6 14:02:49 PST 2008

    Work continues to progress well but I've hit a few snags:

    * My current recovery model is just not working for cross-cluster
      transactions.

    * I can't fix the write performance issue due to the way the cluster
      localization and spike code works.

    * Handling nearly full filesystems is proving to be a problem due to
      the inability to estimate space requirements which in turn are
      related to the cluster localization the current model uses.

    * Efficient real-time mirroring is just not easy to do w/ the current
      cluster-localized model.

    Everything else now works, and works well, including and most especially
    the historical access features.  I've already fallen in love with being
    able to create a softlink to a '@@0x<timestamp>' string to access
    snapshots.

    I've come to the conclusion that I am going to have to make a fairly
    radical change to the on-disk structures to solve these problems.  On
    the plus side, these changes will greatly simplify the filesystem topology
    and greatly reduce its complexity.  On the negative side, recovering
    space will not be instantanious and will basically require data to
    be copied from one part of the filesystem to another.  My take is that
    since large filesystems (our target) do not usually fill up very quickly,
    and indeed it can take hours even if you are writing 60MB/sec to the disk,
    It is well worth taking a hit on free space recovery if it accomplishes
    all of the filesystem's other goals.

    This may seem a bit radical as solutions go, but the more I look at it
    the more I like it because it really does neatly tie up every single 
    remaining issue.

    * Get rid of super-clusters, clusters, and typed buffers.  Get rid of them
      entirely.  Just have an array of buffers after the volume header.

    * Get rid of the A-list's (the recursive bitmap radix tree allocation
      model).  Poof, gone.  All of them, gone.  Have no random block
      allocation model at all.

    * Get rid of the per-cluster B-Tree's.  Instead have just one global
      B-Tree for the entire filesystem, able to access any record anywhere
      in the filesystem (currently a per-cluster B-Tree can only access
      information within that cluster).

    * Implement the filesystem as one big huge circular FIFO, pretty much
      laid down linearly on disk, with a B-Tree to locate and access data.

    * Random modifications (required for manipulating the B-Tree and marking
      records as deleted) will append undo records to this FIFO and the only
      write ordering requirement will be that the buffers containing these
      undo records be committed before the buffers containing the random 
      modifications are committed.

      This completely solves the crash recovery issue, regardless of the
      size of a transaction.  If you crash half way through doing a 200MB
      write() it will be able to unwind the whole thing.

      This (plus the B-Tree and layout changes) completely solves the
      write performance issue.

    * Physical recovery of 'disk space' in the filesystem requires advancing
      the circular FIFO's beginning index, which means copying data we
      wish to retain from the front of the FIFO to the end and discarding
      data at the head that we wish to destroy.

      This is the one big gotcha to the model, but the key point here is
      that the copying can occur off-peak, or whenever disk bandwidth is
      available, and can operate on large linear swaths of disk at once
      (with some random async writes for B-Tree fixups).

    * Since information is laid down linearly, on-disk space can be reserved
      for in-memory buffer and record caches, solving the issue of how
      to deal with nearly full filesystems.

    * Ultimately physical recovery can be optimized by introducing a
      large-block allocation model 'behind' the circular FIFO abstraction.

      This will solve the issue of dealing with bad blocks on the disk.

    * The filesystem can be recovered from horrible data loss basically by
      doing a copy-scan to first remove all auxillary data (like B-Tree nodes
      and older undo records that are no longer needed), and then regenerating
      the B-Tree from scratch for the entire filesystem.

      This would of course take a long time but it would only be needed to
      recover data after a head crash, where physical corruption might be
      present.

    * This also neatly solves real-time mirroring issues.  A real-time mirror
      can simply track the FIFO linearly, and can record its FIFO index if
      interrupted to pick up where it left off later on.  You can't get
      much better then that.

      I had given up on efficient real time mirroring when I settled on the
      localized cluster model.  Now its back.

    It will probably take a week or two to rewire the topology and another
    week or two to debug it.  Despite the massive rewiring, the new model
    is much, MUCH simpler then the old, and all the B-Tree code is retained
    (just extended to operate across the entire filesystem instead of just
    within a single cluster).

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>