HAMMER update 10-Feb-2008

Sun Feb 10 19:24:17 PST 2008

:So -- in this instance you are both the establishment and the revolutionary at
:the same time.  Could you explain in extra-bonehead language what problems you
:were solving with the cluster model, and if you are still solving those problems
:with the newer model?
:
:Thanks!

    The three major pieces didn't integrate together well enough.  It's
    really that simple.  Individually they were great.  Together they
    were not.

    * Small (64MB) clusters with localized B-Trees.

      Plus: Recovering the filesystem from a massive failure not difficult.

      Minus: Data and records cannot roam about, they have to be placed in
      the particular cluster covering their key.

    * Fine-grained A-list allocator.

      Plus: Fine grained allocation and freeing.

      Minus: It didn't help the problem of dead space in the cluster, or
      for that matter help the issue of dealing with full clusters,
      because particular records and data had to go into particular
      clusters.

    * Per-cluster recovery mechanic.

      Plus: Very easy to recover a fixed portion of the B-Tree.

      Minus: Couldn't cross cluster boundaries.  Couldn't deal with
      transactions that crossed cluster boundaries.  Requires strict
      write ordering and for clusters to be marked 'open' (virtually
      synchronously it turned out) to even detect that recovery was needed.

      I still would have needed to implement an UNDO FIFO on top of
      everything else, and despite the per-buffer locality of reference
      the model had it still made an excessive number of little modifications
      all over the buffer that would have made the UNDO FIFO rather
      expensive.

    So I had a fine-grained allocator that couldn't contribute to solving
    the issue of what to do when clusters were too full or too empty, a
    recovery mechanic that couldn't guarantee consistency for transactions
    which spanned more then one cluster, and clusters that wound up being
    too small (64MB) and caused management I/O to become too dispersed.

    Add to that the fact that the model required an excessive number of
    special cases in-code.  Do a cvs diff just of the B-Tree and cursor
    code from before I ripped it out to now and you will see the new code
    is much, much cleaner, with virtually no special cases.

    --

    The new model works a lot a better.  Fine-grained allocation through
    what will effectively be a sequential (blockmap) index.  Extremely
    good localization and packing of data (even better then the cluster
    model had, and the cluster model was pretty good in that regard).
    The ability (due to the blockmap) to rearrange the physical location
    of blocks of nodes, records, and data, in 8 MB chunks, with no effort.
    Coarse grained freeing which is extremely fast (emphasis on extremely..
    it is ultra fast), but is still able to detect the 'all free' case
    for a big-block, for some immediate gratification when deleting or
    pruning large amounts of material.

    For the UNDO FIFO, which I would have had to write anyway, the new
    model generates far fewer dispersed modifications, resulting in
    a smaller UNDO stream.  A small UNDO stream reduces the number of 
    buffer dependancies to, in most cases, not more then one (i.e. 25 buffers
    may need 1 buffer to be committed to disk before they can be committed).

    The core allocator, next on my list to write, is literally just a
    big-block allocator, dealing with large 8M blocks and not having to
    deal with any fine-grained issues within those blocks.  You can't get
    much more straightforward then that!

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>