HAMMER update 10-Feb-2008
Matthew Dillon
dillon at apollo.backplane.com
Sun Feb 10 19:24:17 PST 2008
:So -- in this instance you are both the establishment and the revolutionary at
:the same time. Could you explain in extra-bonehead language what problems you
:were solving with the cluster model, and if you are still solving those problems
:with the newer model?
:
:Thanks!
The three major pieces didn't integrate together well enough. It's
really that simple. Individually they were great. Together they
were not.
* Small (64MB) clusters with localized B-Trees.
Plus: Recovering the filesystem from a massive failure was not difficult.
Minus: Data and records cannot roam about; they have to be placed in
the particular cluster covering their key.
* Fine-grained A-list allocator.
Plus: Fine-grained allocation and freeing.
Minus: It didn't help the problem of dead space in the cluster, or
for that matter help the issue of dealing with full clusters,
because particular records and data had to go into particular
clusters.
* Per-cluster recovery mechanic.
Plus: Very easy to recover a fixed portion of the B-Tree.
Minus: Couldn't cross cluster boundaries. Couldn't deal with
transactions that crossed cluster boundaries. Required strict
write ordering and required clusters to be marked 'open' (virtually
synchronously, it turned out) just to detect that recovery was needed.
I still would have needed to implement an UNDO FIFO on top of
everything else, and despite the per-buffer locality of reference
the model had, it still made an excessive number of little modifications
all over the buffer, which would have made the UNDO FIFO rather
expensive.
So I had a fine-grained allocator that couldn't contribute to solving
the issue of what to do when clusters were too full or too empty, a
recovery mechanic that couldn't guarantee consistency for transactions
which spanned more than one cluster, and clusters that wound up being
too small (64MB) and caused management I/O to become too dispersed.
Add to that the fact that the model required an excessive number of
special cases in the code. Do a cvs diff of just the B-Tree and cursor
code from before I ripped it out to now and you will see the new code
is much, much cleaner, with virtually no special cases.
--
The new model works a lot better. Fine-grained allocation through
what will effectively be a sequential (blockmap) index. Extremely
good localization and packing of data (even better than the cluster
model had, and the cluster model was pretty good in that regard).
The ability (due to the blockmap) to rearrange the physical location
of blocks of nodes, records, and data, in 8 MB chunks, with no effort.
Coarse-grained freeing which is extremely fast (emphasis on extremely...
it is ultra fast), but is still able to detect the 'all free' case
for a big-block, for some immediate gratification when deleting or
pruning large amounts of material.
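
To illustrate the blockmap idea, here is a simplified sketch (the names
and layout are made up for illustration, not the actual HAMMER on-disk
structures): a logical offset is translated through one entry per 8 MB
big-block, so relocating a chunk only means updating that entry.

/*
 * Simplified sketch of a blockmap translation (hypothetical names and
 * layout).  A logical offset is translated through a per-big-block
 * entry, so an 8 MB chunk of B-Tree nodes, records, or data can be
 * moved by updating a single entry.
 */
#include <stdint.h>

#define BIGBLOCK_SIZE   (8 * 1024 * 1024)       /* 8 MB big-blocks */
#define BIGBLOCK_MASK   (BIGBLOCK_SIZE - 1)

struct blockmap_entry {
        uint64_t phys_offset;   /* 8MB-aligned physical offset of big-block */
        uint32_t bytes_free;    /* coarse free count within the big-block */
        uint32_t flags;
};

/*
 * Translate a logical offset to a physical offset.  Only the big-block
 * index is looked up; the low 23 bits pass through unchanged.
 */
static uint64_t
blockmap_lookup(const struct blockmap_entry *map, uint64_t logical_off)
{
        const struct blockmap_entry *entry = &map[logical_off / BIGBLOCK_SIZE];

        return (entry->phys_offset | (logical_off & BIGBLOCK_MASK));
}

/*
 * Relocating an 8 MB chunk: copy the big-block elsewhere, then point
 * its blockmap entry at the new physical location.
 */
static void
blockmap_relocate(struct blockmap_entry *map, uint64_t logical_off,
                  uint64_t new_phys_offset)
{
        map[logical_off / BIGBLOCK_SIZE].phys_offset = new_phys_offset;
}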
For the UNDO FIFO, which I would have had to write anyway, the new
model generates far fewer dispersed modifications, resulting in
a smaller UNDO stream. A small UNDO stream reduces the number of
buffer dependencies to, in most cases, not more than one (i.e. 25 buffers
may need 1 buffer to be committed to disk before they can be committed).
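
As a rough sketch of what that dependency means (hypothetical record
layout and names, not the real on-media format): each UNDO record saves
a pre-image, and a dirty meta-data buffer may only be flushed once the
FIFO has been flushed past the records covering its modifications.

/*
 * Minimal sketch of an UNDO FIFO record (hypothetical layout).
 * Recovery replays the FIFO to back out partially committed changes.
 */
#include <stdint.h>

struct undo_record {
        uint64_t data_offset;   /* media offset of the modified bytes */
        uint32_t data_bytes;    /* size of the saved pre-image */
        uint32_t seqno;         /* position in the FIFO */
        /* followed by data_bytes of the pre-modification image */
};

/*
 * A dirty meta-data buffer may be flushed only after the UNDO records
 * covering its modifications are on disk.  Because modifications stay
 * localized, many dirty buffers typically depend on one UNDO buffer.
 */
static int
buffer_can_flush(uint32_t undo_flushed_seqno, uint32_t buffer_undo_seqno)
{
        return (undo_flushed_seqno >= buffer_undo_seqno);
}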
The core allocator, next on my list to write, is literally just a
big-block allocator, dealing with large 8M blocks and not having to
deal with any fine-grained issues within those blocks. You can't get
much more straightforward than that!
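
A minimal sketch of what such an allocator boils down to (an assumed
interface for illustration, not the actual freemap code): one bit per
8M big-block and nothing finer.

/*
 * Sketch of a big-block-only allocator: one bit per 8M big-block, set
 * when the big-block is allocated.  No fine-grained bookkeeping is
 * done inside a big-block at this layer.
 */
#include <stdint.h>

#define BIGBLOCK_SIZE   (8ULL * 1024 * 1024)

struct bigblock_allocator {
        uint8_t  *bitmap;       /* one bit per big-block, 1 = allocated */
        uint64_t  nblocks;      /* total number of big-blocks */
};

/* Allocate one 8M big-block; returns its physical offset, or -1 if full. */
static int64_t
bigblock_alloc(struct bigblock_allocator *ba)
{
        uint64_t i;

        for (i = 0; i < ba->nblocks; ++i) {
                if ((ba->bitmap[i >> 3] & (1 << (i & 7))) == 0) {
                        ba->bitmap[i >> 3] |= (1 << (i & 7));
                        return ((int64_t)(i * BIGBLOCK_SIZE));
                }
        }
        return (-1);
}

/* Freeing is just as coarse: clear the bit covering the big-block. */
static void
bigblock_free(struct bigblock_allocator *ba, uint64_t phys_offset)
{
        uint64_t i = phys_offset / BIGBLOCK_SIZE;

        ba->bitmap[i >> 3] &= ~(1 << (i & 7));
}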
-Matt
Matthew Dillon
<dillon at backplane.com>