HAMMER update 06-Feb-2008
dillon at apollo.backplane.com
Wed Feb 6 14:02:49 PST 2008
Work continues to progress well but I've hit a few snags:
* My current recovery model is just not working for cross-cluster
* I can't fix the write performance issue due to the way the cluster
localization and spike code works.
* Handling nearly full filesystems is proving to be a problem due to
the inability to estimate space requirements which in turn are
related to the cluster localization the current model uses.
* Efficient real-time mirroring is just not easy to do w/ the current
Everything else now works, and works well, including and most especially
the historical access features. I've already fallen in love with being
able to create a softlink to a '@@0x<timestamp>' string to access
I've come to the conclusion that I am going to have to make a fairly
radical change to the on-disk structures to solve these problems. On
the plus side, these changes will greatly simplify the filesystem topology
and greatly reduce its complexity. On the negative side, recovering
space will not be instantanious and will basically require data to
be copied from one part of the filesystem to another. My take is that
since large filesystems (our target) do not usually fill up very quickly,
and indeed it can take hours even if you are writing 60MB/sec to the disk,
It is well worth taking a hit on free space recovery if it accomplishes
all of the filesystem's other goals.
This may seem a bit radical as solutions go, but the more I look at it
the more I like it because it really does neatly tie up every single
* Get rid of super-clusters, clusters, and typed buffers. Get rid of them
entirely. Just have an array of buffers after the volume header.
* Get rid of the A-list's (the recursive bitmap radix tree allocation
model). Poof, gone. All of them, gone. Have no random block
allocation model at all.
* Get rid of the per-cluster B-Tree's. Instead have just one global
B-Tree for the entire filesystem, able to access any record anywhere
in the filesystem (currently a per-cluster B-Tree can only access
information within that cluster).
* Implement the filesystem as one big huge circular FIFO, pretty much
laid down linearly on disk, with a B-Tree to locate and access data.
* Random modifications (required for manipulating the B-Tree and marking
records as deleted) will append undo records to this FIFO and the only
write ordering requirement will be that the buffers containing these
undo records be committed before the buffers containing the random
modifications are committed.
This completely solves the crash recovery issue, regardless of the
size of a transaction. If you crash half way through doing a 200MB
write() it will be able to unwind the whole thing.
This (plus the B-Tree and layout changes) completely solves the
write performance issue.
* Physical recovery of 'disk space' in the filesystem requires advancing
the circular FIFO's beginning index, which means copying data we
wish to retain from the front of the FIFO to the end and discarding
data at the head that we wish to destroy.
This is the one big gotcha to the model, but the key point here is
that the copying can occur off-peak, or whenever disk bandwidth is
available, and can operate on large linear swaths of disk at once
(with some random async writes for B-Tree fixups).
* Since information is laid down linearly, on-disk space can be reserved
for in-memory buffer and record caches, solving the issue of how
to deal with nearly full filesystems.
* Ultimately physical recovery can be optimized by introducing a
large-block allocation model 'behind' the circular FIFO abstraction.
This will solve the issue of dealing with bad blocks on the disk.
* The filesystem can be recovered from horrible data loss basically by
doing a copy-scan to first remove all auxillary data (like B-Tree nodes
and older undo records that are no longer needed), and then regenerating
the B-Tree from scratch for the entire filesystem.
This would of course take a long time but it would only be needed to
recover data after a head crash, where physical corruption might be
* This also neatly solves real-time mirroring issues. A real-time mirror
can simply track the FIFO linearly, and can record its FIFO index if
interrupted to pick up where it left off later on. You can't get
much better then that.
I had given up on efficient real time mirroring when I settled on the
localized cluster model. Now its back.
It will probably take a week or two to rewire the topology and another
week or two to debug it. Despite the massive rewiring, the new model
is much, MUCH simpler then the old, and all the B-Tree code is retained
(just extended to operate across the entire filesystem instead of just
within a single cluster).
<dillon at backplane.com>
More information about the Kernel