[Tux3] Comparison to Hammer fs design
Matthew Dillon
dillon at apollo.backplane.com
Fri Jul 25 12:01:38 PDT 2008
:Hi;
:
:The announcement of yet another filesystem:
:
:http://lkml.org/lkml/2008/7/23/257
:
:led to some comments about hammer fs:
:
:http://tux3.org/pipermail/tux3/2008-July/000006.html
:
:enjoy,
:
: Pedro.
Those are interesting comments. I think I found Daniel's email address,
so I am adding him to the To: line. Dan, feel free to post this on your
Tux groups if you want.
I did consider multiple-parentage... that is the ability to have a
writable snapshot that 'forks' the filesystem history. It would be
an ultra cool feature to have but I couldn't fit it into the B-Tree
model I was using. Explicit snapshotting would be needed to make it
work, and the snapshot id would have to be made part of the B-Tree key,
which is fine. HAMMER is based around implicit snapshots (being able
to do an as-of historical lookup without having explicitly snapshotted
bits of the history). I would caution against using B-Tree iterations
in historical lookups: B-Trees are only fast if you can pretty much
zero in on the exact element you want as part of the primary search
mechanic. Once you have to iterate or chain, performance goes out the
window. I know this because there are still two places where HAMMER
sometimes has to iterate due to the inexact nature of an as-of lookup.
Multiple-parentage almost certainly requires two inequality searches,
or one inequality search and an iteration. A single timeline only
requires one inequality search.
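To illustrate what I mean, in rough C with illustrative field names
(this is not HAMMER's actual on-disk structure), the single-timeline
as-of case boils down to seeking on a key that embeds the creation
transaction id:

    #include <stdint.h>

    /* Illustrative key layout, not HAMMER's real structures. */
    struct btkey {
        uint64_t obj_id;        /* object (inode) id                  */
        uint64_t key;           /* file offset / directory hash       */
        uint64_t create_tid;    /* transaction id creating the record */
    };

    /*
     * Single-timeline as-of lookup: seek to (obj_id, key, asof) and take
     * the element at or just before that point, i.e. the record with the
     * greatest create_tid <= asof.  One inequality search, no iteration
     * in the common case.
     */
    static int
    btkey_cmp(const struct btkey *a, const struct btkey *b)
    {
        if (a->obj_id != b->obj_id)
            return (a->obj_id < b->obj_id ? -1 : 1);
        if (a->key != b->key)
            return (a->key < b->key ? -1 : 1);
        if (a->create_tid != b->create_tid)
            return (a->create_tid < b->create_tid ? -1 : 1);
        return (0);
    }

A forked history would additionally have to work out which branch a
given create_tid belongs to, which is where the second inequality
search (or the iteration) comes from.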
I couldn't get away from having a delete_tid (the 'death version
numbers'). I really tried :-) There are many cases where one is only
deleting, rather than overwriting. Both numbers are then needed to
be able to properly present a historical view of the filesystem.
For example, when one is deleting a directory entry a snapshot after
that deletion must show the directory entry gone. Both numbers are
also needed to be able to properly prune out unwanted historical data
from the filesystem. HAMMER's pruning algorithm (cleaning out old
historical data which is no longer desired) creates holes in the
sequence, so once you start pruning, the delete_tid of a prior record
will not match the create_tid of the following one (historically
speaking).
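In rough terms (again illustrative, not the real record layout), the
visibility test that both numbers support looks like this, with
delete_tid == 0 standing in for "still live":

    #include <stdint.h>

    /* Illustrative as-of visibility test; delete_tid == 0 means live. */
    static int
    record_visible(uint64_t create_tid, uint64_t delete_tid, uint64_t asof)
    {
        if (create_tid > asof)
            return (0);     /* record did not exist yet at that time    */
        if (delete_tid != 0 && delete_tid <= asof)
            return (0);     /* record was already gone at that time     */
        return (1);
    }

Because the death version is stored explicitly, pruning out a span of
history leaves the surviving records self-describing; nothing has to be
inferred from the create_tid of the next record in the sequence.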
At one point I did collapse the holes, rematching the delete_tid with
the create_tid of the following historical record, but I had to remove
that code because it seriously complicated the mirroring implementation.
I wanted each mirror to be able to have its own historical retention
policy independent of the master, e.g. so you could mirror to a backup
system which would retain a longer and more coarse-grained history than
the production system.
--
I also considered increasing the B-Tree fan-out to 256 but decided
against it because insertions and deletions really bloated up the
UNDO FIFO. Frankly I'm still undecided as to whether that was a good
idea; I would have preferred 256. I can actually compile in 256 by
changing a #define, but I released with 64 because I hit up against
a number of performance issues: bcopy() overheads for insertions
and deletions in certain tests became very apparent. Locking
conflicts became a much more serious issue because I am using whole-node
locks rather than element locks. And, finally, the UNDO records got
really bloated. I would need to adjust the node locking and UNDO
generation code significantly to remove the bottlenecks before I could
go to a 256-element B-Tree node.
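To make the bcopy()/UNDO issue concrete, here is a rough sketch (toy
code, not the actual HAMMER insertion path) of why both costs scale
with the fan-out:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FANOUT  64                      /* would be 256 in the bigger node  */

    struct btelm { uint64_t pad[8]; };      /* 64-byte element, contents elided */

    struct btnode {
        int          count;
        struct btelm elms[FANOUT];
    };

    /*
     * Inserting into a sorted node shifts everything after the insertion
     * point, and the shifted region also has to be covered by UNDO.  With
     * 64-byte elements that is up to ~4KB per insertion at a fan-out of
     * 64 and up to ~16KB at a fan-out of 256.  Caller has already split
     * the node if it was full.
     */
    static void
    node_insert(struct btnode *node, int index, const struct btelm *elm)
    {
        size_t tail = (size_t)(node->count - index) * sizeof(*elm);

        memmove(&node->elms[index + 1], &node->elms[index], tail);
        node->elms[index] = *elm;
        node->count++;
    }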
HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
another limiting factor for the fan-out I can use. My elements
are 64 bytes each. 64x64 = 4K per B-Tree node. I decided to go
with fully expanded keys (64-bit object id, 64-bit file-offset/db-key,
64-bit create_tid, 64-bit delete_tid), plus a 64-bit storage offset and
other auxiliary info. That's instead of using a radix-compressed key.
Radix compressed keys would have doubled the complexity of the B-Tree
code, particularly with the middle-of-tree pointer caching that
HAMMER does.
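Back-of-the-envelope, the fully expanded element looks something like
this (an approximation for illustration; the real structure differs in
detail):

    #include <stdint.h>

    /* Approximate 64-byte fully expanded element, for illustration only. */
    struct btelm64 {
        uint64_t obj_id;        /* object id                           */
        uint64_t key;           /* file offset / db key                */
        uint64_t create_tid;    /* creation transaction id             */
        uint64_t delete_tid;    /* deletion transaction id (0 = live)  */
        uint64_t data_offset;   /* storage offset of the data          */
        uint8_t  aux[24];       /* record type, crc, etc.              */
    };                          /* 64 bytes -> 64 elements per 4K node */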
The historical access mechanic alone added major complexity to the
B-Tree algorithms. I will note here that B-Tree searches have a very
high cpu overhead no matter how you twist it, and being able to cache
cursors within the B-Tree is absolutely required if you want the
filesystem to perform well. If you always start searches at the root,
your cpu overhead will be horrendous... so plan on caching cursors
from the get-go.
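The pattern, boiled down to a toy (a flat sorted leaf chain standing in
for a real tree, with hypothetical names), is simply: try the cached
position first and fall back to a full descent only when it no longer
applies:

    #include <stddef.h>
    #include <stdint.h>

    #define LEAF_ELMS 64

    /* Toy leaf chain standing in for the bottom level of the tree. */
    struct leaf {
        struct leaf *next;
        int          count;
        uint64_t     keys[LEAF_ELMS];       /* sorted; keys[0] is the low bound */
    };

    struct cursor {
        struct leaf *leaf;                  /* last leaf visited, or NULL */
    };

    /*
     * Resume from the cached leaf when the key cannot lie to its left;
     * otherwise restart from the head (the "root" in this toy).  Localized
     * and sequential lookups hit the cached path nearly every time.
     */
    static struct leaf *
    lookup(struct leaf *head, struct cursor *cur, uint64_t key)
    {
        struct leaf *lp = head;

        if (cur->leaf != NULL && cur->leaf->count > 0 &&
            key >= cur->leaf->keys[0])
            lp = cur->leaf;

        while (lp->next != NULL && lp->next->count > 0 &&
               key >= lp->next->keys[0])
            lp = lp->next;

        cur->leaf = lp;                     /* re-cache for the next lookup */
        return (lp);
    }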
If I were to do radix compression I would also want to go with a
fully dynamic element size and fully dynamic fan-out in order to
best-compress each B-Tree node. Definitely food for thought. I'd
love to do something like that. I think radix compression would
remove much of the topological bloat the B-Tree creates versus using
blockmaps, generally speaking.
Space management is currently HAMMER's greatest weakness, but it only
applies to small storage systems. Several things in HAMMER's design
are simply not conducive to small storage. The storage model is not
fine-grained and (unless you are deleting a lot of stuff) needs
reblocking to actually recover the freed space. The flushing algorithms
need around 100MB of UNDO FIFO space on-media to handle worst-case
dependencies (mainly directory/file visibility issues on crash recovery),
and the front-end caching of modifying operations, since they don't
know exactly how much actual storage will be needed, needs ~16MB of
wiggle room to be able to estimate the on-media storage required to
back the operations cached by the front-end. Plus, on top of that,
any sort of reblocking also needs some wiggle room to be able to clean
out partially empty big-blocks and fill in new ones.
On the flip side, the reblocker doesn't just de-fragment the filesystem,
it is also intended to be used for expanding and contracting partitions,
and adding/removing volumes in the next release. Definitely a
multi-purpose utility.
So I'm not actually being cavalier about it, it's just that I had to
make some major decisions on the algorithm design and I decided to
weight the design more towards performance and large-storage, and
small-storage suffered a bit.
--
In any case, it sounds like Tux3 is using many similar ideas. I think
you are on the right track. I will add one big note of caution, drawing
from my experience implementing HAMMER, because I think you are going
to hit a lot of the same issues.
I spent 9 months designing HAMMER and 9 months implementing it. During
the course of implementing it I wound up throwing away probably 80% of
the original design outright. Among the major components I had to
rewrite were the LOG mechanic (which ultimately became the meta-data
UNDO FIFO), and the fine-grained storage mechanic (which ultimately
became coarse-grained). The UNDO FIFO was actually the saving grace,
once that was implemented most of the writer-ordering dependencies went
away (devolved into just flushing meta-data buffers after syncing the
UNDO FIFO)... I suddenly had much, much more freedom in designing
the other on-disk structures and algorithms.
What I found while implementing HAMMER was that the on-disk topological
design essentially dictated the extent of HAMMER's feature set AND
most of its deficiencies (such as having to reblock to recover space),
and the algorithms I chose often dictated other restrictions. But the
major restrictions came from the on-disk structures.
Because of the necessarily tight integration between subsystems I found
myself having to do major redesigns during the implementation phase.
Fixing one subsystem created a cascade effect that required tweaking other
subsystems. Even adding new features, such as the mirroring, required
significant changes in the B-Tree deadlock recovery code. I couldn't get
away from it. Ultimately I chose to simplify some of the algorithms
rather than have to go through another 4 months of rewriting. All
major designs are an exercise in making trade-offs in topology, feature
set, algorithmic complexity, debuggability, robustness, etc. The list
goes on forever.
Laters!
-Matt
Matthew Dillon
<dillon at backplane.com>