dillon at apollo.backplane.com
Mon Nov 5 12:36:29 PST 2007
:Speaking of on-disk B-Trees, ReiserFS' biggest problems are all based
:on its use of flexible B-Trees. For instance, the difficulty of
:correctly rebuilding the tree if a node is damaged, and correctly
:detecting if a ReiserFS system is hosted on a file in another
:(supposedly damaged) ReiserFS system, are noted as big problems that
:are supposedly solved for Reiser4. Are there likely to be similar
:issues in Hammer for the time being, or have you already planned far
:ahead for these cases with extra information in the nodes?
HAMMER has an independant B-Tree for each cluster. The clusters are
then glued together (leaf -> new cluster root) to form a unified B-Tree.
Except for lookups and no-free-space cases, B-Tree operations are
local to a cluster.
A cluster in HAMMER is around 64MB and can have up to 4096 records.
If a cluster needs to be recovered, HAMMER will simply throw away
the B-Tree and regenerate it from scratch using the cluster's
record list. This way all B-Tree I/O operations can be asynchronous
and do not have to be flushed on fsync.
At the same time HAMMER will remove any records whos creation
transaction id's are too large (i.e. not synchronized with the cluster
header), and will zero out the delete transaction id for any records
whos deletion transaction id's are too large.
Essentially recovery is handled by performing a rollback on a
HAMMER doesn't bother to scan clusters at mount time (that would be
expensive!). Instead it will recover a cluster on the fly when the
cluster is first accessed and HAMMER notices that it is marked 'open'
when it shouldn't be.
The real performance issue for HAMMER is going to be B-Tree insertions
and rebalancing across clusters. I think most of the issues can be
resolved with appropriate heuristics and by a backup ground process to
slowly rebalance clusters. This will require a lot of work, though,
and only minimal rebalancing will be in for the release.
More information about the Kernel