[Tux3] Comparison to Hammer fs design

Fri Jul 25 16:00:07 PDT 2008

(resent after subscribing to the the kernel at crater)

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> 
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> :     Pedro.
> 
>     Those are interesting comments.   I think I found Daniel's email address
>     so I am adding him to the To:   Dan, feel free to post this on your Tux
>     groups if you want.

How about a cross post?

>     I did consider multiple-parentage...  that is the ability to have a
>     writable snapshot that 'forks' the filesystem history.  It would be
>     an ultra cool feature to have but I couldn't fit it into the B-Tree
>     model I was using.  Explicit snapshotting would be needed to make it
>     work, and the snapshot id would have to be made part of the B-Tree key,
>     which is fine.  HAMMER is based around implicit snapshots (being able
>     to do an as-of historical lookup without having explicitly snapshotted
>     bits of the history).

Yes, that is the main difference indeed, essentially "log everything" vs
"commit" style versioning.  The main similarity is the lifespan oriented
version control at the btree leaves.

>     I would caution against using B-Tree iterations 
>     in historical lookups, B-Trees are only fast if you can pretty much
>     zero in on the exact element you want as part of the primary search
>     mechanic.  Once you have to iterate or chain performance goes out the
>     window.  I know this because there are still two places where HAMMER
>     sometimes has to iterate due to the inexact nature of an as-of lookup.
>     Multiple-parentage almost certainly require two inequality searches,
>     or one inequality search and an iteration.  A single timeline only
>     requires one inequality search.

Once I get down to the leaf level I binary search on logical address in
the case of a file index btree or on version in the case of an inode
table block, so this cost is still Log(N) with a small k.  For a
heavily versioned inode or file region this might sometimes result in
an overflow block or two that has to be linearly searched which is not
a big problem so long as it is a rare case, which it really ought to be
for the kinds of filesystem loads I have seen.  A common example of a
worst case is /var/log/messages, where the mtime and size are going to
change in pretty much every version, so if you have hourly snapshots
and hold them for three months it adds up to about 2200 16 byte inode
table attributes to record, about 8 inode table leaf blocks.  I really
do not see that as a problem.  If it goes up to 100 leaf blocks to
search then that could be a problem.

Usually, I will only be interested in the first and last of those
blocks, the first holding the rarely changing attributes including the
root of the file index btree and the last holding the most recent
incarnation of the mtime/size attribute.

The penultimate inode table index block tells me how many blocks a
given inode lives in because several blocks will have the same inum
key.  So the lookup algorithm for a massively versioned file becomes:
1) read the first inode table block holding that inum; 2) read the last
block with the same inum.  The latter operation only needs to consult
the immediate parent index block, which is locked in the page cache at
that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort.  For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a problem.

>     I couldn't get away from having a delete_tid (the 'death version
>     numbers').  I really tried :-)  There are many cases where one is only
>     deleting, rather then overwriting.

By far the most common case I would think.  But check out the versioned
pointer algorithms.  Surprisingly that just works, which is not exactly
obvious:

   Versioned pointers: a new method of representing snapshots
   http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is dumb
because there will never be any file data shared by a successor version
in that case.  So the thing to do is just create an entirely new
versioned file data attribute for each rewrite, bulking up the inode
table entry a little but greatly constraining the search for versions
to delete and reducing cache pressure by not loading unrelated version
data when traversing a file.

>     Both numbers are then needed to 
>     be able to properly present a historical view of the filesystem.
>     For example, when one is deleting a directory entry a snapshot after
>     that deletion must show the directory entry gone.

. ..which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone.  Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table).  I rely on two things to
make this nice: 1) Most dirent changes will be logically logged and
only rolled up into the versioned file blocks when there are enough to
be reasonably sure that each changed directory block will be hit
numerous times in each rollup episode.  (Storing the directory blocks
in dirent-create order as in PHTree makes this very likely for mass
deletes.)  2) When we care about this is usually during a mass delete,
where most or all dirents in each directory file block are removed
before moving on to the next block.

>     Both numbers are 
>     also needed to be able to properly prune out unwanted historical data
>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
>     historical data which is no longer desired) creates holes in the
>     sequence so once you start pruning out unwanted historical data
>     the delete_tid of a prior record will not match the create_tid of the
>     following one (historically speaking).

Again, check out the versioned pointer algorithms.  You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address.  Maybe the hang is
that you do not organize the btrees by logical address (or inum in the
case of the inode table tree).  I thought you did but have not read
closely enough to be sure.

>     At one point I did collapse the holes, rematching the delete_tid with
>     the create_tid of the following historical record, but I had to remove
>     that code because it seriously complicated the mirroring implementation.
>     I wanted each mirror to be able to have its own historical retention
>     policy independant of the master.  e.g. so you could mirror to a backup
>     system which would retain a longer and more coarse-grained history then
>     the production system.

Fair enough.  I have an entirely different approach to what you call
mirroring and what I call delta replication.  (I reserve the term
mirroring to mean mindless duplication of physical media writes.)  This
method proved out well in ddsnap:

   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257

(Sorry about the massive URL, you can blame Linus for that;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical.  The same idea works for a
versioned filesystem: compute all the leaf data that differs between
two versions, per inum, and apply the resulting delta to the
corresponding inums in the remote replica.  The main difference vs a
ddsnap volume delta is that not all of the changes are physical blocks,
there are also changed inode attributes, so the delta stream format
has to be elaborated accordingly.

>     I also considered increasing the B-Tree fan-out to 256 but decided
>     against it because insertions and deletions really bloated up the
>     UNDO FIFO.  Frankly I'm still undecided as to whether that was a good 
>     idea, I would have prefered 256.  I can actually compile in 256 by
>     changing a #define, but I released with 64 because I hit up against
>     a number of performance issues:  bcopy() overheads for insertions
>     and deletions in certain tests became very apparent.  Locking
>     conflicts became a much more serious issue because I am using whole-node
>     locks rather then element locks.  And, finally, the UNDO records got
>     really bloated.  I would need to adjust the node locking and UNDO
>     generation code significantly to remove the bottlenecks before I could
>     go to a 256-element B-Tree node.

I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to perform
updating of the btree nodes in bulk.  I am pretty sure this will work
for you as well, and you might want to check out that forward logging
trick.

That reminds me, I was concerned about the idea of UNDO records vs
REDO.  I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond the
associated UNDO records.  Otherwise, if you acknowledge, crash, and
prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and be wrong
about that.  I have no doubt you did the right thing there, but it is
not obvious from your design documentation.

>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
>     another limiting factor for the fan-out I can use.  My elements
>     are 64 bytes each.

Yes, I mostly have 16 byte elements and am working on getting most of
them down to 12 or 8.

>     64x64 = 4K per B-Tree node.  I decided to go 
>     with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
>     64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
>     other auxillary info.  That's instead of using a radix-compressed key.
>     Radix compressed keys would have doubled the complexity of the B-Tree
>     code, particularly with the middle-of-tree pointer caching that
>     HAMMER does.

I use two-stage lookup, or four stage if you count searches within
btree blocks.  This makes the search depth smaller in the case of small
files, and in the case of really huge files it adds depth exactly as
appropriate.  The index blocks end up cached in the page cache (the
buffer cache is just a flavor of page cache in recent Linux) so there
is little repeated descending through the btree indices.  Instead, most
of the probing is through the scarily fast radix tree code, which has
been lovingly optimized over the years to avoid cache line misses and
SMP lock contention.

I also received a proposal for a "fat" btree index from a collaborator
(Maciej) that included the file offset but I really did not like the...
fattening.  A two level btree yields less metadata overall which in my
estimation is more important than saving some bree probes.  The way
things work these days, falling down from level 1 cache to level 2 or
from level 2 to level 3 costs much more than executing some extra CPU
instructions.  So optimization strategy inexorably shifts away from
minimizing instructions towards minimizing cache misses.

>     The historical access mechanic alone added major complexity to the
>     B-Tree algorithms.  I will note here that B-Tree searches have a very
>     high cpu overhead no matter how you twist it, and being able to cache
>     cursors within the B-Tree is absolutely required if you want the
>     filesystem to perform well.  If you always start searches at the root
>     your cpu overhead will be horrendous... so plan on caching cursors
>     from the get-go.

Fortunately, we get that for free on Linux, courtesy of the page cache
radix trees :-)

I might eventually add some explicit cursor caching, but various
artists over the years have noticed that it does not make as much
difference as you might think.

>     If I were to do radix compression I would also want to go with a
>     fully dynamic element size and fully dynamic fan-out in order to
>     best-compress each B-Tree node.  Definitely food for thought.

Indeed.  But Linux is braindamaged about large block size, so there is
very strong motivation to stay within physical page size for the
immediate future.  Perhaps if I get around to a certain hack that has
been perenially delayed, that situation will improve:

   "Variable sized page objects"
   http://lwn.net/Articles/37795/

>     I'd love to do something like that.  I think radix compression would
>     remove much of the topological bloat the B-Tree creates verses using
>     blockmaps, generally speaking.

Topological bloat?

>     Space management is currently HAMMER's greatest weakness, but it only
>     applies to small storage systems.  Several things in HAMMER's design
>     are simply not condusive to a small storage.  The storage model is not
>     fine-grained and (unless you are deleting a lot of stuff) needs
>     reblocking to actually recover the freed space.  The flushing algorithms
>     need around 100MB of UNDO FIFO space on-media to handle worst-case
>     dependancies (mainly directory/file visibility issues on crash recovery),
>     and the front-end caching of modifying operations, since they don't
>     know exactly how much actual storage will be needed, need ~16MB of
>     wiggle room to be able to estimate the on-media storage required to
>     back the operations cached by the front-end.  Plus, on top of that,
>     any sort of reblocking also needs some wiggle room to be able to clean
>     out partially empty big-blocks and fill in new ones.

Ah, that is a very nice benefit of Tux3-style forward logging I forgot
to mention in the original post: transaction size is limited only by
available free space on the volume, because log commit blocks can be
anywhere.  Last night I dreamed up a log commit block variant where the
associated transaction data blocks can be physically discontiguous, to
make this assertion come true, shortly after reading Stephen Tweedie's
horror stories about what you have to do if you must work around not
being able to put an entire, consistent transaction onto stable media
in one go:

   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if:

  a) File data is always committed atomically
  b) All commit transactions are full transactions

He did remind me (via time travel from year 2000) of some details I
should write into the design explicitly, for example, logging orphan
inodes that are unlinked while open, so they can be deleted on replay
after a crash.  Another nice application of forward logging, which
avoids the seek-happy linked list through the inode table that Ext3
does.

>     On the flip side, the reblocker doesn't just de-fragment the filesystem,
>     it is also intended to be used for expanding and contracting partitions,
>     and adding removing volumes in the next release.  Definitely a
>     multi-purpose utility.

Good point.  I expect Tux3 will eventually have a reblocker (aka
defragger).  There are some really nice things you can do, like:

  1) Set a new version so logical changes cease for the parent
     version.

  2) We want to bud off a given directory tree into its own volume,
     so start by deleting the subtree in the current version.  If
     any link counts in the directory tree remain nonzero in the
     current version then there are hard links into the subtree, so
     fail now and drop the new version.

  3) Reblock a given region of the inode table tree and all the files
     in it into one physically contiguous region of blocks

  4) Add a free tree and other once-per volume structures to the new
     home region.

  5) The new region is now entirely self contained and even keeps its
     version history.  At the volume manager level, map it to a new,
     sparse volume that just has a superblock in the zeroth extent and
     the new, mini filesystem at some higher logical address.  Remap
     the region in the original volume to empty space and add the
     empty space to the free tree.

  6) Optionally reblock the newly budded filesystem to the base of the
     new volume so utilties that do not not play well with sparse
     volumes do not do silly things.

>     So I'm not actually being cavalier about it, its just that I had to
>     make some major decisions on the algorithm design and I decided to
>     weight the design more towards performance and large-storage, and
>     small-storage suffered a bit.

Cavalier was a poor choice of words, the post was full of typos as well
so I am not proud of it.  You are solving a somewhat different problem
and you have code out now, which is a huge achievement.  Still I think
you can iteratively improve your design using some of the techniques I
have stumbled upon.  There are probably some nice tricks I can get from
your code base too once I delve into it.

>     In anycase, it sounds like Tux3 is using many similar ideas.  I think
>     you are on the right track.

Thankyou very much.  I think you are on the right track too, which you
have a rather concrete way of proving.

>     I will add one big note of caution, drawing 
>     from my experience implementing HAMMER, because I think you are going
>     to hit a lot of the same issues.
> 
>     I spent 9 months designing HAMMER and 9 months implementing it.  During
>     the course of implementing it I wound up throwing away probably 80% of
>     the original design outright.  Amoung the major components I had to
>     rewrite were the LOG mechanic (which ultimately became the meta-data
>     UNDO FIFO), and the fine-grained storage mechanic (which ultimately
>     became coarse-grained).  The UNDO FIFO was actually the saving grace,
>     once that was implemented most of the writer-ordering dependancies went
>     away (devolved into just flushing meta-data buffers after syncing the
>     UNDO FIFO)... I suddenly had much, much more freedom in designing
>     the other on-disk structures and algorithms.
>
>     What I found while implementing HAMMER was that the on-disk topological
>     design essentially dictated the extent of HAMMER's feature set AND
>     most of its deficiencies (such as having to reblock to recover space),
>     and the algorithms I chose often dictated other restrictions.  But the
>     major restrictions came from the on-disk structures.
> 
>     Because of the necessarily tight integration between subsystems I found
>     myself having to do major redesigns during the implementation phase.
>     Fixing one subsystem created a cascade effect that required tweaking other
>     subsystems.  Even adding new features, such as the mirroring, required
>     significant changes in the B-Tree deadlock recovery code.  I couldn't get
>     away from it.  Ultimately I chose to simplify some of the algorithms
>     rather then have to go through another 4 months of rewriting.  All
>     major designs are an exercise in making trade-offs in topology, feature
>     set, algorithmic complexity, debuggability, robustness, etc.  The list
>     goes on forever.
> 

The big ahas! that eliminated much of the complexity in the Tux3 design
were:

  * Forward logging - just say no to incomplete transactions
  * Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation of
Tux3 is that I can now background the LVM3 design process and see what
Tux3 really wants from it.  I am determined to match every checkbox
volume management feature of ZFS as efficiently or more efficiently,
without violating the traditional layering between filesystem and
block device, and without making LVM3 a Tux3-private invention.

>     Laters!

Hopefully not too much later.  I find this dialog very fruitful.  I just
wish such dialog would occur more often at the design/development stage
in Linux and other open source work instead of each group obsessively
ignoring all "competing" designs and putting huge energy into chatting
away about the numerous bugs that arise from rushing their design or
ignoring the teachings of history.

Regards,

Daniel