[Tux3] Comparison to Hammer fs design

Daniel Phillips phillips at phunq.net
Tue Aug 5 01:27:54 PDT 2008

On Sunday 27 July 2008 20:46, Matthew Dillon wrote:
> :Tux3 never backs anything out, so there is some skew to clear up.
>     Got it.  I'm still not clear on the physical vs logical log
>     interactions required for crash recovery, but I think I understand
>     everything else.

The reason for my unclear crash recovery algorithm is, it was not fully
designed when we started chatting, only the basic principles of robust
micro-transaction commit.  You forced me to go work out the details of
the high level transaction architecture.

The big points you got me focussed on were, what about mass commits a la
fsync, and how do you sort out the ties that arise between otherwise
independent high level operations, and how do you preserve the ordering
of low level operations to reflect the ordering constraints of the high
level operations?  Which is what you were trying to say with your create
a, create a/b example, I now realize.

> :>     One of the reasons why HAMMER uses an UNDO log is that it is a lot
> :>     less expensive then a REDO log.  I'll show you why:
> :> 
> :>     Physical operations related to a single flush group in HAMMER:
> :> 
> :> 	modify element in B-Tree node A	(create UNDO w/ original contents of
> :> 					 B-Tree node A)
> :> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> :> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> :> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> :> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> :
> :Nice, now I understand.  But do you not have to hold all filesystem
> :transactions that depend on the modification until the btree node has
> :completed writing to disk?  With logical logging you only have to wait
> :on the logical commit to complete, into which may be packed a number of
> :other changes for efficiency.
>     The only major interlock between frontend searches and backend flushes
>     occurs on an inode-by-inode basis and only for the duration of the
>     modifying B-Tree operation, which does *not* include the writing of
>     the modified meta-data buffer to disk.

I am well down path to identifying a similar frontend/backend
separation in Tux3.  The back end data consists of all the dirty
metadata and file data cache blocks (buffer cache and page cache(s)
respectively) and the "phase" boundaries, which separate the results of
successive large groups of high level operations so that there are no
disk-image dependencies between them.

I now have two nice tools for achieving such a clean separation between
transaction phases:

  1) Logical log records

  2) Dirty cache block cloning

The latter takes advantage of logical log records to limit the effect of
a high level operation to just the leaf blocks affected.  Where two
operations that are supposed to be in two different transaction phases
affect the same leaf block(s) then the block(s) will be cloned to
achieve proper physical separation.  So I can always force a clean
boundary between phases without generating a lot of dirty blocks to do
it, an important property when it comes to anti-deadlock memory

The payoff for a clean frontend/backend separation is, as you say, a
clean design.  The art in this is to accomplish the separation without
any throughput compromise, which I am willing to believe you have done
in Hammer, and which I now wish to achieve by a somewhat different
means in Tux3.

>     Once the operation concludes 
>     we are left with dirty meta-data buffers which for all intents and
>     purposes have transfered the in-memory cached B-Tree records (held
>     as separate entities in a red-black tree in-memory) into the media
>     view of the B-Tree (having performed the actual B-Tree operations
>     necessary to insert or delete the record), and the inode lock is
>     then released.  The flusher flushes each inode on the flush list in
>     turn, with certain long-lived inodes (e.g. truncation of a 1TB
>     file to 0) only getting partial goodness and then being moved to
>     the next flush list.  The inode lock is held for each individual
>     inode as it is being flushed, one inode at a time.

I need a means of remembering which blocks are members of which phases,
and an rbtree may be indeed better than the list I was thinking of.  A
hash table is probably ok too.  I also have to remember the logical
update records that were staged when the high level operation took place
in order to avoid having the btree leaf changes climb up into the index
nodes copy on write style, and to avoid constant changes to the free map

I can coopt some bits of the buffer flags to keep track of which phase a
given dirty buffer belongs to.  To avoid having to attach buffers to
every dirty page cache page, I might use some page flags in that case. 
Or that state can be in a secondary bookkeeping structure like the
rbtree or hash.  Details to be settled before the kernel coding starts.

>     The actual dirty meta-data buffers are flushed later on, without
>     any inode lock held and without having any blocking effect on the
>     frontend.

Roger, same here.  The inode locks and things like rename lock are only
held during the course of the high level operation.  Tux3 does not have
to worry about that at all, the VFS takes care of it nicely.  Tux3 only
has to lock the data structures that keep track of the phases
(spinlocks) and the btrees embedded in the buffer and page cache
(possibly hashed mutexes).

Note: when I write VFS I mean "virtual filesystem switch" and when you
write VFS you mean "virtual filesystem", which in Linux is just called
a filesystem.

>     This media-write sequence occurs at the end, after all 
>     inodes have been processed.  UNDO finishes writing, we wait for
>     I/O to complete, we update the volume header, we wait for I/O
>     to complete, then we asynchronously flush the dirty media buffers
>     (and don't wait for that I/O to complete so the writes can be merged
>     with the beginning of the next flush group).

Here, Tux3 (and Hammer) can always set up a very long forward log chain,
only updating the header once per multi-transaction commit phase.

>     That is what I meant when I said the frontend was disconnected
>     from the backend.  The worst the frontend ever does is issue reads
>     on media meta-data as part of any merged memory-media B-Tree searches
>     it must perform to locate information.  Any modifying operations made
>     by the frontend will create records in the in-memory red-black tree,
>     but will *NOT* do any modifying oprations on meta-data buffers
>     representing the on-media B-Tree.  Frontend lookups are resolved by
>     doing a merged-search with two cursors (one on the in-memory red-black
>     tree, and one on the on-media B-Tree).

It is beautiful, now that I understand it.

>     I considered doing the interlock at the record level but hit
>     some hicups in that one of the cursors may have already passed the
>     memory record being committed but the other cursor (the media view)
>     may not have yet passed the B-Tree index where that record would be
>     written to.  So if a commit were to occur right then, without a
>     governing inode lock, the merged search might see a duplicate record
>     (or no record in either view in the case of a deletion).

Interlock at the micro-transaction level really sucks.  There are way
too many interlocks.  It is much nicer just to have the phase
boundaries, and not care about the dependencies between micro
transactions withing the transaction phase, knowing that they will all
be satisfied because the entire phase will be committed atomically.

Now here is another way of looking at the difference between undo and
redo.  Each of my redo log entries leaves the filesystem meta-metadata
in a consistent state in the sense that all btree structure will be
consistent.  But the logical metadata may be inconsistent, for example,
a directory may have appear to have been deleted on disk while some
deletions of files it contained were not persistently recorded.  I think
one can take advantage of such partial, low level consistency in order
to do a rapid scan and repair, preserving as much as possible of what
the filesystem was doing before the crash of a busy system with lots of
transactions in flight.

> :>     I just run the UNDOs backwards, which allows me to cache the in-flush
> :>     undo areas with a fixed set of small throw-away tracking structures
> :>     in system memory.  If a structure gets thrown away it just means an
> :>     extranious UNDO record will be appended to the UNDO FIFO.
> :
> :I think I see it, but I have my doubts because you have to block
> :transactions waiting for the up to date copy of the btree data to land
> :on disk.  Either that, or you may give userspace the impression that
> :some transaction has gotten onto stable storage when that is not the
> :case.
>     Userland doesn't care whether the actual transactions have made it
>     to disk or not unless it actually calls fsync().  If it does then
>     it will block through the necessary flush cycles.  fsync() on
>     HAMMER is pretty slow as it will likely be syncing a lot more
>     in the flush group then just the requested inode.

Yes, this was the point that finally made me understand how your undo
approach works.  Sorry for making you explain it so many times ;-)

>     If userland does not call fsync() (or 'sync', etc) then it doesn't
>     block at all, except as described above during the modify-buffer
>     phase of the B-Tree updates (on an inode-by-inode basis so actual
>     blocking is virtually non-existant).  It is often possible to copy
>     an entire directory tree without initiating a backend flush.

I presume that if one userland process is blocked on fsync, the others
are still able to continue dumping their operations into cache?

>     Media B-Tree elements, inclusive of any held in dirty meta-data buffers,
>     are locked on a B-Tree node by B-Tree node basis.  These are very
>     short lived locks and have nothing to do with the actual flushing
>     of the dirty meta-data buffers to the physical media.  The frontend
>     always gets shared locks, the backend always gets exclusive locks.
>     Some blocking between the frontend and backend occurs but not a whole
>     lot.  The backend flusher actually blocks against itself more often
>     (it runs multiple threads), due to the locality of reference within
>     the B-Tree for nearby inodes.

There is no need for multiple flushing threads in Tux3, I will use bio
submissions instead, which are completely asynchronous, returning
completion status in interrupt mode as they do.

>     I'd recommend dropping support for NFSv2.  It is not really worth 
>     supporting any more.   Does it even support 64 bit inodes? (I don't 
>     remember), or 64 bit file offsets?  NFSv2 is garbage.

It's a thought.  But anyway, I don't see a big downside to the PHTree
approach.  It might even end up faster than HTree (which does not
maintain dirent stability) by virtue of the slightly smaller (5-10%)
cache footprint.  If classic HTree turns out to be faster, I could offer
both as an option.  The code that differs between them should be about

>     You should be able to use 63 bits of the cookie, I don't know why
>     you wouldn't use the high bit of the lsb 32 bit part.  There is no
>     requirement that that bit be 0.  In fact, the RFC says the cookie is
>     a 64 bit unsigned integer and you should be able to use all 64 bits.
>     If linux is not allowing all 64 bits to be used then it's a serious
>     bug in linux.  The cookies are supposed to be opaque, just like the
>     file handle.

You are right about the cookie 64 bitness, I don't know where I picked
up that lore, mea culpa.

> :>     which means that records in the B-Tree aren't even close to being
> :>     in any sort of lexical order.  I will probably wind up changing
> :>     the hash calculation a bit to get some lexical sorting information
> :>     stuffed into it... should be doable since the hash keys are 64 bits.
> :
> :Yes, I noticed that.  Check out dx_hack_hash:
> :
> :   http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38
> :..
> :It distributed hashes of ascii strings very well for me, with few
> :funnels.  It way outperformed some popular hashes like TEA.  Ted's cut
> :down crytographic hash is yet more uniform but costs much more CPU.
>     I like :-).  I'll probably end up with multiple directory hash formats
>     too before this is over.  Fortunately I have room in the inode to
>     specify the format the directory will use.

I overlooked the part where you mentioned giving some lexical ordering
to your hash to get better locality between lexically related records. 
I am not sure this is a win, it will require a lot of analysis, which
starts to drift off the critical path for me, since crystallizing the
transaction design is really a lot more important.

An inverse idea to making the hash nonrandom is to make the goal inum
depend somewhat on the name hash for a newly created file.  So
relatively fewer PHTree index leaf nodes, which contain [hash, dirblock]
pairs, will be have to be reloaded into cache during deletion of a
massive directory in physical dirent order.  The goal inum has to depend
mostly on the physical dirent location though, to avoid thrashing the
inode table during mass creation, deletion or stat.  I am not sure
whether there is a gain to be had vs keeping the random hash completely
random, which does deliver a benefit in balancing the tree better.

> :Yes, I have thought about it a little more, and I imagine something
> :like a small array of dirty bits that climb up the btree with the help
> :of logical logging, where each bit means that there is something
> :interesting for some corresponding replication target to look at,
> :somewhere down in the subtree.
>     You can do with one transaction id / mtime field in the B-Tree node
>     header.  Strictly speaking you don't need a per-element field on
>     internal elements, though I have one there too because I had the space.

Yes, the common case of replication is only against the "current"
version and that optimization will work well there.  There might be
multiple current versions in a Tux3 volume,

In the case of many virtual servers running off the same filesystem
volume, each with its own writable snapshot, and each replicating
thatsnapshot somewhere, I need to be able to track dirtiness in multiple
versions simultaneously.  Or just not worry to much about that and sweep
through the entire inode table btree.  Multiple dirty block streams
could be output in one pass in this case.

> :>     Not for directory entries.  I'm pretty sure Linux uses a throw-away
> :>     namecache just like we do, so unless it has a mechanic for flush
> :>     sequencing of dirty namecache entries you need in-VFS structural
> :>     support for caching the operations.
> :
> :The Linux dentry cache actually implements proper namespace semantics
> :all by itself without needing to flush anything, which is what Ramfs
> :is.  Ramfs just lives entirely in cache without even being backed by
> :a ramdisk, until the power goes off.
>     That's cool.  You will want to flush it since you are backing the
>     system caches with a real filesystem, and you would probably need
>     a tracking structure of some sort anyway to deal with the 
>     directory entry create/delete dependancies.

It flushes itself out when it is not pinned by something like ramfs for
example, which has no backing store.  The VFS takes care of the
create/delete dependencies.

> :>     Jeeze, BSD's have been doing that forever.  That's what VOP_BMAP is
> :>     used for.  I'm a little surprised that Linux doesn't do that yet.
> :>     I'll expand on that down further.
> :
> :Either you forgot to expand or I missed it.  I am interested.
>     Oops, I think I did forget to expand that.  Basically buffer cache
>     buffers under kernel management are logically indexed.  Each vnode
>     has its own tree of buffer cache buffers.
>     When the kernel wants to flush dirty buffers or cluster adjacent
>     buffers together for I/O it uses VOP_BMAP to map the logical offsets
>     to physical offsets (so kernel-supported clustering is thus based
>     on physical offsets).  The VFS is responsible for implementing the BMAP
>     operation.  It is the BMAP operation which assigns actual physical
>     media blocks to a high level logically mapped filesystem buffer.
>     The mapping is persistent as long as the buffer cache buffer exists.
>     Of course clean buffers can be thrown away by the kernel so at some
>     point in the future the kernel might re-request the VOP_BMAP for
>     that logical offset.
>     The VFS can choose when to implement the physical mapping.  It can
>     tell the kernel to take a hike and do it when the kernel actually
>     writes the buffer out (being a logical buffer the flush goes through
>     the VFS via VOP_STRATEGY), it can do it early on when the actual
>     read() or write() opration occurs, etc.

That sounds a lot like the <fs>_get_block interface the Linux filesystem
block IO library uses to map buffers to physical addresses.  That is just
a set of convenience functions the filesystem can use to satisfy such
vmm calls into the filesystem as ->writepage.  The filesystem just has
to supply its own _get_block method.  Between that and mount options
parsing, there is not a whole lot more you have to do to bring up a
filesystem on Linux, that is, if you are not really worried about
performance.  Mapping each block with a separate call into the
filesystem actually sucks pretty badly CPU wise, but this is masked by
slow spindle speeds that stretch the CPU out over time.  That masking
disappears with some of the high performance storage hardware coming
out now and the wasteful effects of that interface start to show.

>     BSD buffer cache buffers are backed by vm_page_array[], an array of
>     VM pages, and the base of the buffer does not have to be page-aligned.
>     So variable page sizes are not necessary.

vm_page_array... is that like a Linux page cache?  (aka mapping, aka
address_space)  Buffers on page cache pages do not have to be
page-aligned in Linux, or rather, each page in Linux is either fully
covered with buffers or has none at all.  What sucks about all this is
having buffers _and_ page descriptors, where the two perform very
similar functions and duplicate a lot of state, causing endless bugs
due to skew in the cached state between the two kinds of objects.

> :Indeed, I need to handle that.  Since I do want to stick with a simple
> :linear sequence of logical log commits for now, I do not want to leave
> :any of them sitting around for a log time.  One easy thing to do is to
> :put any orphans aside at rollup time, in an unversioned orphan file, say.
>     I had a little time to think about it.  I think the easiest solution
>     is, if you are forced to flush an open inode with 0 links, to actually
>     record the open state in a separate B-Tree so upon remounting after a
>     crash you can simply scan that B-Tree and finish off the inodes
>     you find in that tree.  That way it would not have to be held in the
>     log at all as anything other then a transitory entry.

That is exactly the idea.  The question is whether to have a dedicated
btree or to use a file which is also a btree.  The orphan file (if that
turns out to be the mechanism) would implement a simple linked list of
orphans, and the transitory log entries would reduce the frequency the
orphan file has to be updated, allowing it to be updated in batches.

This post required a week to respond to because I thought I should also
post some code in parallel.



More information about the Kernel mailing list