Two kinds of atomic commit

Daniel Phillips phillips at phunq.net
Mon Jul 28 13:04:29 PDT 2008


On Monday 28 July 2008 09:58, Matthew Dillon wrote:
> :1) The Update a Clone Method
> :
> :2) The Update in Place Method
> :
> :Daniel
> 
>     Hmm.  Those seem fairly complex.

Then I need to rewrite the post so it seems as simple as it is :-)

>     How would you deal with incidental 
>     operations on the B-Tree such as doing a split?  A single insert
>     could wind up requiring a split of every internal node on the way
>     down to the leaf.  I guess you could clone the blocks, but there are
>     a ton of parent and child linkages that have to be changed when you
>     split a node.  It sounds pretty expensive.

In this case a log transaction is created containing all of the split
nodes as physical updates and possibly some merged nodes from the free
tree.  Btree splitting can always be committed atomically and
independently of any other activity on the filesystem.  It is nicely
bounded by the btree depth.  Allocations that do not require splitting
the free tree are logged logically, leaving the affected free tree
blocks dirty in cache.  So the allocation updates come "for free",
being tucked into the same commit block that governs the split btree
leaves.

>     Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
>     Incidental meaning those operations (such as splits) which lead up to
>     an insertion or deletion but do not actually perform the insertion
>     or deletion.   On crash recovery you would undo partially completed
>     B-Tree operations and then REDO the related logical operations.

I now see how UNDO works, thankyou for your patient explanations.  The
point I overlooked is, fsync (and friends) is indeed the only barrier
you have to worry about.

Still, I think it is good to try to get as much as possible of what was
going on in a bit burst of activity with no fsync durably onto disk.
Redo will clearly leave the filesystem in a consistent state less far
back in time than undo.

So my general strategy is to log big changes like splits as "physical"
full block updates and small ones like allocating an extent as logical
edit records in the commit blocks of the physical updates.

The complexity you noted above is due to making sure that the on-disk
image of a block is always what the logical edit record expects it
to be at the time the logical edit is replayed.

>     Having to allocate new blocks for B-Tree deletions could create a big
>     issue when the filesystem becomes full or near-full.

Yes, allocate-in-free is usually nasty.  I need to ensure that there is
always a reserve of blocks available on disk that is more than the
maximum possible transaction that can be accepted.  This simple idea
can get really complex, I know.  One nice trick to simplify things a
little is to have a really generous limit on the maximum number of
dirty blocks that are allowed, when the filesystem has lots of free
space, and shrink that limit progressively as free space approaches
zero.

I now see what you were driving at with your point about interlocking
namespace transactions, etc.  While each VFS transaction can indeed be
committed on its own, atomically and independently, to do so would be
death for throughput.  So it is necessary to batch up a lot of these
transactions, and be mindful of the interdependencies between the VFS
transactions and the underlying data structures.  The rule is,
underlying data changes required by any given VFS transaction must
never lie in more than one commit.  This can be accomplished without
actually tracking the physical representation interdependencies between
VFS transactions.  Instead, just count the dirty cache blocks and send
out an atomic commit for the entire set of transactions when some
threshold is passed.  Now the challenge is to figure out how to avoid
stalling during the commit.

It is thanks to you, your searching questions and the example of Hammer
that I was forced to understand these issues clearly. :-)

Note!  In Linux, VFS means "Virtual Filesystem Switch", that is, the
security, synchronization and abstraction layer that exists above
filesystems in Linux.  I think it means "Virtual File System" to you,
which is just "filesystem" to Linux hackers.

Regards,

Daniel





More information about the Kernel mailing list