Two kinds of atomic commit

Sun Jul 27 22:53:09 PDT 2008

Here I describe two slightly different methods that Tux3 will use to
implement atomic commit of data and metadata.  Both methods combine 
logical and physical forward logging to perform atomic commits
efficiently.  The discussion below assumes we are updating a btree leaf,
for example to make room for more inode data or to add pointers to a
file index.  The same techniques apply equally well to all structures
in the filesystem.

1) The Update a Clone Method

Instead of directly modifying a disk block corresponding to a btree
leaf, we allocate a new block for the leaf and copy the contents of
the original block to the new block, only in the buffer cache (no copy
is performed on disk).  We can now alter the new block at will and
flush it out to disk without affecting the on-disk btree at all.  But
have not yet linked the new block into the btree.  We could accomplish
that by performing a similar clone update recursively up to the root of
the tree, which creates a new tree root.  The whole chain of new blocks
would then be flushed out to disk and a pointer to the new root stored
at some predictable location so it can be found later.  This is the
"phase tree" method that I invented for Tux2, and is commonly called
"copy on write" these days.  It could also be called the "sue me" method
because Netapp likes to sue those such as Sun who implement it.

Fortunately, there is a better way to do it that I invented recently and
which I have never heard of anyone using before.  We modify only the
leaf node of the btree by cloning and record the pointer to the clone
only in the cached btree index block, not on disk.  To be able to
reconstruct the cached version of the index node after a crash, we log a
logical change record to the disk that says "write this leaf into that
btree index node".

We make whatever changes we need to the clone of the leaf node, then
construct a two block transaction consisting of the clone and a commit
block.  The commit block points at the new leaf node and also carries
the logical update record that describes how to modify the parent index
node recorded on disk to point at the new leaf.  If the commit block can
be allocated immediately adjacent to the clone then we can construct a
single bio to transfer the clone and the commit block to disk in one
operation.  Otherwise we construct two bios.  If there is some previous
commit transaction that has been staged but not yet committed, then we
modify its commit block to point at the location where our new commit
block will be stored.  Otherwise we have to seek to some known location
on the disk to store the location of the new commit.

To be able to tell after a crash whether the commit block and its
associated data made it to disk, we store a sequence number to
distinguish this commit block from a possible similar commit that might
have landed accidentally at the same location earlier, and store a hash
of the two blocks in the commit block.

Now the transaction is ready to go out to disk.  But if the disk is busy
with previous traffic (we hope it is) then we just stage the transaction
for commit, and commit the predecessor transaction instead.  This gives
a chance for any following transaction to extend the forward log chain
instead of having to start a new chain, which would need an extra seek
and transfer.

The original version of the leaf node on disk is not yet freeable,
because it may be needed after a crash (together with the logical update
record) to reconstruct the cached btree node image.  Only after we "roll
up" the logical update log entry by atomically updating the parent node
may we free the original leaf block and the commit block that recorded
the logical update.  Such a rollup happen periodically, otherwise we
would end up with an unreasonably long chain of logical updates
needing to be replayed on remount to reconstruct the btree image in
cache.

2) The Update in Place Method

An alternate method of atomically updating our btree leaf node is to
write it to disk twice: the first time along with a commit block that
says we intend to overwrite the leaf node with the data part of the
transaction, and the second write does the actual overwrite.  If we
crash before completing the second write, replay will load the image
of the btree block from the commit transaction into cache and we are
ready to continue where we left off.

Like the clone method, the update in place method constructs a
transaction, tries to link it into an existing forward log chain, or
falls back to writing the commit block location to a known location if
it has to, then stages the transaction for commit, committing the
predecessor transaction in the forward chain if there is one.  The
overwrite transfer can be submitted as soon as the commit transaction
has completed, which might consist of one or two bio transfers depending
on whether it was possible to allocate the transaction contiguously or
not.

It is possible that the updated leaf block may be needed as the base for
applying logical updates to reconstruct the cached version of the leaf
as a consequence of some future transaction, but the future transaction
must be able to rely on a known state of the leaf if it wants to log
logical changes against it.  Therefore the future transaction may have
to wait on the update in place transaction to complete as well, to
guarantee that the desired leaf state can be reconstructed.

Any future update in place of the same block has to wait for the
overwrite transfer to complete before submitting its own overwrite
transfer, otherwise the second overwrite may race ahead of the first
creating a stale disk block.

The whole update in place transaction can be freed as soon as the
overwrite transfer completes.

Comparison of the methods

Both methods are very efficient ways to achieve atomic commit.  The
update in place method has the advantage of not perturbing the original
disk layout, possibly reducing fragmentation.  The update clone method
has the advantage of one less write operation.  Tux3 will provide a
mount option to specify which method to use.

A related technique is atomic append.  A whole range of data blocks can
be written out along with a commit block in a single bio transfer, or
a small number of bio transfers in one transaction, along with a logical
record in the commit block to say which file the blocks are to be
appended, in the form of a logical update in the commit block which will
eventually result in the addition of one or more extents to the file
index, update of the size and mtime, and removal of the new data blocks
from the free map.  Thus, a file write of considerable size can be
atomically committed with a single bio transfer.

Regards,

Daniel