Of Episodes, Phases and Forks

Daniel Phillips phillips at phunq.net
Wed Oct 22 21:26:20 PDT 2008


Hi Matt,

I copied you on this post for comment by way of fulfilling a promise I 
made a couple of months ago to describe the Tux3 atomic commit strategy 
in greater detail.

In a previous post I described the physical write pattern we wish to 
produce during a particularly common kind of heavy write load[1], but 
did not explain how that write pattern actually implements an atomic 
commit or how Tux3 will go about generating it.  Today I will address 
that omission by describing in some detail what goes on behind the 
scenes, and how the desired ACID compliant behavior is to be achieved.

A layered view

A Tux3 filesystem is a tree of trees, the apex of which is a btree 
index of inode table blocks from which file data btrees descend.

Tux3 takes a layered view of filesystem access: the top layer is 
the "active" view, consisting of inodes and associated data blocks 
cached in the page cache.  The bottom layer is the "stable" filesystem 
image durably recorded on disk.  In between lies a layer of dirty 
blocks being transferred to disk.

With the exception of direct IO, all changes to a Tux3 filesystem take 
place in cache.  The active view is the only one visible to userspace.  
It consists of clean blocks that have not been modified since being 
loaded from disk and dirty blocks that were either created in cache or 
loaded from disk and then altered.  Also in cache, Tux3 maintains 
metadata block images that specify the mapping between logically 
indexed buffers in page cache and physical disk locations.  When Tux3 
wishes to change such a mapping, it does so by changing cached metadata 
blocks, leaving the underlying images on disk unchanged until some time 
later.

Atomic commit

The "stable" image of a filesystem tree on disk is always a recording of 
some past state of the active view.  The stable image will become the 
active view of the filesystem tree in the event of unexpected restart.

The goal of atomic commit is to create a new stable image on disk 
corresponding to a more recent state of the active view, without in any 
way altering the existing stable image until the new stable image has 
been completely recorded on disk.  (Unlike other versioning filesystem 
designs, Tux3 only needs to maintain a single stable image of a 
filesystem tree at any time because versioning information is 
represented at the leaves of its filesystem tree, as opposed to 
generating multiple tree roots.)

The difference between one stable state of a filesystem on disk and the 
next is called a "phase".  Each phase transfers some set of dirty 
blocks to disk at locations marked as free space in the previous stable 
tree.  Filesystem operations respect phase boundaries: all buffers 
modified by any given filesystem operation must lie entirely within a 
single phase.  It follows that if a phase transfer is atomic then all 
filesystem operations in it are also atomic.

A phase is thus the unit of atomic updating in Tux3.  A filesystem sync 
or a file fsync simply causes a phase transition, and the calling 
process waits for that phase to become the new stable phase before 
returning to its caller.  This model permits Tux3 the luxury of 
ignoring the potentially intricate relationships between buffers within 
a phase (for example, dirty dirent, inode table, allocation bitmap and 
data blocks all resulting from a given filesystem operation) and 
concentrating on the easier task of partitioning dirty blocks accurately 
into phases.
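
To make this concrete, here is a rough sketch in C of what the sync path 
amounts to.  It is purely illustrative and not actual Tux3 code; the 
structure fields and the begin_new_phase and wait_for_phase_stable 
helpers are names I have invented for the example:

   /* Illustrative only: a sync closes the active phase and waits for it
    * to become the new stable phase on disk. */
   struct sb {
           unsigned active_phase;   /* phase currently accepting dirty blocks */
   };

   void begin_new_phase(struct sb *sb);                        /* assumed helper */
   void wait_for_phase_stable(struct sb *sb, unsigned phase);  /* assumed helper */

   int tux3_sync(struct sb *sb)
   {
           unsigned phase = sb->active_phase;
           begin_new_phase(sb);               /* dirty blocks so far now belong
                                                 to the phase being committed */
           wait_for_phase_stable(sb, phase);  /* return only once that phase
                                                 has become the stable image */
           return 0;
   }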

We can re-express the "no alteration of the stable state" rule simply in 
terms of phases:

   No overwrite rule: "No block may be overwritten unless it is
   marked as free in the allocation map of the previous phase"

So if a block belonging to the stable filesystem tree image needs to be 
updated, a new physical location must be found for it, and its parent 
block must be modified to reference the new location.  This is 
called "physical remapping".

The same rule applies recursively to the parent block; however, Tux3 does 
not remap recursively all the way to the root of the filesystem tree as 
in my original Phase Tree design[2].  Instead, it encodes in the commit 
block of the update transaction a promise to modify the parent block on 
disk some time in the future, while actually making the modification 
only to the cached version of the parent block.
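
By way of illustration, the remap step might look roughly like the 
following C sketch.  None of this is actual Tux3 code; balloc, 
set_child_pointer and record_promise are hypothetical helpers standing 
in for the real allocator, the cached parent update and the commit 
block logging:

   /* Illustrative only: move a block that must not be overwritten to a
    * freshly allocated location, fix the cached parent, and promise to
    * fix the on-disk parent later rather than recursing to the root. */
   typedef unsigned long long block_t;

   struct buffer {
           block_t block;           /* current physical location */
           struct buffer *parent;   /* cached parent metadata block */
   };

   block_t balloc(unsigned count);                                /* assumed */
   void set_child_pointer(struct buffer *parent, unsigned offset,
                          block_t child);                         /* assumed */
   void record_promise(struct buffer *parent, block_t child,
                       unsigned offset);                          /* assumed */

   void remap_block(struct buffer *buf, unsigned offset_in_parent)
   {
           buf->block = balloc(1);   /* stable copy on disk stays untouched */
           set_child_pointer(buf->parent, offset_in_parent, buf->block);
           record_promise(buf->parent, buf->block, offset_in_parent);
   }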

Transferring a phase to disk

A phase is transferred to disk in a series of update transactions, which 
are not themselves atomic and may be completed in any order, but must 
all be completed before the phase completes.

Each update transaction has a commit block and zero or more member 
blocks.  The commit block contains a list of promises with respect to 
the parents of member blocks.  Each promise is a triple:

   { body extent, parent address, offset }

The list of commit block promises amounts to a recipe for reconstructing 
the dirty cache state of a Tux3 filesystem, working from the most 
recent images of the respective parent blocks in the stable filesystem 
tree.
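
In C terms, the promise triple and the commit block that carries it 
might be laid out roughly as below.  This is a sketch only; the field 
names and widths are my own placeholders, not a committed on-disk 
format:

   /* Illustrative layout of a commit block and its promises. */
   #include <stdint.h>

   struct promise {
           uint64_t body;        /* start of the member block extent */
           uint32_t count;       /* extent length in blocks */
           uint64_t parent;      /* address of the parent block to patch */
           uint32_t offset;      /* which pointer within the parent */
   };

   struct commit_block {
           uint64_t phase;              /* phase this transaction belongs to */
           uint64_t chain;              /* previous commit block (a guess) */
           uint32_t promise_count;      /* number of entries below */
           struct promise promises[];   /* one per member extent */
   };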

Phase transition

Each time Tux3 begins a new filesystem operation, for example a rename 
as invoked by the VFS:

   http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2539

it decides whether it should begin a new phase, based on several 
factors: cache pressure, elapsed time and congestion of underlying 
block device queues.  After possibly starting a new phase, the 
filesystem operation continues, adding all blocks that it modifies to 
the current, "active" phase.

Starting a new phase requires incrementing the phase counter in the 
cached filesystem superblock and flushing all dirty inodes.  Each dirty 
block buffer records the phase in which it was dirtied.  (It can only 
be dirtied in a single phase, in accordance with the no overwrite rule.)  
If a filesystem operation wishes to alter a block that is dirty in a 
previous phase, then a "fork" operation is performed first whereby the 
dirty block buffer is removed from the buffer cache and added to a list 
of forked buffers attached to the previous phase.  A copy of the buffer 
is inserted in its place, which may then be freely altered to update 
the active filesystem view.
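
Sketched in C, the fork decision that each block modification makes 
might look like this.  Again this is illustrative only; the buffer 
fields and the copy_buffer, add_to_forked_list and replace_in_cache 
helpers are invented for the example:

   /* Illustrative only: hand the caller a buffer it may dirty in the
    * active phase, forking the buffer if an earlier phase still owns it. */
   struct buffer {
           int dirty;                  /* holds unwritten changes? */
           unsigned dirty_phase;       /* phase in which it was dirtied */
           struct buffer *fork_next;   /* link on the old phase's forked list */
           /* ... block data, physical address, and so on ... */
   };

   struct buffer *copy_buffer(struct buffer *buf);                  /* assumed */
   void add_to_forked_list(unsigned phase, struct buffer *buf);     /* assumed */
   void replace_in_cache(struct buffer *old, struct buffer *copy);  /* assumed */

   struct buffer *buffer_for_write(struct buffer *buf, unsigned active_phase)
   {
           if (!buf->dirty) {
                   buf->dirty = 1;                  /* first dirtying, no fork */
                   buf->dirty_phase = active_phase;
                   return buf;
           }
           if (buf->dirty_phase == active_phase)
                   return buf;                      /* already ours this phase */

           /* Dirty in a previous phase: fork.  The old image goes on that
            * phase's forked list so it can still be written out unchanged. */
           struct buffer *copy = copy_buffer(buf);
           add_to_forked_list(buf->dirty_phase, buf);
           replace_in_cache(buf, copy);
           copy->dirty = 1;
           copy->dirty_phase = active_phase;
           return copy;
   }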

Forking blocks as necessary instead of waiting on phase completion 
avoids filesystem access stalls.  The only time a Tux3 filesystem 
should stall is when cache memory runs out due to underlying block 
device congestion.

The case of memory mapped blocks is interesting in that it initially 
appears as though enforcing the no overwrite rule is impossible without 
resorting to techniques such as page table entry copy protection, but 
this turns out to be unnecessary, because only the state of metadata 
buffers affects filesystem consistency, and metadata is never memory 
mapped.  Isolation between snapshots is also not a problem, because a 
snapshot includes only data already transferred to disk, which cannot 
be further affected by a write to a memory map.

Phase completion

When all update transactions for a particular phase have completed (that 
is, endio has been called for all associated bio transfers) then the 
phase is completed by adding a phase commit block to a chain of 
completed phases.  The phase commit block in turn references a chain of 
completed update transactions.  On unexpected restart, this has the 
effect of automatically discarding all update transactions belonging to 
an incomplete phase, by virtue of never transferring the start of the 
update chain to disk.  Thus Tux3 automatically falls back to the last 
completed phase, which is just the behavior we want, since the 
filesystem will always be consistent and will always properly respect 
sync and fsync semantics.
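
As a rough sketch (invented names, and with the locking and interrupt 
context details of a real endio path omitted), completion might amount 
to:

   /* Illustrative only: the last endio for a phase writes the phase
    * commit block, which links the phase onto the chain of completed
    * phases on disk, then wakes any sync or fsync waiters. */
   struct deferred_free;   /* see the freeing sketch further down */

   struct phase {
           unsigned pending;                   /* update transactions in flight */
           struct phase *prev_completed;       /* chain of completed phases */
           struct deferred_free *free_list;    /* blocks awaiting release */
   };

   void write_phase_commit_block(struct phase *phase);   /* assumed */
   void apply_deferred_frees(struct phase *phase);       /* assumed, see below */
   void wake_phase_waiters(struct phase *phase);         /* assumed */

   void update_transaction_endio(struct phase *phase)
   {
           if (--phase->pending)
                   return;                      /* transfers still outstanding */
           write_phase_commit_block(phase);     /* phase is now durable */
           apply_deferred_frees(phase);         /* freed blocks become reusable */
           wake_phase_waiters(phase);           /* unblock sync/fsync callers */
   }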

Any tasks waiting on completion of the phase due to calling sync or 
fsync are unblocked and may continue, safe in the assurance that the 
expected semantics have been enforced.

For the time being, Tux3 implements stronger than necessary semantics 
for fsync by treating it identically to sync, which in turn implements 
stronger semantics on Linux than Posix specifies.  In a future 
optimization, Tux3 may relax fsync semantics to avoid flushing dirty 
buffers belonging to files other than the fsync target.
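
In code terms this is about as simple as it sounds.  A sketch, reusing 
the tux3_sync name from the illustrative sync sketch earlier; the inode 
and superblock types here are again placeholders:

   /* Illustrative only: for now an fsync simply syncs the whole
    * filesystem, which is stronger than Posix requires for one file. */
   struct sb;
   struct inode {
           struct sb *sb;
   };

   int tux3_sync(struct sb *sb);   /* see the earlier sketch */

   int tux3_fsync(struct inode *inode)
   {
           return tux3_sync(inode->sb);
   }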

Freeing blocks safely

Tux3 may free a block in one of two ways:

   1) Explicit free via inode truncate or file deletion.

   2) Implicit free by remapping a block to a new physical location.

The first affects the active phase and the second occurs just after 
phase transition during inode flushing.  In either case, the freed 
blocks are not marked free in the cached allocation bitmaps 
immediately, but placed on a list to be marked free on phase 
completion.  This avoids the possibility that a freed block may be 
reallocated and overwritten in the same phase, violating the no 
overwrite rule.
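
Here is a sketch of the deferred free list, with invented names and a 
plain malloc standing in for whatever allocator the real code will use:

   /* Illustrative only: blocks freed during a phase are queued and only
    * returned to the allocation bitmap once the phase has completed, so
    * they cannot be reallocated and overwritten within the same phase. */
   #include <stdlib.h>

   typedef unsigned long long block_t;

   struct deferred_free {
           block_t block;
           unsigned count;
           struct deferred_free *next;
   };

   struct phase {
           struct deferred_free *free_list;
   };

   void bitmap_free(block_t block, unsigned count);   /* assumed helper */

   int defer_free(struct phase *phase, block_t block, unsigned count)
   {
           struct deferred_free *item = malloc(sizeof *item);
           if (!item)
                   return -1;
           item->block = block;
           item->count = count;
           item->next = phase->free_list;
           phase->free_list = item;
           return 0;
   }

   void apply_deferred_frees(struct phase *phase)   /* at phase completion */
   {
           while (phase->free_list) {
                   struct deferred_free *item = phase->free_list;
                   phase->free_list = item->next;
                   bitmap_free(item->block, item->count);
                   free(item);
           }
   }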

Cache state reconstruction

The state of a Tux3 stable filesystem tree includes two major elements:

  * A "base" filesystem tree recorded literally

  * A set of "promises" made in transaction commit blocks

The "real" stable tree consists of the base tree with all promises 
realized.  The process of realizing promises is called a "rollup" and 
is done by committing current versions of cached metadata blocks to 
disk.  The parent blocks of rolled up metadata blocks are not modified 
directly, but once again, via promises recorded in commit blocks.  So 
Tux3 never actually rolls up the stable filesystem tree completely.  
This behavior is analogous to the concept of a journal, which always 
contains part of the "real" filesystem image except immediately after a 
journal flush.
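
A rollup might then be sketched as follows.  Illustrative only; the 
dirty metadata list and the balloc, write_block and record_promise 
helpers are invented stand-ins:

   /* Illustrative only: commit the current cached image of each dirty
    * metadata block to a newly allocated location and promise the parent
    * update, rather than rewriting the parents (or the root) in place. */
   typedef unsigned long long block_t;

   struct buffer {
           block_t block;               /* physical location to write to */
           struct buffer *parent;       /* cached parent metadata block */
           unsigned parent_offset;      /* pointer slot within the parent */
           struct buffer *next_dirty;   /* list of dirty metadata buffers */
   };

   block_t balloc(unsigned count);                               /* assumed */
   void write_block(struct buffer *buf);                         /* assumed */
   void record_promise(struct buffer *parent, block_t child,
                       unsigned offset);                         /* assumed */

   void metadata_rollup(struct buffer *dirty_metadata)
   {
           struct buffer *buf;
           for (buf = dirty_metadata; buf; buf = buf->next_dirty) {
                   buf->block = balloc(1);   /* never overwrite the stable copy */
                   write_block(buf);         /* current cached version to disk */
                   record_promise(buf->parent, buf->block, buf->parent_offset);
           }
   }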

On unmount Tux3 just does a sync, but not a rollup, and completes the 
currently active phase.  On restart, Tux3 reconstructs the cache state 
consisting of dirty metadata blocks that differ from the stable base 
tree according to the promises made in commit blocks.  This requires 
reading base metadata blocks into cache and inserting block pointers, 
updating allocation bitmaps and setting inode attributes in accordance 
with the commit block promises.
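
The replay on remount might be sketched like this, again with invented 
names and the promise layout guessed earlier; only the block pointer 
part is shown, and the bitmap and inode attribute updates would follow 
the same pattern:

   /* Illustrative only: walk a commit block's promises and apply each
    * one to a cached copy of its parent block from the stable base tree,
    * recreating the dirty metadata cache as it stood before restart. */
   #include <stdint.h>

   struct promise {
           uint64_t body;        /* member extent start */
           uint32_t count;       /* extent length */
           uint64_t parent;      /* parent block to patch */
           uint32_t offset;      /* pointer slot within the parent */
   };

   struct commit_block {
           uint32_t promise_count;
           struct promise promises[];
   };

   struct buffer;   /* cached block, details omitted */

   struct buffer *read_into_cache(uint64_t block);                 /* assumed */
   void set_child_pointer(struct buffer *parent, uint32_t offset,
                          uint64_t child);                         /* assumed */
   void mark_metadata_dirty(struct buffer *buf);                   /* assumed */

   void replay_commit_block(struct commit_block *commit)
   {
           uint32_t i;
           for (i = 0; i < commit->promise_count; i++) {
                   struct promise *p = &commit->promises[i];
                   struct buffer *parent = read_into_cache(p->parent);
                   set_child_pointer(parent, p->offset, p->body);
                   mark_metadata_dirty(parent);   /* differs from the base tree */
           }
   }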

The process of reconstructing cache state on remount is identical for 
remount after either planned unmount or unexpected interruption.  This 
is a Very Good Thing in that Tux3's recovery code will be exercised on 
every remount, which should help to address the typical problem where 
recovery code is poorly tested except for unplanned field testing, at 
which time logical flaws may be discovered at the expense of serious 
data loss.

Metadata rollups may be performed at any time in Tux3, the optimal 
frequency lying somewhere between too often, where the number of 
metadata writes and out of line seeks increases perceptibly, and too 
rarely, where large numbers of dirty metadata blocks increase cache 
pressure and reconstruction of dirty cache state at remount time might 
cause a measurable delay.  A rollup frequency of about once per second 
under heavy write load seems about right as I speculated in my earlier 
posting, but can be exposed as a performance tunable for good measure.

The interval between two metadata rollups is called an "episode".  The 
episode mechanism in Tux3 exists purely to reduce metadata traffic 
versus recursive copy on write of a filesystem tree as employed in ZFS 
and Btrfs.  Episodic rollup was assumed in my earlier analysis of 
atomic commit performance, where I estimated that Tux3 would attain 
something between 90% and 99% of available disk bandwidth during untar 
of a large directory tree.

Introducing metadata checksums

Because the rules associated with accurate metadata rollup are fairly 
subtle, it is a virtual certainty that implementation errors will creep 
in.  To 
catch these errors early, we will checksum each metadata block affected 
by a commit block update promise and store the checksum in the commit 
block so that it can be verified on restart during metadata cache 
reconstruction.
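
A sketch of the check, with a placeholder hash; the actual checksum 
algorithm and exactly where it is stored in the commit block are still 
open questions:

   /* Illustrative only: verify a reconstructed metadata block against
    * the checksum recorded with its promise, catching rollup bookkeeping
    * errors at remount rather than long after the damage is done. */
   #include <stdint.h>

   uint32_t block_checksum(const void *data, unsigned size);   /* assumed hash */

   int verify_metadata_block(const void *reconstructed, unsigned blocksize,
                             uint32_t recorded_sum)
   {
           if (block_checksum(reconstructed, blocksize) != recorded_sum)
                   return -1;   /* reconstruction does not match the promise */
           return 0;
   }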

Speculative recovery

It may prove to be practical to recover parts of an incomplete phase on 
unexpected restart.  For example, if a file is only appended to and not 
linked or unlinked in a phase, then Tux3 could deduce that some data 
blocks appearing in update transactions for an incomplete phase can be 
safely added.  This sort of aggressive recovery could be controlled by 
a mount option, or it could be provided automatically and safely by 
generating a new volume version at the point of the most recently 
completed phase, and providing the "probable" additional data on a 
sibling branch.  Such speculative recovery could possibly prove useful, 
for example in the case of log files.

Implementation

The atomic commit method described here will now be implemented in 
userspace as the last major coding project before porting Tux3 to 
kernel.  While the above description may sound somewhat complex, classic 
journalling is no simpler.[3]

The Tux3 approach has a number of advantages versus journalling.  For 
one thing, atomic commit of data as well as metadata is considerably 
less expensive using a redirect strategy than seeking to a journal, 
writing data, and later rewriting the data to its final destination.  
The promise/rollup strategy for updating parent pointers turns data 
block remapping into an inexpensive operation, with the remaining cost 
being merely the "opportunity cost" of holding onto space occupied by 
data blocks for a while in order to obey the no overwrite rule.  This 
cost applies only to the rewrite case, but since rewrite performance is 
important, I will return to the question of how Tux3 will accomplish 
this efficiently in a future note.

Due to atomic commit of data being relatively cheap in Tux3, there is no 
need for a "revoke" facility such as that employed by Ext3 to prevent 
data blocks from being re-allocated as metadata and subsequently being 
overwritten by journal replay[3].

Another advantage is that the size of an atomic transaction is limited only 
by the amount of memory available to cache metadata and the amount of 
free space on disk, not by the size of the journal.  This means that we 
do not have to contend with the possibility of journal deadlock, where 
an uncommitted transaction is larger than the remaining unpinned space 
in the journal.  So long as there is sufficient free space on the 
filesystem, the transaction can be committed, and otherwise it can be 
fairly aborted with ENOSPC.

Some other aspects of atomic commit via the method described here are 
essentially the same as for journalling.  For example, copy on write of 
cached buffers must be performed in both methods in order to avoid 
stalling filesystem access on transaction commit.  The requirements for 
avoiding memory deadlock are also largely the same.

Versus the recursive copy on write strategy employed by ZFS and Btrfs, 
Tux3 does not need to perform recursive copying, but instead has to 
worry about the details of rollup episodes, roughly an even trade in 
terms of complexity.  Tux3 does not attempt to have its atomic commit 
strategy double as a method of retaining snapshots, which compared to 
ZFS, eliminates the need to maintain a dead block list, and compared to 
Btrfs, avoids the need to maintain physical block reference counts.

I expect a serviceable implementation of the atomic strategy described 
here to translate into somewhere between 500 and 1,000 lines of code, 
in other words, somewhat less than you might think considering all the 
gory details that must be dealt with.

[1] "Thinking about Syncing"
    http://kerneltrap.org/mailarchive/tux3/2008/10/12/3618934/thread

[2] "The phase tree algorithm"
    
http://web.archive.org/web/20010210210054/http://innominate.org/~phillips/tux2/phase.tree.algorithm.txt

[3] "EXT3, Journalling Filesystem"
   
www.linux.genova.it/sections/04_cosa_facciamo/02_corsi_linux/98_novembre_2002/03_DesignImplementationEXT3.pdf

Regards,

Daniel
