Of Episodes, Phases and Forks
Daniel Phillips
phillips at phunq.net
Wed Oct 22 21:26:20 PDT 2008
Hi Matt,
I copied you on this post for comment by way of fullfilling a promise I
made a couple of months ago to describe the Tux3 atomic commit strategy
in greater detail.
In a previous post I described the physical write pattern we wish to
produce during a particularly common kind of heavy write load[1], but
did not explain how that write pattern actually implements an atomic
commit or how Tux3 will go about generating it. Today I will address
that omission by describing in some detail what goes on behind the
scenes, and how the desired ACID compliant behavior is to be achieved.
A layered view
A Tux3 filesystem is a tree of trees, the apex of which is an btree
index of inode table blocks from which file data btrees descend.
Tux3 takes a layered view of filesystem access: the top layer is
the "active" view, consisting of inodes and associated data blocks
cached in the page cache. The bottom layer is the "stable" filesystem
image durably recorded on disk. In between lies a layer of dirty
blocks being transferred to disk.
With the exception of direct IO, all changes to a Tux3 filesystem take
place in cache. The active view is the only one visible to userspace.
It consists of clean blocks that have not been modified since being
loaded from disk and dirty blocks that were either created in cache or
loaded from disk and then altered. Also in cache, Tux3 maintains
metadata block images that specify the mapping between logically
indexed buffers in page cache and physical disk locations. When Tux3
wishes to change such a mapping, it does it by changing cached metadata
blocks and does not change the underlying images on disk until some
time later.
Atomic commit
The "stable" image of a filesystem tree on disk is always a recording of
some past state of the active view. The stable image will become the
active view of the filesystem tree in the event of unexpected restart.
The goal of atomic commit is to create a new stable image on disk
corresponding to a more recent state of the active view, without in any
way altering the existing stable image until the new stable image has
been completely recorded on disk. (Unlike other versioning filesystem
designs, Tux3 only needs to maintain a single stable image of a
filesystem tree at any time because versioning information is
represented at the leaves of its filesystem tree, as opposed to
generating multiple tree roots.)
The difference between one stable state of a filesystem on disk and the
next is called a "phase". Each phase transfers some set of dirty
blocks to disk at locations marked as free space in the previous stable
tree. Filesystem operations respect phase boundaries: all buffers
modified by any given filesystem operation must lie entirely within a
single phase. It follows that if a phase transfer is atomic then all
filesystem operations in it are also atomic.
A phase is thus the unit of atomic updating in Tux3. A filesystem sync
or a file fsync simply causes a phase transition, and the calling
process waits for that phase to become the new stable phase before
returning to its caller. This model permits Tux3 the luxury of
ignoring the potentially intricate relationships between buffers within
a phase (for example, dirty dirent, inode table, allocation bitmap and
data blocks all resulting from a given filesystem operation) and
concentrate on the easier task of partitioning dirty blocks accurately
into phases.
We can re-express the "no alteration of the stable state" rule simply in
terms of phases:
No overwrite rule: "No block may be overwritten unless it is
marked as free in the allocation map of the previous phase"
So if a block belonging to the stable filesystem tree image needs to be
updated, a new physical location must be found for it, and its parent
block must be modified to reference the new location. This is
called "physical remapping".
The same rule applies recursively to the parent block, however Tux3 does
not remap recursively all the way to the root of the filesystem tree as
in my original Phase Tree design[2]. Instead, it encodes in the commit
block of the update transaction a promise to modify the parent block on
disk some time in the future, while actually making the modification
only to the cached version of the parent block.
Transferring a phase to disk
A phase is transferred to disk in a series of update transactions, which
are not themselves atomic and may be completed in any order, but must
all be completed before the phase completes.
Each update transaction has a commit block and zero or more member
blocks. The commit block contains a list of promises with respect to
the parents of member blocks. Each promise is a triple:
{ body extent, parent address, offset }
The list of commit block promises amounts to a recipe for reconstructing
the dirty cache state of a Tux3 filesystem, working from the most
recent images of the respective parent blocks in the stable filesystem
tree.
Phase transition
Each time Tux3 begins a new filesystem operation, for example a rename
as invoked by the VFS:
http://lxr.linux.no/linux+v2.6.27/fs/namei.c#L2539
it decides whether it should begin a new phase, based on several
factors: cache pressure, elapsed time and congestion of underlying
block device queues. After possibly starting a new phase, the
filesystem operation continues, adding all blocks that it modifies to
the current, "active" phase.
Starting a new phase requires incrementing the phase counter in the
cached filesystem superblock and flushing all dirty inodes. Each dirty
block buffer records the phase in which it was dirtied. (It can only
be dirtied in single phase in accordance with the no overwrite rule.)
If a filesystem operation wishes to alter a block that is dirty in a
previous phase, then a "fork" operation is performed first whereby the
dirty block buffer is removed from the buffer cache and added to a list
of forked buffers attached to the previous phase. A copy of the buffer
is inserted in its place, which may then be freely altered to update
the active filesystem view.
Forking blocks as necessary instead of waiting on phase completion
avoids filesystem access stalls. The only time a Tux3 filesystem
should stall is when cache memory runs out due to underlying block
device congestion.
The case of memory mapped blocks is interesting in that it initially
appears as though enforcing the no overwrite rule is impossible without
resorting to techniques such as page table entry copy protection, but
this turns out to be unnecessary, because only the state of metadata
buffers affects filesystem consistency, and metadata is never memory
mapped. Isolation between snapshots is also not a problem, because a
snapshot includes only data already transferred to disk, which cannot
be further affected by a write to a memory map.
Phase completion
When all update transactions for a particular phase have completed (that
is, endio has been called for all associated bio transfers) then the
phase is completed by adding a phase commit block to a chain of
completed phases. The phase commit block in turn references a chain of
completed update transactions. On unexpected restart, this has the
effect of automatically discarding all update transactions belonging to
an incomplete phase, by virtue of never transferring the start of the
update chain to disk. Thus Tux3 automatically falls back to the last
completed phase, which is just the behavior we want, since the
filesystem will always be consistent and will always properly respect
sync and fsync semantics.
Any tasks waiting on completion of the phase due to calling sync or
fsync are unblocked and may continue, safe in the assurance that the
expected semantics have been enforced.
For the time being, Tux3 implements stronger than necessary semantics
for fsync by treating it identically to sync, which in turn implements
stronger semantics on Linux than Posix specifies. In a future
optimization, Tux3 may relax fsync semantics to avoid flushing dirty
buffers belonging to files other than the fsync target.
Freeing blocks safely
Tux3 may free a block in one of two ways:
1) Explicit free via inode truncate or file deletion.
2) Implicit free by remapping a block to a new physical location.
The first affects the active phase and the second occurs just after
phase transition during inode flushing. In either case, the freed
blocks are not marked free in the cached allocation bitmaps
immediately, but placed on a list to be marked free on phase
completion. This avoids the possibility that a freed block may be
reallocated and overwritten in the same phase, violating the no
overwrite rule.
Cache state reconstruction
The state of a Tux3 stable filesystem tree includes two major elements:
* A "base" filesystem tree recorded literally
* A set of "promises" made in transaction commit blocks
The "real" stable tree consists of the base tree with all promises
realized. The process of realizing promises is called a "rollup" and
is done by committing current versions of cached metadata blocks to
disk. The parent blocks of rolled up metadata blocks are not modified
directly, but once again, via promises recorded in commit blocks. So
Tux3 never actually rolls up the stable filesystem tree completely.
This behavior is analogous to the concept of a journal, which always
contains part of the "real" filesystem image except immediately after a
journal flush.
On unmount Tux3 just does a sync, but not a rollup, and completes the
currently active phase. On restart, Tux3 reconstructs the cache state
consisting of dirty metadata blocks that differ from the stable base
tree according to the promises made in commit blocks. This requires
reading base metadata blocks into cache and inserting block pointers,
updating allocation bitmaps and setting inode attributes in accordance
with the commit block promises.
The process of reconstructing cache state on remount is identical for
remount after either planned unmount or unexpected interruption. This
is a Very Good Thing in that Tux3's recovery code will be exercised on
every remount, which should help to address the typical problem where
recovery code is poorly tested except for unplanned field testing, at
which time logical flaws may be discovered at the expense of serious
data loss.
Metadata rollups may be performed at any time in Tux3, the optimal
frequency lying somewhere between too often, where the number of
metadata writes and out of line seeks increases perceptibly, and too
rarely, where large numbers of dirty metadata blocks increase cache
pressure and reconstruction of dirty cache state at remount time might
cause a measurable delay. A rollup frequency of about once per second
under heavy write load seems about right as I speculated in my earlier
posting, but can be exposed as a performance tunable for good measure.
The interval between two metadata rollups is called an "episode". The
episode mechanism in Tux3 exists purely to reduce metadata traffic
versus recursive copy on write of a filesystem tree as employed in ZFS
and Btrfs. Episodic rollup was assumed in my earlier analysis of
atomic commit performance, where I estimated that Tux3 would attain
something between 90% and 99% of available disk bandwidth during untar
of a large directory tree.
Introducing metadata checksums
The rules associated with accurate metadata rollup being fairly subtle,
it is a virtual certainty that implementation errors will creep in. To
catch these errors early, we will checksum each metadata block affected
by a commit block update promise and store the checksum in the commit
block so that it can be verified on restart during metadata cache
reconstruction.
Speculative recovery
It may prove to be practical to recover parts of an incomplete phase on
unexpected restart. For example, if a file is only appended to and not
linked or unlinked in a phase, then Tux3 could deduce that some data
blocks appearing in update transactions for an incomplete phase can be
safely added. This sort of aggressive recovery could be controlled by
a mount option, or it could be provided automatically and safely by
generating a new volume version at the point of the most recently
completed phase, and providing the "probable" additional data on a
sibling branch. Such speculative recovery could possibly prove useful,
for example in the case of log files.
Implementation
The atomic commit method described here will now be implemented in
userspace as the last major coding project before porting Tux3 to
kernel. While the above description may sound somewhat complex, so is
classic journalling.[3]
The Tux3 approach has a number of advantages versus journalling. For
one thing, atomic commit of data as well as metadata is considerably
less expensive using a redirect strategy than seeking to a journal,
writing data, and later rewriting the data to its final destination.
The promise/rollup strategy for updating parent pointers turns data
block remapping into an inexpensive operation, with the remaining cost
being merely the "opportunity cost" of holding onto space occupied by
data blocks for a while in order to obey the no overwrite rule. This
cost applies only to the rewrite case, but since rewrite performance is
important, I will return to the question of how Tux3 will accomplish
this efficiently in a future note.
Due to atomic commit of data being relatively cheap in Tux3, there is no
need for a "revoke" facility such as that employed by Ext3 to prevent
data blocks from being re-allocated as metadata and subsequently being
overwritten by journal replay[3].
Another advantage is, the size of an atomic transaction is limited only
by the amount of memory available to cache metadata and the amount of
free space on disk, not by the size of the journal. This means that we
do not have to contend with the possibility of journal deadlock, where
an uncommitted transaction is larger than the remaining unpinned space
in the journal. So long as there is sufficient free space on the
filesystem, the transaction can be committed, and otherwise it can be
fairly aborted with ENOSPC.
Some other aspects of atomic commit via the method described here are
essentially the same as for journalling. For example, copy on write of
cached buffers must be performed in both methods in order to avoid
stalling filesystem access on transaction commit. The requirements for
avoiding memory deadlock are also largely the same.
Versus the recursive copy on write strategy employed by ZFS and Btrfs,
Tux3 does not need to perform recursive copying, but instead has to
worry about the details of rollup episodes, roughly an even trade in
terms of complexity. Tux3 does not attempt to have its atomic commit
strategy double as a method of retaining snapshots, which compared to
ZFS, eliminates the need to maintain a dead block list, and compared to
Btrfs, avoids the need to maintain physical block reference counts.
I expect a serviceable implementation of the atomic strategy described
here to translate into somewhere between 500 and 1,000 lines of code,
in other words, somewhat less than you might think considering all the
gory details that must be dealt with.
[1] "Thinking about Syncing"
http://kerneltrap.org/mailarchive/tux3/2008/10/12/3618934/thread
[2] "The phase tree algorithm"
http://web.archive.org/web/20010210210054/http://innominate.org/~phillips/tux2/phase.tree.algorithm.txt
[3] "EXT3, Journalling Filesystem"
www.linux.genova.it/sections/04_cosa_facciamo/02_corsi_linux/98_novembre_2002/03_DesignImplementationEXT3.pdf
Regards,
Daniel
More information about the Kernel
mailing list