Design note: Btree index block life cycle

Sun Mar 1 18:43:43 PST 2009

(sent again with word wrapping)

Hi Matt,

This note examines specifics of the Tux3 btree index block life cycle, including caching, redirection, flushing and reconstruction during log replay.  This addresses some points that you raised during our earlier discussion, for which my response at that time could fairly be described as hand waving.  One such point was potential complexity compared to working with a more "uniform" btree design like the one you have adopted for Hammer.

Hirofumi is CCed because he has bravely volunteered to complete the implementation of this part of the Tux3 atomic commit mechanism for userspace and kernel, thus providing me with the necessary motivation to state the details precisely.

I think that we can see at this point that the complexity required to implement the Tux3 atomic commit strategy is well bounded, especially as it is now mostly implemented and can be seen not to amount to a great deal of code.  Mind you, some of it has not yet seen the light of day.  For example, I still have not addressed all the details of allocation bitmap replay, which will require an additional, logical replay pass not described here.  Oh well, that will be another note, and hopefully all of this will be up and running fairly soon.

By the way, congratulation on your apparent success with Hammer.  I hope to foment a movement to port it to Linux, as it would seem to address a niche that is not well covered in Penguin land and is unlikely to be in the forseeable future, by my project or any other I know of, which is to say: cluster replication and continuous fine grained delta logging for high capacity servers.

Natural metadata position

For file index blocks, the "natural" position is near the file's inode table block, and near the beginning of the file data blocks.  Roughly speaking, we would like to place index blocks on the same track as both the associated inode table block and the first few file data blocks, so that the disk drive will pick these up together into its track cache when it first seeks to that track.  Similarly, we want each inode table index block to lie near the beginning of the group of inode table leaf blocks it references.  This is the ideal position for read access.

On the other hand, the ideal position for writing is wherever file data is currently being written.  Thus, the ideal metadata block position for reading conflicts with that for writing.  We resolve this conflict using a scheme I call metadata "kiting".

Pinned dirty metadata, or "Kiting"

As an optimization, dirty btree index blocks are not written to disk at each delta cycle, but instead flushed in relatively infrequent log rollup cycles.  This strategy is primarily intended to reduce the number of out of line seeks during writing, while also writing the metadata close to its natural position for fast read access.  Instead of writing a dirty metadata block to disk at each delta, we log details of the changes that have been made to it since the last time it was flushed, and write out the log block as part of the associated delta, at a location near the other blocks of the delta.  It does not matter if this is far from the referenced metadata blocks, because the log block will only ever be read on startup, and then only if no log rollup was done before shutdown.

A dirty metadata block held for flushing in this way is said to be pinned, because it records the only references to at least some of its child blocks.  It may not be evicted from cache until after we either flush the block to disk or log enough information to reconstruct the dirty block in cache on log replay.  I coined the term "kiting" for this scheme: such dirty metadata blocks normally live "up there in cache" and only "come down to earth" at relatively rare log flush intervals.

Besides saving seeks on read and write, kiting should reduce the total number of metadata block updates.  The realized savings may be anywhere from nothing to a factor of several hundred, or roughly the branching factor of a btree index block.  We will eventually measure this effect under various filesystem loads.

Block Redirect

A btree block intially loaded into cache as part of a btree probe for read access or leaf update is initially marked clean.  If a btree leaf is to be updated, for example by storing new inode attributes or new file data extents, then unless it is already dirty in the current delta, it is copied to a newly allocated position in the Tux3 volmap (formerly bdev, aka buffer) cache and marked dirty in the current delta, to be flushed as part of the current delta.  However, its parent, a btree index node, will not be flushed immediately, even though it must be altered to record the new position of the leaf block.  The parent is changed and redirected in cache, and its parent as well, right up the first block in the btree cursor patch that is already dirty in the current delta, which may be the root of the btree.

Thus, altering a file data leaf may entail redirecting the entire path from the leaf to the root of the btree.  If the root is redirected, then the cached inode must be marked dirty so that its data btree attribute will be updated in the current delta.  This may in turn cause the entire path from the affected inode table block to the root of the inode table btree to be redirected when the affected inode is flushed during delta staging.  So a single change may cause a minor storm of volume map cache activity.  However, it does not immediately cause a great deal of write activity, because changes to all the btree index nodes affected are recorded in a log block and the dirty index block itself is merely held dirty in cache until the next log rollup flush cycle some time in the future.   (Note: btree leaf nodes are flushed as part of the current delta, unlike btree index nodes.  This is due to the inherently greater complexity of logging and replaying leaf node operations, and because it is much easier to place leaf nodes close to their associated data without introducing much extra read or write seeking.)

After redirecting a chain of blocks in a btree cursor path, the original clean versions may be discarded, since they will no longer be accessed in cache.  (When we allow multiple deltas in the update pipeline, this rule will be slightly elaborated to accommodate redirections from in-flight dirty blocks in earlier deltas.)  Currently, the only case where we need to worry about exactly when to discard a clean metadata block is during bitmap flushing, which might trigger a read of an out of cache bitmap block, and thus a probe of at least part of the access path that we are currently modifying.  Locking strategy to handle this unusual case is interesting, to say the least.

When to redirect?

It is necessary to redirect btree nodes when first dirtied, rather than at delta staging time because it is only at the time of leaf dirtying that we have the entire path from the root of the btree to the leaf available in the cursor path, and thus are able to easily locate and redirect the successive parent blocks.  Finding the parent later when we only know that some volmap block is dirty would require some means of tracking the parent, an additional level of complexity compared to doing the redirect at the time we know the parent chain.  Additionally, we not only know the parent from the cursor, we have it locked, which will be quite helpful as we refine or SMP locking strategy in the direction of finer granularity.

Log output records

To guarantee that all changes to a pinned btree index node can be accurately reconstructed if it becomes necessary to replay the log, the following types of change are logged to log blocks that will be committed as part of the delta in progress:

   insert (logical key, parent block, child block)
   update (logical key, parent block, child block)
   delete (logical key, parent block)
   redirect (old block, new block)
   split (logical key, block, new block)
   merge (logical key, block, old block)

Log replay

On log replay, each logged operation must be replayed in cache, yielding exactly the same result as the original operation.  One interesting subtlety is that insert, update or delete operations are recorded against redirected blocks, which have not necessarily ever been written to disk.  In fact, these operations must be replayed against the contents of predecessor versions of the block.  The redirect records in the delta log chain, when replayed in proper order with other operations affecting a given tree block, should produce the correct dirty volmap block in every case.  We can verify this with a checksum, which suggests the following additional form of log output record:

   check (checksum, block)

which will trigger failure if a dirty metadata block does not match after being reconstructed at log replay time.

Log Rollup

An excessively long log chain could cause long replay times on restart (planned or unplanned) and might consume an excessive amount of cache.  Therefore from time to time a log rollup flush cycle is performed, triggered by current cache conditions, block device bandwidth availability, and length of the log.  Log rollup consists of moving all dirty pinned metadata blocks to the current delta writeout list.  After the delta has committed, these log blocks can be marked free in the allocation map.