[Tux3] Comparison to Hammer fs design

Matthew Dillon dillon at apollo.backplane.com
Fri Jul 25 19:10:33 PDT 2008


:>     so I am adding him to the To:   Dan, feel free to post this on your Tux
:>     groups if you want.
:
:How about a cross-post?

    I don't think it will work; only subscribers can post to the DFly groups,
    but we'll muddle through it :-)  I will include the whole of the previous
    posting so the DFly groups see the whole thing, if you continue to get
    bounces.

    I believe I have successfully added you as an 'alias address' to the
    DragonFly kernel list so you shouldn't get bounced if you Cc it now.
 
:Yes, that is the main difference indeed, essentially "log everything" vs
:"commit" style versioning.  The main similarity is the lifespan oriented
:version control at the btree leaves.

    Reading this and a little more that you describe later let me make
    sure I understand the forward-logging methodology you are using.
    You would have multiple individually-tracked transactions in
    progress due to parallelism in operations initiated by userland and each
    would be considered committed when the forward-log logs the completion
    of that particular operation?

    If the forward log entries are not (all) cached in-memory that would mean
    that accesses to the filesystem would have to be run against the log
    first (scanning backwards), and then through to the B-Tree?  You
    would solve the need for having an atomic commit ('flush groups' in
    HAMMER), but it sounds like the algorithmic complexity would be
    very high for accessing the log.

    And even though you wouldn't have to group transactions into larger
    commits the crash recovery code would still have to implement those
    algorithms to resolve directory and file visibility issues.  The problem
    with namespace visibility is that it is possible to create a virtually
    unending chain of separate but inter-dependent transactions which either
    all must go, or none of them.  e.g. creating a, a/b, a/b/c, a/b/x, a/b/c/d,
    etc etc.  At some point you have to be able to commit so the whole mess
    does not get undone by a crash, and many completed mini-transactions
    (file or directory creates) actually cannot be considered complete until
    their governing parent directories (when creating) or children (when
    deleting) have been committed.  The problem can become very complex.

    Last question here:  You are forward-logging high-level operations.
    You are also going to have to log meta-data (actual B-Tree manipulation)
    commits in order to recover from a crash while making B-Tree
    modifications.  Correct?  So your crash recovery code will have to handle
    both meta-data undo and completed and partially completed transactions.
    And there will have to be a tie-in between the meta-data commits and
    the transactions so you know which ones have to be replayed.  That
    sounds fairly hairy.  Have you figured out how you are going to do that?

:Once I get down to the leaf level I binary search on logical address in
:the case of a file index btree or on version in the case of an inode
:table block, so this cost is still Log(N) with a small k.  For a
:heavily versioned inode or file region this might sometimes result in
:an overflow block or two that has to be linearly searched which is not
:a big problem so long as it is a rare case, which it really ought to be
:for the kinds of filesystem loads I have seen.  A common example of a
:worst case is /var/log/messages, where the mtime and size are going to
:change in pretty much every version, so if you have hourly snapshots
:and hold them for three months it adds up to about 2200 16 byte inode
:table attributes to record, about 8 inode table leaf blocks.  I really
:do not see that as a problem.  If it goes up to 100 leaf blocks to
:search then that could be a problem.
    
    I think you can get away with this as long as you don't have too many
    snapshots, and even if you do I noticed with HAMMER that only a small
    percentage of inodes have a large number of versions associated with
    them from normal production operation.  /var/log/messages
    is an excellent example of that.  Log files were affected the most, though
    I also noticed that very large files wind up with multiple versions
    of the inode, such as when writing out a terabyte-sized file.

    Even with a direct bypass for data blocks (but not their meta-data,
    clearly), HAMMER could only cache so much meta-data in memory before
    it had to finalize the topology and flush the inode out.  A
    terabyte-sized file wound up with about 1000 copies of the inode
    prior to pruning (one had to be written out about every gigabyte or so).

:The penultimate inode table index block tells me how many blocks a
:given inode lives in because several blocks will have the same inum
:key.  So the lookup algorithm for a massively versioned file becomes:
:1) read the first inode table block holding that inum; 2) read the last
:block with the same inum.  The latter operation only needs to consult
:the immediate parent index block, which is locked in the page cache at
:that point.

    How are you dealing with expansion of the logical inode block(s) as
    new versions are added?  I'm assuming you are intending to pack the
    inodes on the media so e.g. a 128-byte inode would only take up
    128 bytes of media space in the best case.  Multiple inodes would be
    laid out next to each other logically (I assume), but since the physical
    blocks are larger they would also have to be laid out next to each
    other physically within any given backing block.  Now what happens
    when one has to be expanded?

    I'm sure this ties into the forward-log but even with the best algorithms
    you are going to hit limited cases where you have to expand the inode.
    Are you just copying the inode block(s) into a new physical allocation
    then?

:It will take a little care and attention to the inode format,
:especially the inode header, to make sure that I can reliably do that
:first/last probe optimization for the "head" version, but it does seem
:worth the effort.  For starters I will just do a mindless linear walk
:of an "overflow" inode and get fancy if that turns out to be a problem.
:

    Makes sense.  I was doing mindless linear walks of the B-Tree element
    arrays for most of HAMMER's implementation until it was stabilized well
    enough that I could switch to a binary search.  And I might change it
    again to start out with a heuristic index estimation.
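
    The in-node search itself is just the usual binary search for the first
    element >= the key.  A simplified sketch (the element layout here is
    made up):

        #include <stdint.h>

        struct node_elm {
            uint64_t key;               /* simplified: one comparable key */
            /* ... element payload ... */
        };

        /*
         * Return the index of the first element whose key is >= the
         * search key, or count if there is none.
         */
        static int
        node_search(const struct node_elm *elms, int count, uint64_t key)
        {
            int lo = 0, hi = count;

            while (lo < hi) {
                int mid = lo + (hi - lo) / 2;

                if (elms[mid].key < key)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return (lo);
        }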

:>     I couldn't get away from having a delete_tid (the 'death version
:>     numbers').  I really tried :-)  There are many cases where one is only
:>     deleting, rather than overwriting.
:
:By far the most common case I would think.  But check out the versioned
:pointer algorithms.  Surprisingly that just works, which is not exactly
:obvious:
:
:   Versioned pointers: a new method of representing snapshots
:   http://lwn.net/Articles/288896/

    Yes, it makes sense.  If the snapshot is explicitly taken then you
    can store direct references and chain, and you wouldn't need a delete
    id in that case.  From that article though the chain looks fairly
    linear.  Historical access could wind up being rather costly.

:I was originally planning to keep all versions of a truncate/rewrite
:file in the same file index, but recently I realized that that is dumb
:because there will never be any file data shared by a successor version
:in that case.  So the thing to do is just create an entirely new
:versioned file data attribute for each rewrite, bulking up the inode
:table entry a little but greatly constraining the search for versions
:to delete and reducing cache pressure by not loading unrelated version
:data when traversing a file.

    When truncating to 0 I would agree with your assessment.  For the 
    topology you are using you would definitely want to use different
    file data sets.  You also have to deal with truncations that are not
    to 0, that might be to the middle of a file.  Certainly not a
    common case, but still a case that has to be coded for.  If you treat
    truncation to 0 as a special case you will be adding considerable
    complexity to the algorithm.

    With HAMMER I chose to keep everything in one B-Tree, whether historical
    or current.  That way one set of algorithms handles both cases and code
    complexity is greatly reduced.  It isn't optimal... large amounts of
    built up history still slow things down (though in a bounded fashion).
    In that regard creating a separate topology for snapshots is a good
    idea.

:>     Both numbers are then needed to 
:>     be able to properly present a historical view of the filesystem.
:>     For example, when one is deleting a directory entry a snapshot after
:>     that deletion must show the directory entry gone.
:
:...which happens in Tux3 courtesy of the fact that the entire block
:containing the dirent will have been versioned, with the new version
:showing the entry gone.  Here is one of two places where I violate my
:vow to avoid copying an entire block when only one data item in it
:changes (the other being the atime table).  I rely on two things to
:make this nice: 1) Most dirent changes will be logically logged and
:only rolled up into the versioned file blocks when there are enough to
:be reasonably sure that each changed directory block will be hit
:numerous times in each rollup episode.  (Storing the directory blocks
:in dirent-create order as in PHTree makes this very likely for mass
:deletes.)  2) When we care about this is usually during a mass delete,
:where most or all dirents in each directory file block are removed
:before moving on to the next block.

    This could wind up being a sticky issue for your implementation.
    I like the concept of using the forward-log but if you actually have
    to do a version copy of the directory at all, you will have to update
    the link count (or some similar count) for all the related inodes to keep
    track of inode visibility, and to determine when the inode can be freed and
    its storage space recovered.

    Directories in HAMMER are just B-Tree elements.  One element per
    directory-entry.  There are no directory blocks.   You may want to
    consider using a similar mechanism.  For one thing, it makes lookups
    utterly trivial... the file name is hashed and a lookup is performed
    based on the hash key, then B-Tree elements with the same hash key
    are iterated until a match is found (usually the first element is the
    match).  Also, when adding or removing directory entries only the
    directory inode's mtime field needs to be updated.  Its size does not.
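
    In rough pseudo-C the lookup boils down to the sketch below; the cursor
    type and the btree_*()/name_hash() helpers are invented for illustration
    and are not the actual HAMMER interfaces:

        #include <stdint.h>
        #include <string.h>

        struct btree_cursor;            /* opaque for this sketch */

        struct dirent_elm {
            uint64_t hash_key;          /* hashed file name; the real key also
                                         * embeds a collision iterator */
            uint64_t obj_id;            /* inode number the entry points at */
            char     name[256];         /* name stored with the element */
        };

        extern uint64_t name_hash(const char *name);
        extern int btree_seek(struct btree_cursor *cur, uint64_t dir_inum,
                              uint64_t hash_key);
        extern struct dirent_elm *btree_elm(struct btree_cursor *cur);
        extern int btree_next(struct btree_cursor *cur);

        struct dirent_elm *
        dirent_lookup(struct btree_cursor *cur, uint64_t dir_inum,
                      const char *name)
        {
            uint64_t key = name_hash(name);
            struct dirent_elm *elm;

            /* position at the first element matching (dir_inum, hash) */
            if (btree_seek(cur, dir_inum, key) != 0)
                return (NULL);

            /* usually the first element matches; iterate on collisions */
            while ((elm = btree_elm(cur)) != NULL && elm->hash_key == key) {
                if (strcmp(elm->name, name) == 0)
                    return (elm);
                if (btree_next(cur) != 0)
                    break;
            }
            return (NULL);
        }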

    Your current directory block method could also represent a problem for
    your directory inodes... adding and removing directory entries causes
    size expansion or contraction, which could require rolling new versions
    of the directory inode to update the size field.  You can't hold too
    much in the forward-log without some serious indexing.  Another
    case to consider along with terabyte-sized files and log files.
    
:>     Both numbers are 
:>     also needed to be able to properly prune out unwanted historical data
:>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
:>     historical data which is no longer desired) creates holes in the
:>     sequence so once you start pruning out unwanted historical data
:>     the delete_tid of a prior record will not match the create_tid of the
:>     following one (historically speaking).
:
:Again, check out the versioned pointer algorithms.  You can tell what
:can be pruned just by consulting the version tree and the create_tids
:(birth versions) for a particular logical address.  Maybe the hang is
:that you do not organize the btrees by logical address (or inum in the
:case of the inode table tree).  I thought you did but have not read
:closely enough to be sure.

    Yes, I see.  Can you explain the versioned pointer algorithm a bit more?
    It looks almost like a linear chain (say when someone is doing a daily
    snapshot).  It looks great for optimal access to the HEAD but it doesn't
    look very optimal if you want to dive into an old snapshot.

    For informational purposes: HAMMER has one B-Tree, organized using
    a strict key comparison.  The key is made up of several fields which
    are compared in priority order:

	localization	- used to localize certain data types together and
			  to separate pseudo filesystems created within
			  the filesystem.
	obj_id		- the object id the record is associated with.
			  Basically the inode number the record is 
			  associated with.

	rec_type	- the type of record, e.g. INODE, DATA, SYMLINK-DATA,
			  DIRECTORY-ENTRY, etc.

	key		- e.g. file offset

	create_tid	- the creation transaction id

    Inodes are grouped together using the localization field so if you
    have a million inodes and are just stat()ing files, the stat
    information is localized relative to other inodes and doesn't have to
    skip file contents or data, resulting in highly localized accesses
    on the storage media.

    Beyond that the B-Tree is organized by inode number and file offset.
    In the case of a directory inode, the 'offset' is the hash key, so
    directory entries are organized by hash key (part of the hash key is
    an iterator to deal with namespace collisions).

    The structure seems to work well for both large and small files, for
    ls -lR (stat()ing everything in sight) style traversals as well as 
    tar-like traversals where the file contents for each file are read.

    The create_tid is the creation transaction id.  Historical accesses are
    always 'ASOF' a particular TID, and will access the highest create_tid
    that is still <= the ASOF TID.  The 'current' version of the filesystem
    uses an asof TID of 0xFFFFFFFFFFFFFFFF and hence accesses the highest
    create_tid.

    There is also a delete_tid which is used to filter out elements that
    were deleted prior to the ASOF TID.  There is currently an issue with
    HAMMER where the fact that the delete_tid is *not* part of the B-Tree
    compare can lead to iterations for strictly deleted elements, versus
    replaced elements which will have a new element with a higher create_tid
    that the B-Tree can seek to directly.  In fact, at one point I tried
    indexing on delete_tid instead of create_tid, but created one hell of a
    mess in that I had to physically *move* B-Tree elements being deleted
    instead of simply marking them as deleted.
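
    Put as a sketch, the ordering and the as-of filtering look roughly like
    this (field names and types are illustrative, not the actual on-media
    structures):

        #include <stdint.h>

        struct btkey {
            uint64_t localization;      /* record-type / pseudo-fs grouping */
            uint64_t obj_id;            /* inode number */
            uint32_t rec_type;          /* INODE, DATA, DIRECTORY-ENTRY, ... */
            uint64_t key;               /* e.g. file offset or name hash */
            uint64_t create_tid;        /* creation transaction id */
        };

        static int
        btkey_cmp(const struct btkey *a, const struct btkey *b)
        {
            if (a->localization != b->localization)
                return (a->localization < b->localization ? -1 : 1);
            if (a->obj_id != b->obj_id)
                return (a->obj_id < b->obj_id ? -1 : 1);
            if (a->rec_type != b->rec_type)
                return (a->rec_type < b->rec_type ? -1 : 1);
            if (a->key != b->key)
                return (a->key < b->key ? -1 : 1);
            if (a->create_tid != b->create_tid)
                return (a->create_tid < b->create_tid ? -1 : 1);
            return (0);
        }

        /*
         * An element is visible as-of a given TID if it was created at or
         * before that TID and not deleted at or before it; delete_tid == 0
         * means still live.  The 'current' view just uses
         * asof = 0xFFFFFFFFFFFFFFFF.
         */
        static int
        elm_visible_asof(uint64_t create_tid, uint64_t delete_tid, uint64_t asof)
        {
            return (create_tid <= asof &&
                    (delete_tid == 0 || delete_tid > asof));
        }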

:Fair enough.  I have an entirely different approach to what you call
:mirroring and what I call delta replication.  (I reserve the term
:mirroring to mean mindless duplication of physical media writes.)  This
:method proved out well in ddsnap:
:
:   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257
:
:(Sorry about the massive URL, you can blame Linus for that;-)
:
:What I do in ddsnap is compute all the blocks that differ between two
:versions and apply those to a remote volume already holding the first
:of the two versions, yielding a replica of the second version that is
:logically but not physically identical.  The same idea works for a
:versioned filesystem: compute all the leaf data that differs between
:two versions, per inum, and apply the resulting delta to the
:corresponding inums in the remote replica.  The main difference vs a
:ddsnap volume delta is that not all of the changes are physical blocks,
:there are also changed inode attributes, so the delta stream format
:has to be elaborated accordingly.

    Are you scanning the entire B-Tree to locate the differences?  It
    sounds like you would have to as a fall-back, but that you could use the
    forward-log to quickly locate differences if the first version is
    fairly recent.

    HAMMER's mirroring basically works like this:  The master has synchronized
    up to transaction id C, the slave has synchronized up to transaction id A.
    The mirroring code does an optimized scan of the B-Tree to supply all
    B-Tree elements that have been modified between transaction A and C.
    Any number of mirroring targets can be synchronized to varying degrees,
    and can be out of date by varying amounts (even years, frankly).

    I decided to use the B-Tree to optimize the scan.  The B-Tree is
    scanned and any element with either a creation or deletion transaction
    id >= the slave's last synchronization point is then serialized and
    piped to the slave.

    To avoid having to scan the entire B-Tree I perform an optimization
    whereby the highest transaction id laid down at a leaf is propagated
    up the B-Tree all the way to the root.  This also occurs if a B-Tree
    node is physically deleted (due to pruning), even if no elements are
    actually present at the leaf within the transaction range.

    Thus the mirroring scan is able to skip any internal node (and its
    entire sub-tree) which has not been modified after the synchronization
    point, and is able to identify any leaf for which real, physical deletions
    have occurred (in addition to logical deletions which simply set the
    delete_tid field in the B-Tree element) and pass along the key range
    and any remaining elements in that leaf for the target to do a
    comparative scan with.

    This allows incremental mirroring, multiple slaves, and also allows
    a mirror to go offline for months and then pop back online again
    and optimally pick up where it left off.  The incremental mirroring
    is important, the last thing I want to do is have to scan 2 billion
    B-Tree elements to do an incremental mirroring batch.
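
    The scan itself reduces to something like the sketch below; the node and
    element layout is invented, only the skip-by-propagated-TID idea reflects
    what HAMMER actually does:

        #include <stdint.h>

        struct melm {
            uint64_t create_tid;
            uint64_t delete_tid;        /* 0 if the element is still live */
        };

        struct mnode {
            int      is_leaf;
            int      count;
            uint64_t max_subtree_tid;   /* highest TID propagated up from below */
            struct mnode *child[64];
            struct melm   elm[64];
        };

        extern void emit_to_slave(const struct melm *e);
        extern void emit_leaf_key_range(const struct mnode *leaf);

        static void
        mirror_scan(const struct mnode *node, uint64_t begin_tid)
        {
            int i;

            /* the whole subtree predates the slave's sync point: skip it */
            if (node->max_subtree_tid < begin_tid)
                return;

            if (node->is_leaf) {
                for (i = 0; i < node->count; ++i) {
                    const struct melm *e = &node->elm[i];

                    if (e->create_tid >= begin_tid ||
                        e->delete_tid >= begin_tid)
                        emit_to_slave(e);
                }
                /*
                 * Physically pruned elements leave nothing behind, so the
                 * leaf's key range goes along as well and the target does
                 * a comparative scan against it.
                 */
                emit_leaf_key_range(node);
            } else {
                for (i = 0; i < node->count; ++i)
                    mirror_scan(node->child[i], begin_tid);
            }
        }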

    The advantage is all the cool features I got by doing things that way.
    The disadvantage is that the highest transaction id must be propagated
    up the tree (though it isn't quite that bad because in HAMMER an entire
    flush group uses the same transaction id, so we aren't constantly
    repropagating new transaction ids up the same B-Tree nodes when
    flushing a particular flush group).

    You may want to consider something similar.  I think using the
    forward-log to optimize incremental mirroring operations is also fine
    as long as you are willing to take the 'hit' of having to scan (though
    not have to transfer) the entire B-Tree if the mirror is too far
    out of date.

:I intend to log insertions and deletions logically, which keeps each
:down to a few bytes until a btree rollup episode comes along to perform
:updating of the btree nodes in bulk.  I am pretty sure this will work
:for you as well, and you might want to check out that forward logging
:trick.

    Yes.  The reason I don't is because while it is really easy to lay down
    a forward-log, integrating it into lookup operations (if you don't
    keep all the references cached in-memory) is really complex code.
    I mean, you can always scan it backwards linearly and that certainly
    is easy to do, but the filesystem's performance would be terrible.
    So you have to index the log somehow to allow lookups on it in
    reverse to occur reasonably optimally.  Have you figured out how you
    are going to do that?

    With HAMMER I have an in-memory cache and the on-disk B-Tree.  Just
    writing the merged lookup code (merging the in-memory cache with the
    on-disk B-Tree for the purposes of doing a lookup) was fairly complex.
    I would hate to have to do a three-way merge.... in-memory cache, on-disk
    log, AND on-disk B-Tree.  Yowzer!
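
    The basic shape of the two-way merge is simple enough; a sketch with
    invented names (the real code also has to deal with deletions, locking,
    and the cache changing underneath the cursor):

        #include <stddef.h>

        struct record;                          /* opaque for this sketch */

        extern struct record *mem_peek(void);   /* next cached record or NULL */
        extern struct record *disk_peek(void);  /* next on-disk record or NULL */
        extern void mem_skip(void);
        extern void disk_skip(void);
        extern int  rec_cmp(const struct record *a, const struct record *b);

        struct record *
        merged_next(void)
        {
            struct record *m = mem_peek();
            struct record *d = disk_peek();

            if (m == NULL && d == NULL)
                return (NULL);
            if (d == NULL || (m != NULL && rec_cmp(m, d) <= 0)) {
                /* an in-memory record shadows an equal-keyed disk record */
                if (m != NULL && d != NULL && rec_cmp(m, d) == 0)
                    disk_skip();
                mem_skip();
                return (m);
            }
            disk_skip();
            return (d);
        }

    Squeezing a log in between turns that into a three-way merge, which is
    where the complexity really piles up.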

:That reminds me, I was concerned about the idea of UNDO records vs
:REDO.  I hope I have this right: you delay acknowledging any write
:transaction to the submitter until log commit has progressed beyond the
:associated UNDO records.  Otherwise, if you acknowledge, crash, and
:prune all UNDO changes, then you would end up with applications
:believing that they had got things onto stable storage and be wrong
:about that.  I have no doubt you did the right thing there, but it is
:not obvious from your design documentation.

    The way it works is that HAMMER's frontend is almost entirely
    disconnected from the backend.  All meta-data operations are cached
    in-memory.  create, delete, append, truncate, rename, write...
    you name it.  Nothing the frontend does modifies any meta-data
    or meta-data buffers.

    The only synchronization point is fsync(), the filesystem syncer, and
    of course if too much in-memory cache is built up.  To improve
    performance, raw data blocks are not included... space for raw data
    writes is reserved by the frontend (without modifying the storage
    allocation layer) and those data buffers are written to disk by the
    frontend directly, just without any meta-data on-disk to reference
    them so who cares if you crash then!  It would be as if those blocks
    were never allocated in the first place.

    When the backend decides to flush the cached meta-data ops it breaks
    the meta-data ops into flush-groups whose dirty meta-data fits in the
    system's buffer cache.  The meta-data ops are executed, building the
    UNDO, updating the B-Tree, allocating or finalizing the storage, and
    modifying the meta-data buffers in the buffer cache.  BUT the dirty
    meta-data buffers are locked into memory and NOT yet flushed to
    the media.  The UNDO *is* flushed to the media, so the flush groups
    can build a lot of UNDO up and flush it as they go if necessary.

    When the flush group has completed any remaining UNDO is flushed to the
    media, we wait for I/O to complete, the volume header's UNDO FIFO index
    is updated and written out, we wait for THAT I/O to complete, *then*
    the dirty meta-data buffers are flushed to the media.  The flushes at
    the end are done asynchronously (unless fsync()ing) and can overlap with
    the flushes done at the beginning of the next flush group.  So there
    are exactly two physical synchronization points for each flush.

    If a crash occurs at any point, upon remounting after a reboot
    HAMMER only needs to run the UNDOs to undo any partially committed
    meta-data.

    That's it.  It is fairly straightforward.
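
    Reduced to pseudo-C the ordering looks about like this; the function
    names are made up, the two wait points are what matter:

        /* build UNDO, update the B-Tree and the locked, dirty meta-data
         * buffers in the buffer cache -- nothing goes to the media yet */
        extern void run_flush_group_meta_ops(void);
        extern void flush_remaining_undo(void);
        extern void wait_for_io(void);
        extern void write_volume_header_undo_index(void);
        extern void flush_dirty_meta_buffers_async(void);

        static void
        flush_group_commit(void)
        {
            run_flush_group_meta_ops();

            flush_remaining_undo();
            wait_for_io();              /* sync point #1: UNDO on the media */

            write_volume_header_undo_index();
            wait_for_io();              /* sync point #2: header on the media */

            /*
             * Only now do the dirty meta-data buffers go out, and they can
             * go out asynchronously; a crash at any earlier point is backed
             * out by running the UNDO FIFO at mount time.
             */
            flush_dirty_meta_buffers_async();
        }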

    For your forward-log approach you would have to log the operations 
    as they occur, which is fine since that can be cached in-memory. 
    However, you will still need to synchronize the log entries with
    a volume header update to update the log's index, so the recovery
    code knows how far the log extends.  Plus you would also have to log
    UNDO records when making actual changes to the permanent media
    structures (your B-Trees), which is independent of the forward-log
    entries you made representing high level filesystem operations, and
    would also have to lock the related meta-data in memory until the
    related log entries can be synchronized.  Then you would be able
    to flush the meta-data buffers.

    The forward-log approach is definitely more fine-grained, particularly
    for fsync() operations... those would go much faster using a forward
    log than the mechanism I use because only the forward-log would have
    to be synchronized (not the meta-data changes) to 'commit' the work.
    I like that, it would be a definite advantage for database operations.

:>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
:>     another limiting factor for the fan-out I can use.  My elements
:>     are 64 bytes each.
:
:Yes, I mostly have 16 byte elements and am working on getting most of
:them down to 12 or 8.

    I don't know how you can make them that small.  I spent months just
    getting my elements down to 64 bytes.  The data reference alone for
    data blocks is 12 bytes (64 bit media storage offset and 32 bit length).

:>     64x64 = 4K per B-Tree node.  I decided to go 
:>     with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
:>     64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
:>     other auxiliary info.  That's instead of using a radix-compressed key.
:>     Radix compressed keys would have doubled the complexity of the B-Tree
:>     code, particularly with the middle-of-tree pointer caching that
:>     HAMMER does.
:
:I use two-stage lookup, or four stage if you count searches within
:btree blocks.  This makes the search depth smaller in the case of small
:files, and in the case of really huge files it adds depth exactly as
:appropriate.  The index blocks end up cached in the page cache (the
:buffer cache is just a flavor of page cache in recent Linux) so there
:is little repeated descending through the btree indices.  Instead, most
:of the probing is through the scarily fast radix tree code, which has
:been lovingly optimized over the years to avoid cache line misses and
:SMP lock contention.
:
:I also received a proposal for a "fat" btree index from a collaborator
:(Maciej) that included the file offset but I really did not like the...
:fattening.  A two level btree yields less metadata overall which in my
:estimation is more important than saving some btree probes.  The way
:things work these days, falling down from level 1 cache to level 2 or
:from level 2 to level 3 costs much more than executing some extra CPU
:instructions.  So optimization strategy inexorably shifts away from
:minimizing instructions towards minimizing cache misses.

    The depth and complexity of your master B-Tree will definitely
    be smaller.  You are offloading both the file contents and snapshots.
    HAMMER incorporates both into a single global B-Tree.  This has been
    somewhat of a mixed bag for HAMMER.  On the plus side performance is
    very good for production operations (where the same filename is not
    being created over and over and over and over again and where the same
    file is not truncated over and over and over again).... locality of
    reference is extremely good in a single global B-Tree when accessing
    lots of small files or when accessing fewer larger files.  On the 
    negative side, performance drops noticeably if a lot of historical
    information (aka snapshots) has built up in the data set being
    accessed.  Even though the lookups themselves are extremely efficient,
    the access footprint (the size of the data set the system must cache)
    of the B-Tree becomes larger, sometimes much larger.

    I think the cpu overhead is going to be a bit worse than you are
    contemplating, but your access footprint (with regards to system memory
    use caching the meta-data) for non-historical accesses will be better.

    Are you going to use a B-Tree for the per-file layer or a blockmap?  And
    how are you going to optimize the storage for small files?  Are you
    just going to leave them in the log and not push a B-Tree for them?

:>     The historical access mechanic alone added major complexity to the
:>     B-Tree algorithms.  I will note here that B-Tree searches have a very
:>     high cpu overhead no matter how you twist it, and being able to cache
:>     cursors within the B-Tree is absolutely required if you want the
:>     filesystem to perform well.  If you always start searches at the root
:>     your cpu overhead will be horrendous... so plan on caching cursors
:>     from the get-go.
:
:Fortunately, we get that for free on Linux, courtesy of the page cache
:radix trees :-)
:
:I might eventually add some explicit cursor caching, but various
:artists over the years have noticed that it does not make as much
:difference as you might think.

     For uncacheable data sets the cpu overhead is almost irrelevant.

     But for cached data sets, watch out!  The cpu overhead of your B-Tree
     lookups is going to be 50% of your cpu time, with the other 50% being
     the memory copy or memory mapping operation and buffer cache operations.
     It is really horrendous.  When someone is read()ing or write()ing a
     large file the last thing you want to do is traverse the same 4 nodes
     and do a binary search in each of those nodes for every read().

     For large fully cached data sets not caching B-Tree cursors will
     strip away 20-30% of your performance once your B-Tree depth exceeds
     4 or 5.  Also, the fan-out does not necessarily help there because
     the search within the B-Tree node costs almost as much as moving
     between B-Tree nodes.

     I found this out when I started comparing HAMMER performance with
     UFS.  For the fully cached case UFS was 30% faster until I started
     caching B-Tree cursors.  It was definitely noticeable once my B-Tree
     grew past a million elements or so.  It disappeared completely when
     I started caching cursors into the B-Tree.
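
     Most of the win comes from a short-circuit as simple as the sketch
     below (names invented, this is not the actual HAMMER cursor code):

        #include <stddef.h>
        #include <stdint.h>

        struct leaf;                    /* opaque for this sketch */

        struct cached_cursor {
            struct leaf *leaf;          /* leaf the last lookup ended at */
            uint64_t     lo, hi;        /* key range that leaf covers */
        };

        /* full root-to-leaf descent, returning the leaf's key range */
        extern struct leaf *btree_descend(uint64_t key, uint64_t *lo,
                                          uint64_t *hi);

        static struct leaf *
        cursor_lookup(struct cached_cursor *cur, uint64_t key)
        {
            /* hot path for sequential read()/write(): no descent at all */
            if (cur->leaf != NULL && key >= cur->lo && key < cur->hi)
                return (cur->leaf);

            cur->leaf = btree_descend(key, &cur->lo, &cur->hi);
            return (cur->leaf);
        }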

:>     If I were to do radix compression I would also want to go with a
:>     fully dynamic element size and fully dynamic fan-out in order to
:>     best-compress each B-Tree node.  Definitely food for thought.
:
:Indeed.  But Linux is braindamaged about large block size, so there is
:very strong motivation to stay within physical page size for the
:immediate future.  Perhaps if I get around to a certain hack that has
:been perennially delayed, that situation will improve:
:
:   "Variable sized page objects"
:   http://lwn.net/Articles/37795/

    I think it will be an issue for people trying to port HAMMER.  I'm trying
    to think of ways to deal with it.  Anyone doing an initial port can 
    just drop all the blocks down to 16K, removing the problem but
    increasing the overhead when working with large files.

:>     I'd love to do something like that.  I think radix compression would
:>     remove much of the topological bloat the B-Tree creates versus using
:>     blockmaps, generally speaking.
:
:Topological bloat?

    Bytes per record.  e.g. the cost of creating a small file in HAMMER
    is 3 B-Tree records (directory entry + inode record + one data record),
    plus the inode data, plus the file data.  For HAMMER that is 64*3 +
    128 + 112 (say the file is 100 bytes long, round up to a 16 byte
    boundary)... so that is 432 bytes.

    The bigger cost is when creating and managing a large file.  A 1 gigabyte
    file in HAMMER requires 1G/65536 = 16384 B-Tree elements, which comes
    to 1 megabyte of meta-data.  If I were to radix-compress those 
    elements the meta-data overhead would probably be cut to 300KB,
    possibly even less.

    Where this matters is that it directly affects the meta-data footprint
    in the system caches, which in turn directly affects the filesystem's
    ability to cache information without having to go to disk.  It can
    be a big deal.

:Ah, that is a very nice benefit of Tux3-style forward logging I forgot
:to mention in the original post: transaction size is limited only by
:available free space on the volume, because log commit blocks can be
:anywhere.  Last night I dreamed up a log commit block variant where the
:associated transaction data blocks can be physically discontiguous, to
:make this assertion come true, shortly after reading Stephen Tweedie's
:horror stories about what you have to do if you must work around not
:being able to put an entire, consistent transaction onto stable media
:in one go:

    Yes, that's an advantage, and one of the reasons why HAMMER is designed
    for large filesystems and not small ones.  Without a forward-log
    HAMMER has to reserve more media space for sequencing its flushes.

    Adding a forward log to HAMMER is possible, I might do it just for
    quick write()/fsync() style operations.  I am still very wary of the
    added code complexity.

:   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
:
:Most of the nasties he mentions just vanish if:
:
:  a) File data is always committed atomically
:  b) All commit transactions are full transactions

    Yup, which you get for free with your forward-log.

:He did remind me (via time travel from year 2000) of some details I
:should write into the design explicitly, for example, logging orphan
:inodes that are unlinked while open, so they can be deleted on replay
:after a crash.  Another nice application of forward logging, which
:avoids the seek-happy linked list through the inode table that Ext3
:does.

    Orphan inodes in HAMMER will always be committed to disk with a
    0 link count.  The pruner deals with them after a crash.  Orphan
    inodes can also be committed due to directory entry dependencies.

:>     On the flip side, the reblocker doesn't just de-fragment the filesystem,
:>     it is also intended to be used for expanding and contracting partitions,
:>     and adding removing volumes in the next release.  Definitely a
:>     multi-purpose utility.
:
:Good point.  I expect Tux3 will eventually have a reblocker (aka
:defragger).  There are some really nice things you can do, like:
:
:  1) Set a new version so logical changes cease for the parent
:     version.
:
:  2) We want to bud off a given directory tree into its own volume,
:     so start by deleting the subtree in the current version.  If
:     any link counts in the directory tree remain nonzero in the
:     current version then there are hard links into the subtree, so
:     fail now and drop the new version.
:
:  3) Reblock a given region of the inode table tree and all the files
:     in it into one physically contiguous region of blocks
:
:  4) Add a free tree and other once-per volume structures to the new
:     home region.
:
:  5) The new region is now entirely self contained and even keeps its
:     version history.  At the volume manager level, map it to a new,
:     sparse volume that just has a superblock in the zeroth extent and
:     the new, mini filesystem at some higher logical address.  Remap
:     the region in the original volume to empty space and add the
:     empty space to the free tree.
:
:  6) Optionally reblock the newly budded filesystem to the base of the
:     new volume so utilities that do not play well with sparse
:     volumes do not do silly things.

    Yes, I see.  The budding operation is very interesting to me...
    well, the ability to fork the filesystem and effectively write to
    a snapshot.  I'd love to be able to do that; it is a far superior
    mechanism to taking a snapshot and then performing a rollback later.

    Hmm.  I could definitely manage more than one B-Tree if I wanted to.
    That might be the ticket for HAMMER... use the existing snapshot
    mechanic as backing store and a separate B-Tree to hold all changes
    made to the snapshot, then do a merged lookup between the new B-Tree
    and the old B-Tree.  That would indeed work.
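
    A sketch of what that merged point lookup could look like (types and
    helpers invented for illustration):

        #include <stddef.h>

        struct btree;
        struct hrecord;
        struct lookup_key;

        extern struct hrecord *btree_lookup(struct btree *tree,
                                            const struct lookup_key *key);

        struct hrecord *
        fork_lookup(struct btree *new_tree, struct btree *snap_tree,
                    const struct lookup_key *key)
        {
            struct hrecord *r;

            /* changes made after the fork live only in the new B-Tree */
            r = btree_lookup(new_tree, key);
            if (r != NULL)
                return (r);

            /*
             * Fall through to the frozen snapshot B-Tree.  A real
             * implementation would also need whiteout records so that
             * deletions made in the fork can hide snapshot elements.
             */
            return (btree_lookup(snap_tree, key));
        }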

:>     So I'm not actually being cavalier about it, its just that I had to
:>     make some major decisions on the algorithm design and I decided to
:>     weight the design more towards performance and large-storage, and
:>     small-storage suffered a bit.
:
:Cavalier was a poor choice of words, the post was full of typos as well
:so I am not proud of it.  You are solving a somewhat different problem
:and you have code out now, which is a huge achievement.  Still I think
:you can iteratively improve your design using some of the techniques I
:have stumbled upon.  There are probably some nice tricks I can get from
:your code base too once I delve into it.

    Meh, you should see my documents when I post them without taking 10
    editorial passes.  Characters reversed, typos, sentences which make
    no sense, etc.

:>     In anycase, it sounds like Tux3 is using many similar ideas.  I think
:>     you are on the right track.
:
:Thank you very much.  I think you are on the right track too, which you
:have a rather concrete way of proving.
:
:....
:>     rather then have to go through another 4 months of rewriting.  All
:>     major designs are an exercise in making trade-offs in topology, feature
:>     set, algorithmic complexity, debuggability, robustness, etc.  The list
:>     goes on forever.
:> 
:
:The big ahas! that eliminated much of the complexity in the Tux3 design
:were:
:
:  * Forward logging - just say no to incomplete transactions
:  * Version weaving - just say no to recursive copy on write
:
:Essentially I have been designing Tux3 for ten years now and working
:seriously on the simplifying elements for the last three years or so,
:either entirely on paper or in related work like ddsnap and LVM3.
:
:One of the nice things about moving on from design to implementation of
:Tux3 is that I can now background the LVM3 design process and see what
:Tux3 really wants from it.  I am determined to match every checkbox
:volume management feature of ZFS as efficiently or more efficiently,
:without violating the traditional layering between filesystem and
:block device, and without making LVM3 a Tux3-private invention.
:
:>     Laters!

    I can tell you've been thinking about Tux for a long time.  If I
    had one worry about your proposed implementation it would be in the
    area of algorithmic complexity.  You have to deal with the in-memory 
    cache, the log, the B-Tree, plus secondary indexing for snapshotted
    elements and a ton of special cases all over the place.  Your general
    lookup code is going to be very, very complex.

    My original design for HAMMER was a lot more complex (if you can
    believe it!) than the end result.  A good chunk of what I had to do
    going from concept to reality was deflate a lot of that complexity.
    When it got down to brass tacks I couldn't stick with using the 
    delete_tid as a B-Tree search key field, I had to use create_tid.  I
    couldn't use my fine-grained storage model concept because the
    performance was terrible (too many random writes interfering with
    streaming I/O).  I couldn't use a hybrid B-Tree, where a B-Tree element
    could hold the base of an entirely new B-Tree (it complicated the pruning
    and reblocking code so much that I just gave up trying to do it in
    frustration).  I couldn't implement file extents other than 16K and 64K
    blocks (it really complicated historical lookups and the buffer cache
    couldn't handle it) <--- that one really annoyed me.  I had overly
    optimized the storage model to try to get block pointers down to 32 bits
    by localizing B-Tree elements in the same 'super clusters' as the data
    they referenced.  It was a disaster and I ripped it out.  The list goes
    on :-)

    I do wish we had something like LVM on BSD systems.  You guys are very
    lucky in that regard.  LVM is really nice.

    BTW it took all day to write this!

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

:Hopefully not too much later.  I find this dialog very fruitful.  I just
:wish such dialog would occur more often at the design/development stage
:in Linux and other open source work instead of each group obsessively
:ignoring all "competing" designs and putting huge energy into chatting
:away about the numerous bugs that arise from rushing their design or
:ignoring the teachings of history.
:
:Regards,
:
:Daniel





