[Tux3] Comparison to Hammer fs design

Daniel Phillips phillips at phunq.net
Sun Jul 27 18:57:24 PDT 2008


On Sunday 27 July 2008 14:31, Matthew Dillon wrote:
> :When a forward commit block is actually written it contains a sequence
> :number and a hash of its transaction in order to know whether the
> :...
> :Note: I am well aware that a debate will ensue about whether there is
> :any such thing as "acceptable risk" in relying on a hash to know if a
> :commit has completed.  This occurred in the case of Graydon Hoare's
> :Monotone version control system and continues to this day, but the fact
> :is, the cool modern version control systems such as Git and Mercurial
> :now rely very successfully on such hashes.  Nonetheless, the debate
> :will keep going, possibly as FUD from parties who just plain want to
> :use some other filesystem for their own reasons.  To quell that
> :definitively I need a mount option that avoids all such commit risk,
> :perhaps by providing modest sized journal areas salted throughout the
> :volume whose sole purpose is to record log commit blocks, which then
> :are not forward.  Only slightly less efficient than forward logging
> :and better than journalling, which has to seek far away to the journal
> :and has to provide journal space for the biggest possible journal
> :transaction as opposed to the most commit blocks needed for the largest
> :possible VFS transaction (probably one).
> 
>     Well, you've described both sides of the debate quite well.  I for
>     one am in the 0-risk camp.  ReiserFS got into trouble depending on
>     a collision space (albeit a small one).  I did extensive testing of
>     64 bit collision spaces when I wrote Diablo (A USENET news system
>     contemporary with INN, over 10 years ago) and even with a 64 bit space
>     collisions could occur quite often simply due to the volume of article
>     traffic.
>
>     I think the cost can be reduced to the point where there's no need
>     to allow any risk at all.  After all, we're only talking about meta-data
>     updates here for the most part.  Even under heavy loads it can take
>     30 seconds for enough dirty meta-data to build up to require a flush.
>     One can flush the mess, then seek/update the volume header.

So I do not want users to fixate on that detail.  The mount option
allows them to choose between "fast but theoretically riskier" and
"warm n fuzzy zero risk but not quite so fast".  If the example of Ext3
is anything to go by, almost everybody chooses the "ordered data" mode
over the "journal data" mode given the tradeoff that the latter is
about 30% slower but offers better data integrity for random file
rewrites.

The tradeoff for the option in Tux3 will be maybe 1 - 10% slower in
return for a minuscule reduction in the risk of a false positive on
replay.  Along the lines of deciding to live underground to avoid the
risk of being hit by a meteorite.  Anyway, Tux3 will offer the option
and everybody will be happy.

Actually implementing the option is pretty easy because the behavior
variation is well localized.
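
To make that concrete, here is a minimal sketch of the replay-side check
for the "fast" flavor.  Every name is invented for illustration, and the
weak stand-in checksum marks where the real commit would use a strong hash:

   #include <stddef.h>
   #include <stdint.h>

   struct commit_block {
           uint64_t sequence;       /* position in the forward log chain */
           uint64_t body_hash;      /* hash of the transaction body blocks */
           uint32_t body_blocks;    /* number of body blocks covered */
           /* ... log records ... */
   };

   /* Stand-in hash (FNV-1a); the real commit would use something stronger. */
   static uint64_t hash64(const void *data, size_t len)
   {
           const uint8_t *p = data;
           uint64_t h = 0xcbf29ce484222325ULL;
           while (len--) {
                   h ^= *p++;
                   h *= 0x100000001b3ULL;
           }
           return h;
   }

   /* On replay, a forward commit counts as complete only if the body it
      describes actually landed on disk and hashes to the recorded value. */
   static int commit_is_complete(const struct commit_block *cb,
                                 const void *body, size_t body_len)
   {
           return hash64(body, body_len) == cb->body_hash;
   }

The zero-risk variant just writes the commit block into one of the small
preallocated log areas salted around the volume instead, so replay never
has to trust a hash.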

>     Or doing something like using a fixed LOG area, allowing it to be
>     preformatted... that would remove 100% of the risk and not require
>     a volume header update (drafting note: edited up from further below).
> 
>     I know volume headers seem old-school.  It was a quick and dirty
>     solution in HAMMER that I will be changing.
> 
> :Actually, the btree node images are kept fully up to date in the page
> :cache which is the only way the high level filesystem code accesses
> :them.  They do not reflect exactly what is on the disk, but they do
> :reflect exactly what would be on disk if all the logs were fully
> :rolled up ("replay").
> 
>     Ok, this is simpler.  Hmm.  That brings up the question with regards
>     to data allocations and the log.  Are you going to use a fixed log
>     space and allocate data-store separately or is the log a continuously
>     running pointer across the disk?
> 
>     My understanding is it is a continuously running space.

Yes, a continuous running space, not preallocated.  The forward log
will insert itself into any free blocks that happen to be near the
transaction goal location.  Such co-opted free space will be
implicitly unavailable for block allocation, slightly complicating the
block allocation code which has to take into consideration both the
free space map and all the outstanding log transactions.
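
A hedged sketch of that extra check, with made-up structures; the real
allocator will look different, this just shows the two consultations:

   #include <stdbool.h>
   #include <stdint.h>

   typedef uint64_t block_t;

   /* One bit per block in the on-disk free map image; set means in use. */
   static bool block_in_use(const uint8_t *usedmap, block_t b)
   {
           return usedmap[b >> 3] & (1 << (b & 7));
   }

   struct log_span { block_t start; unsigned count; };  /* in-flight log blocks */

   static bool claimed_by_log(const struct log_span *spans, unsigned n, block_t b)
   {
           for (unsigned i = 0; i < n; i++)
                   if (b >= spans[i].start && b < spans[i].start + spans[i].count)
                           return true;
           return false;
   }

   /* Allocatable only if free on the disk image *and* not co-opted by an
      outstanding forward log transaction. */
   static bool block_is_allocatable(const uint8_t *usedmap,
                                    const struct log_span *spans, unsigned n,
                                    block_t b)
   {
           return !block_in_use(usedmap, b) && !claimed_by_log(spans, n, b);
   }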

> :A forward log that carries the edits to some dirty cache block pins
> :that dirty block in memory and must be rolled up into a physical log
> :before the cache block can be flushed to disk.  Fortunately, such a
> :rollup requires only a predictable amount of memory: space to load
> :enough of the free tree to allocate space for the rollup log, enough
> :...
> 
>     Ok, here I spent about 30 minutes constructing a followup but then
>     you answered some of the points later on, so what I am going to do
>     is roll it up into a followup on one of your later points :-)
> 
> :One traditional nasty case that becomes really nice with logical
> :forward logging is truncate of a gigantic file.  We just need to
> :commit a logical update like ['resize', inum, 0] then the inode data
> :truncate can proceed as convenient.  Another is orphan inode handling
> :where an open file has been completely unlinked, in which case we
> :log the logical change ['free', inum] then proceed with the actual
> :delete when the file is closed or when the log is replayed after a
> :surprise reboot.
> :...
> :Logical log replay is not idempotent, so special care has to be taken
> :on replay to ensure that specified changes have not already been
> 
>     Wait, it isn't?  I thought it was.   I think it has to be because
>     the related physical B-Tree modifications required can be unbounded,

Logical logging is not idempotent by nature because of the uncertainty
of the state of the object edited by a logical operation: has the
operation already been applied or not?  If you know something about the
structure of the target you can usually tell.  Suppose the operation is
a dirent create.  It has been applied to the target if the dirent
already exists, otherwise not.  I do not like this style of special
case hacking to force the logical edit to be idempotent.

Instead, I choose to be sure about the state of the target object by
introducing the rule that after a logical log operation has been
generated, nothing is allowed to write to the object being edited.  The
logical operation pins the state of the disk image of target object.
The object stays pinned until a physical commit updates the object's
disk image and retires the logical edit in one atomic log transaction.
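
A minimal sketch of that pinning rule, using invented structure and
function names:

   /* One pin per logical log entry that edits this cached block; the
      block may not be written back until every such entry is retired. */
   struct cached_block {
           int pin_count;
           int dirty;
           /* ... block number, data, hash links ... */
   };

   static void log_logical_edit(struct cached_block *target)
   {
           target->pin_count++;     /* disk image of target is now frozen */
   }

   static int may_write_back(const struct cached_block *b)
   {
           return b->dirty && b->pin_count == 0;
   }

   /* Called when a rollup commit has updated the block on disk and retired
      the covering logical entry in one atomic log transaction. */
   static void retire_logical_edit(struct cached_block *target)
   {
           target->pin_count--;
   }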

I was working on a specific code example of this with pseudocode and
all disk transfers enumerated, but then your email arrived.  Well, after
this response the example will probably be better.

>     and because physical B-Tree modifications are occuring in parallel the
>     related physical operations cause the logical operations to become
>     bound together, meaning the logical ops *cannot* be independantly
>     backed out

Tux3 never backs anything out, so there is some skew to clear up.

The logical and physical log operations are indeed bound together at
both the disk image and the processor level, but not in the way that
you might expect.  The primary purpose of rolling up logical log
entries into physical updates is to control resource consumption;
the secondary purpose is to reduce the amount of replay required to
reconstruct "current" memory images of the btree blocks on reboot or
remount.

In the first case, resource exhaustion, a high level vfs transaction
may have to wait for a rollup to finish, cleaning a number of dirty
buffer cache blocks and releasing resources needed for the rollup,
such as bio structs and lists of transaction components.

Such blocking is nominally enforced by the VMM by sending the
requesting process off to scan for and try to free up some dirty
memory (a horribly bad design idea in Linux that causes recursive
calls into the filesystem, but that is the way it works) during which
time the process is effectively blocked.  Tux3 will add a more
responsible level of resource provisioning by limiting the number of
transactions that can be in flight, much as Ext3 does.  This allows
the filesystem to fulfill a promise like "never will use up all of
kernel reserve memory after having been given privileged access to
it in order to clean and recover some dirty cache blocks".
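
The throttle itself can be as simple as a counting limit.  A sketch with
invented names, written userspace-style with a POSIX semaphore standing
in for whatever primitive the kernel code ends up using:

   #include <semaphore.h>

   #define MAX_TRANSACTIONS_IN_FLIGHT 128

   static sem_t transaction_slots;

   static void transaction_throttle_init(void)
   {
           sem_init(&transaction_slots, 0, MAX_TRANSACTIONS_IN_FLIGHT);
   }

   static void transaction_begin(void)
   {
           sem_wait(&transaction_slots);   /* block if too many in flight */
   }

   static void transaction_end(void)
   {
           sem_post(&transaction_slots);   /* a completed commit frees a slot */
   }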

In the second case, keeping the replay path short, high level
operations can proceed in parallel with physical updates, because
a transaction is _logically_ completed as soon as committing the
associated logical edits to disk has completed.

So high level operations block on logical commit completions, not on
physical commit completions except in the case of resource exhaustion.

>     Here's an example:
> 
>     rm a/b/c	<---- 1TB file
>     rmdir a/b
>     rmdir a
> 
>     There are three problems here.
> 
>     First, the logical log must be idempotent, or at least mostly
>     idempotent, because the physical portions of the three separate
>     logical transactions will complete at different times.

If the user writes:

  rm a/b/c&
  rm a/b&
  rm a&

Then they deserve what they get, which is a probable complaint that a
directory does not exist.  If the operations happen to execute in a
favorable order then they will all succeed.  With Tux3 this is possible
because destroying the 1TB file can proceed asynchronously while the VFS
transaction completes and returns as soon as a commit containing

   ['unlink', inum_a/b/c, dnum_a/b]
   ['destroy', inum_a/b/c]

has been logged.  So deleting the 1TB file can be very fast, although
it may be some time before all the occupied disk space becomes
available for reuse.
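
Just to illustrate, those two records could be encoded about this
compactly; the field widths and opcode values here are invented, not the
actual Tux3 log format:

   #include <stdint.h>

   enum { LOG_UNLINK = 1, LOG_DESTROY = 2 };

   struct log_unlink {              /* ['unlink', inum, dir_inum] */
           uint8_t  opcode;
           uint64_t inum;
           uint64_t dir_inum;
           /* presumably the dirent name or its hash rides along too */
   } __attribute__((packed));

   struct log_destroy {             /* ['destroy', inum] */
           uint8_t  opcode;
           uint64_t inum;
   } __attribute__((packed));

so a whole VFS transaction's worth of such records fits easily into one
commit block.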

To make the deletion visible to high level filesystem operations, some
cached disk blocks have to be updated.  For a huge deletion like this
it makes sense to invent a versioned "orphan" attribute, which is added
to the inode table leaf, meaning that any data lookups arriving from
still-open files should ignore the file index tree entirely.

>     Second, the logical log entry for "rm a/b/c" cannot be destroyed
>     (due to log cycling based on available free space) until after the
>     related physical operation has completed, which could occur seconds
>     to minutes later.

True, but that only pins one commit block on disk.

>     The second issue could become a big issue for you because it means
>     the log space required will become unbounded or nearly unbounded.
>     If the remove requires 10's to 100's of megabytes of physical log
>     entries, all appending to the log, and the physical blocks backing
>     the log cannot be recovered due to that 'rm a/b/c' needing to remain
>     in the log until the physical operation is complete, then I think you
>     have a problem.  You would probably need to special-case it.

I do not think I have that problem because recovering the space used
by the big file can proceed incrementally: free a few blocks from the
inode index leaves; commit the revised file index blocks and free tree
blocks; repeat as necessary.  After a crash, find the logical inode
destroy log entry and resume the destroy.
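
Roughly like this, where every name is a placeholder and the two helpers
stand for the real index and commit machinery:

   #include <stdbool.h>
   #include <stdint.h>

   /* Free up to max_blocks of the file's index; returns false when the
      file index tree is empty. */
   bool free_some_file_blocks(uint64_t inum, unsigned max_blocks);

   /* Commit the revised file index and free tree blocks. */
   void commit_progress(void);

   /* Resumable from the logged ['destroy', inum] record after a crash:
      each pass does a bounded amount of work, then commits. */
   void resume_destroy(uint64_t inum)
   {
           while (free_some_file_blocks(inum, 1024))
                   commit_progress();
           commit_progress();       /* final commit also retires the destroy record */
   }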

>     A third issue effecting the amount of free space required to complete
>     an operation is going to be the worst case physical log space required
>     for a single low-level B-Tree transaction, which will roughly be 
>     related to the depth of the B-Tree, multiplied by the number of modifying
>     B-Tree operations being done in parallel (e.g. from many individual
>     processes).

Yes indeed.  A matter of totalling all that up, which is fortunately
bounded to a nice low number.  Then in grand Linux tradition, ignore
the totals and just rely on there being "megabytes" of reserve memory
to handle the worst case.  Obviously a bad idea, but that is how we
have done things for many years.

> :...
> :ugly or unreliable when there are lot of stacked changes.  Instead I
> :introduce the rule that a logical change can only be applied to a known
> :good version of the target object, which promise is fullfilled via the
> :physical logging layer.
> :...
> :then apply the logical changes to it there.  Where interdependencies
> :exist between updates, for example the free tree should be updated to
> :reflect a block freed by merging two btree nodes, the entire collection
> :of logical and physical changes has to be replayed in topologically
> :sorted order, the details of which I have not thought much about other
> :than to notice it is always possible.
> 
>     I came up with a really simple solution to the freemap-reuse issue
>     which you could apply to allow such replays to be idempotent.  You
>     simply do not allow a freed block to be reallocated until after all
>     the information related to the logical operation that freed it has
>     been solidly committed.  Thus you can replay the low level frees
>     without accidently re-freeing a block that has been reallocated.
> 
>     In HAMMER I delay reuse of freed blocks because UNDO's are only for
>     meta-data, not file data, and thus freed data blocks cannot be reused
>     until after the related freeing operations has been completely
>     committed so crash recovery can do the rollback.  You could use them
>     to make otherwise non-idempotent operations idempotent.

It sounds like a good idea, I will ponder.

> :When replay is completed, we have a number of dirty cache blocks which
> :are identical to the unflushed cache blocks at the time of a crash,
> :and we have not yet flushed any of those to disk.  (I suppose this gets
> :interesting and probably does need some paranoid flushing logic in
> :replay to handle the bizarre case where a user replays on a smaller
> :memory configuration than they crashed on.)  The thing is, replay
> :returns the filesystem to the logical state it was in when the crash
> :happened.  This is a detail that journalling filesystem authors tend
> :to overlook: actually flushing out the result of the replay is
> :pointless and only obscures the essential logic.  Think about a crash
> :during replay, what good has the flush done?
> 
>     Well, you have to flush at some point anyway because you only have
>     so much kernel memory to play with.... You do not have enough kernel
>     memory to cache the dirty buffers for all the physical activity related
>     to a particular logical operation (such as truncating a 1TB file).

I hope that question is cleared up now.

> :I do not see why this example cannot be logically logged in pieces:
> :
> :   ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
> :   ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
> :   ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
> :   ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
> :   ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]
> :
> :Logical updates on one line are in the same logical commit.  Logical
> :allocations of blocks to record the possibly split btree leaves and new
> :allocations omitted for clarity.  The omitted logical updates are
> :bounded by the depth of the btrees.  To keep things simple, the logical
> :log format should be such that it is impossible to overflow one commit
> :block with the updates required to represent a single vfs level
> :transaction.
> 
>     Hmm.  Maybe.  This helps you identify dependancies that must be
>     replayed, but it might be easier just to make the entire logical
>     log idempotent and just replay everything.  I think the only gotcha
>     there is handling link counts.  That would still be far less complex
>     then the alternative.

The presence of a "link" record in the logical log implies a link count
increment on top of whatever is recorded in the on-disk inode table leaf.
On the other hand, the "current" image of the inode table leaf in the
buffer cache has the correct link count when the sys_link returns.

> :transactions are not just throwaway things, they are the actual new
> :data.  Only the commit block is discarded, which I suppose will leave
> :a lot of one block holes around the volume, but then I do not have to
> :require that the commit block be immediately adjacent to the body of
> :the transaction, which will allow me to get good value out of such
> :holes.  On modern rotating media, strictly linear transfers are not
> :that much more efficient than several discontiguous transfers that all
> :land relatively close to each other.
> 
>     True enough.  Those holes will create significant fragmentation
>     once you cycle through available space on the media, though.

It depends on how high the density of such holes is.  I will try to keep
it down to a few per megabyte.  The nice thing about having some
one-block holes strewn around the disk is that there is always a nearby
place for a commit block to camp out for a while.

> :>     So your crash recovery code will have to handle
> :>     both meta-data undo and completed and partially completed transactions
> :
> :Tux3 does not use undo logging, only redo, so a transaction is complete
> :as soon as there is enough information on durable media to replay the
> :redo records.
> 
>     Ok, that works.  Be careful w/ your physical logging, it can get
>     expensive.  I think it may be more expensive then you are currently
>     estimating.

I am hoping not to mess that up ;-)

The thing about physical logging in Tux3 is, the logged data blocks are
normally immediately linked into the index blocks as actual file data
or btree metadata blocks, so they are only written once, and they are
hopefully written near where the rest of the structure lives.  It is
only in the optional "update in place" style of physical logging that
the commit data blocks are written twice, which will be no worse than a
traditional journal and probably a lot better because of not needing to
seek to the journal region.

>     One of the reasons why HAMMER uses an UNDO log is that it is a lot
>     less expensive then a REDO log.  I'll show you why:
> 
>     Physical operations related to a single flush group in HAMMER:
> 
> 	modify element in B-Tree node A	(create UNDO w/ original contents of
> 					 B-Tree node A)
> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> 	modify element in B-Tree node A	(no new UNDO needs to be created)
> 	modify element in B-Tree node A	(no new UNDO needs to be created)

Nice, now I understand.  But do you not have to hold all filesystem
transactions that depend on the modification until the btree node has
completed writing to disk?  With logical logging you only have to wait
on the logical commit to complete, into which may be packed a number of
other changes for efficiency.

>     (NOTE: I do realize that the REDO log can be compressed just as the
>     UNDO one, by recording actual data shifts from B-Tree insertions
>     and deletions, it gets really complex when you do that, though, and
>     would not be idempotent).

As described above, it is not how I do it.

>     See the advantage?  Many B-Tree operations working on the same node,
>     or revisiting that node later on, within the same flush group, do not
>     have to lay down more then one UNDO record for that B-Tree node.
> 
>     The physical ops are modifying already modified space so I don't
>     have to record each op in the UNDO log.  Upon crash recovery the whole
>     mess is undone by that first UNDO record.
> 
>     I just run the UNDOs backwards, which allows me to cache the in-flush
>     undo areas with a fixed set of small throw-away tracking structures
>     in system memory.  If a structure gets thrown away it just means an
>     extranious UNDO record will be appended to the UNDO FIFO.

I think I see it, but I have my doubts because you have to block
transactions waiting for the up to date copy of the btree data to land
on disk.  Either that, or you may give userspace the impression that
some transaction has gotten onto stable storage when that is not the
case.

If you do in fact block transactions until the current image of the
btree leaf has been written out then the blocking behavior is no
better than REDO style, and REDO can batch together many updates to
different logical blocks into a single commit block, which seems like
fewer transfers overall.

> :Yes.  Each new commit records the sequence number of the oldest commit
> :that should be replayed.  So the train lays down track in front of
> :itself and pulls it up again when the caboose passes.  For now, there
> :is just one linear sequence of commits, though I could see elaborating
> :that to support efficient clustering.
> 
>     Yah.  I see.  Noting that issue I brought up earlier about the
>     "rm a/b/c".  Locking the caboose of your log until all related physical
>     operations have been completed could create a problem.

I hope that is cleared up now.

> :One messy detail: each forward log transaction is written into free
> :space wherever physically convenient, but we need to be sure that that
> :free space is not allocated for data until log rollup has proceeded
> :past that transaction.  One way to do this is to make a special check
> :against the list of log transactions in flight at the point where
> :extent allocation thinks it has discovered a suitable free block, which
> :is the way ddsnap currently implements the idea.  I am not sure whether
> :I am going to stick with that method for Tux3 or just update the disk
> :image of the free tree to include the log transaction blocks and
> :somehow avoid logging those particular free tree changes to disk.  Hmm,
> :a choice of two ugly but workable methods, but thankfully neither
> :affects the disk image.
> 
>     You might be able to adapt the delayed-reuse concept I used in HAMMER
>     to simplify the problem, possibly to the point of making those 
>     operations replayable.
> 
>     Another thing I did in HAMMER was reserve space on the frontend for 
>     data blocks without modifying the freemap at all.  This allows the
>     frontend to write file data blocks to the physical media without
>     having to wait for synchronization with meta-data or the backend or
>     anything.
> 
>     I keep track of the reserved space with small in-memory structures
>     (whos function is to prevent other allocations from using the reserved
>     space until it can be committed).  Freemap updates are then committed
>     in the same flush that commits the related B-Tree insertions and
>     deletions.  Thus the freemap is kept in sync.

I am gravitating towards that style for Tux3's commit-related short
term allocations.

> :Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
> :beat that compactness.  We have learned through bitter experience that
> :anything other than an Ext2/UFS style physically stable block of
> :dirents makes it difficult to support NFS telldir cookies accurately
> :because NFS vs gives us only a 31 bit cookie to work with, and that is
> :not enough to store a cursor for, say, a hash order directory
> :traversal.  This is the main reason that I have decided to go back to
> :basics for the Tux3 directory format, PHTree, and make it physically
> :stable.
> 
>     The cookies are 64 bits in DragonFly.  I'm not sure why Linux would
>     still be using 32 bit cookies, file offsets are 64 bits so you
>     should be able to use 64 bit cookies.

It is not Linux that perpetrates this outrage, it is NFS v2.  We can't
just tell everybody that their NFS v2 clients are now broken.

>     For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key
>     and 32 bits is an iterator to deal with hash collisions.  Poof,
>     problem solved.

Which was my original proposal to solve the problem.  Then Ted told me
about NFS v2 :-O

Actually, NFS hands you a 62 bit cookie, with the high bit of each s32
half unused.  NFS v2 gives you a 31 bit cookie.  Bleah.
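
For illustration, here is how a hash-plus-iterator cookie fits into those
limits; hypothetical code, nothing more than arithmetic on the bit budget
above:

   #include <stdint.h>

   /* 62 usable bits: 31 bits of hash in the high s32 half, 31 bits of
      collision iterator in the low half, sign bit of each half left clear. */
   static uint64_t nfs3_cookie(uint32_t hash, uint32_t seq)
   {
           return ((uint64_t)(hash & 0x7fffffff) << 32) | (seq & 0x7fffffff);
   }

With NFS v2's 31 bits there is no room for a hash and an iterator both,
which is exactly why a physically stable dirent block helps: the cookie
can just be a block number and an offset.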

> :In the PHTree directory format lookups are also trivial: the directory
> :btree is keyed by a hash of the name, then each dirent block (typically
> :one) that has a name with that hash is searched linearly.  Dirent block
> :pointer/hash pairs are at the btree leaves.  A one million entry
> :directory has about 5,000 dirent blocks referenced by about 1000 btree
> :leaf blocks, in turn referenced by three btree index blocks (branching
> :factor of 511 and 75% fullness).  These blocks all tend to end up in
> :the page cache for the directory file, so searching seldom references
> :the file index btree.
> 
>     That seems reasonable.  I'll look up the PHTree stuff, it sounds
>     interesting.  Personally speaking I don't mind eating 64 bytes per
>     directory entry.  I can accomodate file names up to 16 bytes in
>     the B-Tree element itself so 95% of the namespace doesn't even
>     require a look-aside data reference.
> 
>     The one mistake I made, which I didn't have time to fix in the
>     first HAMMER release, is that my directory hash is a standard crc
>     which means that records in the B-Tree aren't even close to being
>     in any sort of lexical order.  I will probably wind up changing
>     the hash calculation a bit to get some lexical sorting information
>     stuffed into it... should be doable since the hash keys are 64 bits.

Yes, I noticed that.  Check out dx_hack_hash:

   http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38

It distributed hashes of ASCII strings very well for me, with few
funnels.  It way outperformed some popular hashes like TEA.  Ted's cut
down cryptographic hash is yet more uniform but costs much more CPU.
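
For reference, the whole function is tiny (quoted from memory here; the
link above has the canonical copy, and __u32 is the kernel's unsigned
32 bit type):

   static __u32 dx_hack_hash(const char *name, int len)
   {
           __u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
           while (len--) {
                   __u32 hash = hash1 + (hash0 ^ (*name++ * 7152373));

                   if (hash & 0x80000000)
                           hash -= 0x7fffffff;
                   hash1 = hash0;
                   hash0 = hash;
           }
           return hash0 << 1;
   }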

> :will pass Hammer in lookup speed for some size of directory because of
> :the higher btree fanout.  PHTree starts from way behind courtesy of the
> :32 cache lines that have to be hit on average for the linear search,
> :amounting to more than half the CPU cost of performing a lookup in a
> :million element directory, so the crossover point is somewhere up in
> :the millions of entries.
> 
>     Well, the B-Tree fanout isn't actually that big a deal.  Remember
>     HAMMER reblocks B-Tree nodes along with everything else.  B-Tree
>     nodes in a traversal are going to wind up in the same 16K filesystem
>     block.

It affects the number of probes you have to do for the lookup.
Binsearching hits one cache line per test while each node lookup hits a
lot more.

>     That leaves just element space overhead, which frankly is not much
>     of an issue on a production system because kernels have to cache a lot
>     more then just the directory entry.  The in-memory vnode and inode
>     structures take up a lot more space then the cached directory entry.
>     So even though HAMMER's directory entry is a 64-byte B-Tree element
>     and roughly double the overhead of a UFS-style directory entry, the
>     overall cache footprint in the system is only ~20% worse.

It is not just the cache footprint but the time it takes to get your
key tables into L1 cache.  To be sure, we are talking about small
details here, but these small details can add up to a difference of a
factor of two in execution speed.  Which maybe you don't care much about
now that you are orders of magnitude faster than what came before ;-)

> :...
> :>     that the B-Tree can seek to directly.  In fact, at one point I tried
> :>     indexing on delete_tid instead of create_tid, but created one hell of a
> :>     mess in that I had to physically *move* B-Tree elements being deleted
> :>     instead of simply marking them as deleted.
> :
> :This is where I do not quite follow you.  File contents are never
> :really deleted in subsequent versions, they are just replaced.  To
> :truncate a file, just add or update size attribute for the target
> :version and recover any data blocks past the truncate point that
> :belongs only to the target version.
> 
>     Truncation is a deletion.  That is, the B-Tree elements have to
>     be marked as having been deleted so if you truncate-extend a file
>     you see the new space as being full of zeros.
> 
>     Take a 1MB file.  Truncate it to 512K, then truncate-extend the
>     file back to 1MB.  Or seek-write to extend the file...
>     same difference, really.

Following my prescription above, the file will be full of zeros when
re-extended because the data pointers were removed.  But I think you
are right, truncate really is a delete and I actually argued myself
into understanding that.

> :I plan to scan the whole btree initially, which is what we do in ddsnap
> :and it works fine.  But by looking at the mtime attributes in the inode
> :I can see in which versions the file data was changed and thus not have
> :to scan most files.  Eventually, building in some accelerator to be able
> :to skip most inode leaves as well would be a nice thing to do.  I will
> :think about that in the background.
> 
>     Yah, that makes sense.  You won't have to scan every last tidbit of
>     information.  HAMMER would have had to if I had not added the
>     propagating transaction id feature to it.
> 
>     You are still going to have a scaleability issue but it sounds like
>     you could implement the exact same optimization and completely solve
>     the problem.

Yes, I have thought about it a little more, and I imagine something
like a small array of dirty bits that climb up the btree with the help
of logical logging, where each bit means that there is something
interesting for some corresponding replication target to look at,
somewhere down in the subtree.
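
Something like this, with all names invented: one dirty bit per
replication target, propagating toward the root so a scan can skip clean
subtrees:

   struct bnode {
           struct bnode *parent;
           unsigned char dirty_mask;    /* bit n set: work below for target n */
           /* ... keys, child pointers ... */
   };

   static void mark_subtree_dirty(struct bnode *node, int target)
   {
           unsigned char bit = 1 << target;     /* assumes target < 8 */

           /* climb until an ancestor already knows, or we pass the root */
           while (node && !(node->dirty_mask & bit)) {
                   node->dirty_mask |= bit;
                   node = node->parent;
           }
   }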

> :>     The way it works is that HAMMER's frontend is almost entirely
> :>     disconnected from the backend.  All meta-data operations are cached
> :>     in-memory.  create, delete, append, truncate, rename, write...
> :>     you name it.  Nothing the frontend does modifies any meta-data
> :>     or mata-data buffers.
> :
> :Things are a little different in Linux.  The VFS takes care of most of
> :what you describe as your filesystem front end, and in fact, the VFS is
> :capable of running as a filesystem entirely on its own, just by
> :supplying a few stub methods to bypass the backing store (see ramfs).
> :I think this is pretty much what you call your front end, though I
> :probably missed some major functionality you provide there that the
> :VFS does not directly provide.
> 
>     Well, for file data, up to a point, and HAMMER does just use the 
>     kernel's buffer cache for file data.
> 
>     Not for directory entries.  I'm pretty sure Linux uses a throw-away
>     namecache just like we do, so unless it has a mechanic for flush
>     sequencing of dirty namecache entries you need in-VFS structural
>     support for caching the operations.

The Linux dentry cache actually implements proper namespace semantics
all by itself without needing to flush anything, which is what Ramfs
is.  Ramfs just lives entirely in cache without even being backed by
a ramdisk, until the power goes off.

>     In particular, caching a namespace deletion (rm, rename, etc)
>     without touching the meta-data requires implementing a merged lookup
>     so you can cancel-out the directory entry that still exists on-media.
>     So it isn't entirely trivial.

Linux implements "negative dentries" to say "this entry is not here".

>     Things related to file data are handled by the kernel for the most
>     part, but truncations still need considerable support within the
>     VFS to properly handle seek-write or truncate file extensions.
>     In order to avoid touching any meta-data when handling a truncation,
>     a similar canceling field in the in-memory inode needs to be
>     maintained until the backend is able to delete the related data
>     blocks.  Lookups have to take into account the cached truncation offset
>     so as not to read data from the media past the truncation point (e.g.
>     if you seek-write-extend or truncate-extend the file immediately
>     after truncating the file).

I have not looked at that part for a while now, and I did not look at
it just now, but Al Viro has been fiddling with it for years getting the
bugs out one at a time.  The last major one I heard of was some time
ago.  It works somehow, I should look more closely at it.

>     Most filesystems will dirty meta-data buffers related to the media
>     storage as part of executing an operation on the frontend.  I don't
>     know of any which have the level of separation that HAMMER has.

Modern Linux filesystems get close I think.  Particularly in journalled
data mode, Ext3 marks all the buffers it deals with as "don't touch" to
the VFS and VMM, which have no idea how to obey the necessary ordering
constraints.  There is also this thing called the "journalling block
device" that provides an abstract implementation of a physical journal,
which is actually used by more than one filesystem.  (Three I think,
including Ext4, however I now hear noise about rewriting it.)

> :Note: no other filesystem on Linux current works this way.  They all
> :pretty much rely on the block IO library which implements the fairly
> :goofy one-block-at-a-time transfer regimen.  The idea of setting up bio
> :transfers directly from the filesystem is unfortunately something new.
> :We should have been doing this for a long time, because the interface
> :is quite elegant, and actually, that is just what the block IO library
> :is doing many levels down below.  It is time to strip away some levels
> :and just do what needs to be done in the way it ought to be done.
> 
>     Jeeze, BSD's have been doing that forever.  That's what VOP_BMAP is
>     used for.  I'm a little surprised that Linux doesn't do that yet.
>     I'll expand on that down further.

Either you forgot to expand or I missed it.  I am interested.

> :Directory format to index the extent within a leaf:
> :
> :   struct entry { unsigned loglo:24, offset:8 };
> :   struct group { unsigned loghi:24, count:8 };
>     ...
>     Man, those are insanely small structures.  Watch out for the cpu
>     and in-kernel-memory tracking overhead.

I have written the code, mostly, and it is tight.  A similar idea has
been part of ddsnap for 5 years now.
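
Roughly: the group carries the high 24 bits of a 48 bit logical address
plus a count of entries, each entry the low 24 bits plus an offset into
the leaf, so putting an address back together is just a shift and an or.
A simplified sketch, not the actual leaf code:

   #include <stdint.h>

   struct entry { unsigned loglo:24, offset:8; };
   struct group { unsigned loghi:24, count:8; };

   static uint64_t entry_logaddr(const struct group *g, const struct entry *e)
   {
           return ((uint64_t)g->loghi << 24) | e->loglo;
   }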

> :  * The linux page cache not only caches data but provides an efficient
> :    access path to logical blocks that avoids having to consult the
> :    filesystem metadata repeatedly under heavy traffic.
> 
>     Actually, BSD buffer caches are logical caches, and always have been.
>     Linux actually took a page from our book on that one.  In fact, if
>     you do a google search I'll bet you will see my name in some of those
>     Linux VM system debates 5-10 years ago :-)

I remember it well.  You were the one who put Rik up to the reverse map
design that caused the main fireworks between 2.3 and 2.4.9.  (I was
the one who finally got it to work by social-engineering the merging of
the reverse map part with Andrea's more robust LRU design.)

I also remember the time when BSD buffers were far superior to Linux
ones, yes.  In 2.0 the arrangement sucked enormously: sort-of coherency
between the buffer cache and the page cache was achieved by physically
copying changes from one to the other.

Today, the buffer cache and page cache are nearly fully unified.  But
the unification will never be complete until the page size is variable,
so the remaining tasks done by buffers can be done by pages instead.

Anyway, this part of your OS is more similar than different, which will
help a lot with a port.

>     The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure
>     is a logically indexed entity which can also contain a cached physical
>     translation (which is how the direct data bypass works).

Linux caches physical translations by gluing one or more buffer heads
onto each page, which is one of the few remaining uses of buffer heads.
To finally get rid of buffer heads the physical cache needs to be done
some other way.  I am sure there are lots of better ways.

>     I'm just going to throw this out.  In an earlier incarnation of the
>     HAMMER design I had a forward log which embedded data, and the idea
>     was that I would leave the data in-place in the log.  I was going to
>     use a blockmap to turn the log into a sparse-map as a means of
>     recovering space from it.
> 
>     I eventually threw away that idea but it occurs to me that you might
>     be able to use a similar idea to compress small bits of meta-data.
 
This is similar to what I do when I write out a physical block and use
a logical log entry to make the on-disk parent point at it (actually,
it is a promise to make the parent point at it some time in the future).
But then I do not let those things live very long.  Eventually it might
make sense to let them live a little longer, perhaps by generalizing
the idea of the single linear log to a forest of little logs, some of
which could be left around for a long time instead of being quickly
rolled up into the canonical structures.

> :>     Orphan inodes in HAMMER will always be committed to disk with a
> :>     0 link count.  The pruner deals with them after a crash.  Orphan
> :>     inodes can also be commited due to directory entry dependancies.
> :
> :That is nice, but since I want to run without a pruner I had better
> :stick with the logical logging plan.  The link count of the inode will
> :also be committed as zero, or rather, the link attribute will be
> :entirely missing at that point, which allows for a cross check.
> 
>     It won't work for you if your logical log is meant to be short-lived.
>     Programs can hold unlinked inodes open for hours, even days.  Even
>     longer.  You would have to rewrite the entry in order to be able to
>     move the caboose.  Well, that isn't a bad idea, just more complication.

Indeed, I need to handle that.  Since I do want to stick with a simple
linear sequence of logical log commits for now, I do not want to leave
any of them sitting around for a long time.  One easy thing to do is to
put any orphans aside at rollup time, in an unversioned orphan file, say.

>     Maybe the ticket for DragonFly is to simply break storage down into
>     a reasonable number of pieces, like cutting up a drive into 256 pieces,
>     and create a layer to move and glue those pieces together into larger
>     logical entities.

Linux LVM is all about device mapper, and actually, device mapper does
not really have a reason to be a separate subsystem, it can just be the
way the block layer works.  If interested, check out my posts on bio
stacking and rolling device mapper's "clone_and_map" idea up into the
generic block layer.  I think all unixen should do this.

Other than that, device mapper is just about making device numbers
point at different virtual devices, transparently to userspace.  Even
device stacking works like that: extract the "virtual" device (which
may or may not be a real device) from the device number you are going
to remap, store it in some other device number, create a new virtual
device on top of the other device number, finally store the new virtual
device in the old device number.  The old device number has now been
transparently stacked.  Nothing to it.

>     I'm editing down as much as I can :-)   This one took 6 hours, time to
>     get lunch!

Only 3 hours so far here...  hmm, make that four.

Daniel




