Description of the Journaling topology

Tue Dec 28 00:47:30 PST 2004

    I'm making good progress on the journaling layer.  It will
    still be a week or two before it will be operational, but I've got
    the protocol pretty much figured out and will be laying it down in the
    tree as I get it working.  The work should have virtually no impact on
    the system since the codepaths are only exercised when a journal is
    attached to a mount point, so I will probably be making smaller commits
    to the tree than I did for the VFS work.

    When all is said and done the journaling mechanism is going to look
    like this:

			      -> [ MEMORY FIFO ] -> [ worker thread ] -> STREAM
			     /	   e.g. 16MB	    [ secondary spool]
    VFSOP -> [journal shim]--				e.g. 16GB
            (transaction id) \
			      -> [filesystem VFS op]


		          ------> target (e.g. an off-site machine).
		    STREAM	     |
			  <----------+
	     [transid acks going back]

	     STREAM = generic file descriptor.  e.g. regular file, socket,
	     fifo, pipe, whatever.  Half or full duplex.
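
    The shim at the left of the diagram can be modelled in a few lines.
    This is only a userland sketch in Python, not kernel code, and every
    name in it (Mount, journal_shim, fs_write) is hypothetical.  It shows
    one thing: the journaling path is taken only when a journal is
    attached to the mount, and the filesystem's own VFS op runs either
    way.

```python
import itertools
from collections import deque


class Mount:
    """Hypothetical mount point; 'journal' stays None until one is attached."""

    def __init__(self):
        self.journal = None   # stands in for the MEMORY FIFO
        self.files = {}       # stands in for the real filesystem state


_next_transid = itertools.count(1)


def fs_write(mount, path, data):
    """Stand-in for the filesystem's own VFS op."""
    mount.files[path] = data


def journal_shim(mount, opname, fs_op, *args):
    """Journal the op if a journal is attached, then run the real VFS op."""
    transid = None
    if mount.journal is not None:
        transid = next(_next_transid)
        mount.journal.append((transid, opname, args))
    fs_op(mount, *args)
    return transid


m = Mount()
journal_shim(m, "write", fs_write, "/a", b"one")   # no journal: plain VFS op
m.journal = deque()                                # attach a journal
tid = journal_shim(m, "write", fs_write, "/b", b"two")
```

    Only the second write is journaled, and it is the first operation
    to consume a transaction id.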


    The STREAM will optionally be two-way, allowing the journaling target
    to tell the journaling layer when a transaction id has been committed to
    hard storage.  This will also allow the journaling layer to retain
    portions of the journaling stream in the MEMORY FIFO and SECONDARY SPOOL
    in case the stream connection breaks and needs to be re-created (as would
    happen quite often for an off-site journaling stream), without losing
    any data, and to handle data backlogs that occur if the STREAM is a slow
    off-site link or if a glitch occurs.  The MEMORY FIFO will allow us to
    batch operations, reducing context switches and allowing me to implement
    the worker thread concept as a very efficient asynchronous design.
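
    The retain-until-acked behaviour described above might look roughly
    like the following.  Again this is a Python sketch under stated
    assumptions, not the actual protocol: the JournalFifo class, its
    method names, and the rule that an ack covers every transaction id
    up to and including the acked one are all assumptions made for
    illustration.

```python
from collections import deque


class JournalFifo:
    """Keeps every record until the target acks its transaction id."""

    def __init__(self):
        self.records = deque()   # (transid, payload), oldest first
        self.sent = 0            # records at the head already written out

    def push(self, transid, payload):
        self.records.append((transid, payload))

    def drain(self, write):
        """Worker thread body: write all unsent records as one batch."""
        batch = list(self.records)[self.sent:]
        for transid, payload in batch:
            write(transid, payload)
        self.sent = len(self.records)
        return len(batch)

    def ack(self, transid):
        """Target committed everything up to and including transid."""
        while self.records and self.records[0][0] <= transid:
            self.records.popleft()
            self.sent = max(0, self.sent - 1)

    def reconnect(self):
        """Stream was re-created: everything unacked must be sent again."""
        self.sent = 0


fifo = JournalFifo()
wire = []                                  # transids seen on the stream
for tid in (1, 2, 3):
    fifo.push(tid, b"op%d" % tid)
fifo.drain(lambda t, p: wire.append(t))    # one batch: 1, 2, 3
fifo.ack(2)                                # target has committed 1 and 2
fifo.reconnect()                           # link dropped and came back
fifo.drain(lambda t, p: wire.append(t))    # only 3 is sent again
```

    Nothing is discarded until it is acked, so a broken connection costs
    a resend of the unacked tail and nothing more.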

    Since memory is limited the worker thread will also implement a secondary
    spooling store for the case where the journaling stream descriptor is
    lost for a long period of time, or if it is simply a slow link (e.g.
    real-time off-site backup).  In these cases we need a secondary spool
    to absorb the journaled data and allow the filesystem to continue to
    operate instead of just locking it up until the stream is recreated or
    catches up.  The idea here is to allow potentially huge secondary
    spools to be created to literally absorb many hours worth of filesystem
    activity, giving a system manager plenty of time to fix things if they
    break, and ensuring that slow off-site links do not slow normal system
    activity down to the speed of the link.  I consider that extremely
    important: it makes the whole concept of a real-time off-site backup
    feasible.
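
    One way to picture the spill-over is sketched below in Python.  The
    SpooledFifo class, the length-prefixed record format, and the use of
    an in-memory io.BytesIO in place of an on-disk spool file are all
    assumptions for illustration.  The sketch also never reclaims spool
    space, which a real implementation would have to do.

```python
import io
import struct
from collections import deque


class SpooledFifo:
    """Bounded memory FIFO that spills to a spool file instead of blocking."""

    def __init__(self, mem_limit, spool):
        self.mem = deque()
        self.mem_bytes = 0
        self.mem_limit = mem_limit
        self.spool = spool   # seekable binary file
        self.rpos = 0        # offset of the next unread spooled record
        self.wpos = 0        # offset where the next record is written

    def push(self, rec):
        """Never blocks the caller; order is preserved across the spill."""
        spool_empty = self.rpos == self.wpos
        if spool_empty and self.mem_bytes + len(rec) <= self.mem_limit:
            self.mem.append(rec)
            self.mem_bytes += len(rec)
        else:
            self.spool.seek(self.wpos)
            self.spool.write(struct.pack("<I", len(rec)) + rec)
            self.wpos = self.spool.tell()

    def _refill(self):
        """Move spooled records back into memory, oldest first."""
        while self.rpos < self.wpos:
            self.spool.seek(self.rpos)
            (n,) = struct.unpack("<I", self.spool.read(4))
            if self.mem and self.mem_bytes + n > self.mem_limit:
                break
            self.mem.append(self.spool.read(n))
            self.mem_bytes += n
            self.rpos = self.spool.tell()

    def pop(self):
        """Return the oldest pending record, or None if there is none."""
        if not self.mem:
            self._refill()
        if not self.mem:
            return None
        rec = self.mem.popleft()
        self.mem_bytes -= len(rec)
        return rec


fifo = SpooledFifo(mem_limit=8, spool=io.BytesIO())
for i in range(5):
    fifo.push(b"rec%d" % i)    # 4 bytes each: two fit in memory, three spill
out = []
while True:
    rec = fifo.pop()
    if rec is None:
        break
    out.append(rec)
```

    Once anything has been spooled, new records keep going to the spool
    until it has been read back, so the stream still sees the records in
    their original order.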

    That's the concept in a nutshell.  In addition to all of that the data
    being journaled will have a number of options... e.g. the journaling
    data stream could be a simple non-reversible stream (a 'replay' stream),
    or a fully reversible stream (the ability to 'move' the regenerated
    filesystem forwards or backwards in time simply by playing the journaling
    stream forwards or backwards), etc.  It is going to be a *VERY*
    powerful mechanism that no other BSD (or even Linux) will have.
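
    The difference between the two kinds of stream comes down to whether
    each record carries undo data.  The Python sketch below models the
    filesystem as a plain dict and uses a made-up record layout (key,
    old, new); neither is the real journal format.  A replay-only stream
    would simply omit the "old" field and could then only be played
    forwards.

```python
def make_record(state, key, new):
    """Build a reversible record; 'old' is the undo data (None = absent)."""
    return {"key": key, "old": state.get(key), "new": new}


def apply(state, rec, field):
    """Set the key to rec[field], or remove the key if that value is None."""
    value = rec[field]
    if value is None:
        state.pop(rec["key"], None)
    else:
        state[rec["key"]] = value


def play_forward(state, stream):
    for rec in stream:
        apply(state, rec, "new")


def play_backward(state, stream):
    for rec in reversed(stream):
        apply(state, rec, "old")


fs = {}
journal = []
ops = [("/a", b"1"), ("/a", b"2"), ("/b", b"x"), ("/a", None)]
for key, new in ops:
    rec = make_record(fs, key, new)
    journal.append(rec)
    play_forward(fs, [rec])
latest = dict(fs)                 # state after all four operations

play_backward(fs, journal[2:])    # rewind the last two operations
earlier = dict(fs)                # state as of the second operation

play_backward(fs, journal[:2])    # rewind the rest
```

    Playing the tail of the journal backwards restores "/a" and removes
    "/b"; playing the remainder backwards returns the model to empty.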

    Eventually (not in two weeks) the journaling layer will make these acked
    transaction ids available to any journal-aware VFS filesystem allowing
    the filesystem to leverage the kernel's journaling layer for its own use
    and/or to control the underlying filesystem's own management of commits
    to physical storage.  I also intend to use the journaling layer, with
    suitable additional cache coherency protocols, to handle filesystem
    synchronization in a clustered environment.  In particular, I want the
    ability to do high-level cache-coherent replication that would be immune
    to catastrophic corruption, rather than block-device-level replication,
    which tends to propagate corrupting events.  As you can see, I have
    *BIG* plans for the journaling layer over the next few years.

						-Matt