Description of the Journaling topology
dillon at apollo.backplane.com
Tue Dec 28 00:47:30 PST 2004
I'm making good progress on the journaling layer. It will
still be a week or two before it will be operational, but I've got
the protocol pretty much figured out and will be laying it down in the
tree as I get it working. The work should have virtually no impact on
the system since the codepaths are only exercised when a journal is
attached to a mount point, so I will probably be making smaller commits
to the tree than I did for the VFS work.
When all is said and done, the journaling mechanism is going to look
like this:

                            -> [ MEMORY FIFO ] -> [ worker thread ] -> STREAM
                           /      e.g. 16MB       [ secondary spool ]
 VFSOP -> [journal shim] --           e.g. 16GB
  (transaction id)         \
                            -> [filesystem VFS op]

 STREAM ------> target (e.g. an off-site machine).
        <------ [transid acks going back]
STREAM = generic file descriptor, e.g. regular file, socket,
fifo, pipe, whatever. Half or full duplex.
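The shape of the pipeline above can be sketched in user space. This is a
minimal, hypothetical illustration (the record framing, the JOURNAL_MAGIC
value, and all names are mine, not from the actual design): VFS ops are
framed with a transaction id, pushed into a bounded memory FIFO, and a
worker thread drains the FIFO to any file-like stream descriptor.

```python
import io
import queue
import struct
import threading

JOURNAL_MAGIC = 0x4A52  # hypothetical record magic, for illustration only

def make_record(transid, payload):
    """Frame one journal record: magic, transaction id, payload length, payload."""
    return struct.pack(">HQI", JOURNAL_MAGIC, transid, len(payload)) + payload

class JournalShim:
    """Assigns transaction ids and queues records for a worker thread."""

    def __init__(self, stream, fifo_size=16):
        self.fifo = queue.Queue(maxsize=fifo_size)  # the MEMORY FIFO
        self.stream = stream                        # any file-like descriptor
        self.next_transid = 1
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def vfsop(self, payload):
        """Called from the VFS op path: journal the op, return its transid."""
        transid = self.next_transid
        self.next_transid += 1
        self.fifo.put(make_record(transid, payload))
        return transid

    def _drain(self):
        # Worker thread: batch records out of the FIFO onto the stream.
        while True:
            rec = self.fifo.get()
            if rec is None:          # shutdown sentinel
                break
            self.stream.write(rec)

    def close(self):
        self.fifo.put(None)
        self.worker.join()
```

Used with an in-memory stream: `shim = JournalShim(io.BytesIO())`, then
`shim.vfsop(b"...")` from the op path, `shim.close()` to flush; the same
object would work with a socket or pipe, which is the point of making
STREAM a generic descriptor.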
The STREAM will optionally be two-way, allowing the journaling target
to tell the journaling layer when a transaction id has been committed to
hard storage. This will also allow the journaling layer to retain
portions of the journaling stream in the MEMORY FIFO and SECONDARY SPOOL
in case the stream connection breaks and needs to be re-created (as would
happen quite often for an off-site journaling stream), without losing
any data, and to absorb the backlogs that build up when the STREAM is a
slow off-site link or when a glitch occurs. The MEMORY FIFO will allow us to
batch operations, reducing context switches and allowing me to implement
the worker thread concept as a very efficient asynchronous design.
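The ack-driven retention described above can be sketched very simply:
records stay buffered until the target acknowledges their transaction id,
so a broken stream can be re-created and the unacked tail replayed. A
minimal sketch, with illustrative names (this is not the actual kernel
interface):

```python
from collections import OrderedDict

class RetainedJournal:
    """Retains emitted records until the target acks their transaction id."""

    def __init__(self):
        self.unacked = OrderedDict()  # transid -> record bytes, oldest first

    def emit(self, transid, record):
        """Record goes out on the stream AND into the retention buffer."""
        self.unacked[transid] = record
        return record                 # caller writes this to the stream

    def ack(self, transid):
        """Target committed everything up to transid; drop those records."""
        for tid in list(self.unacked):
            if tid <= transid:
                del self.unacked[tid]
            else:
                break                 # transids are emitted in ascending order

    def replay(self):
        """After the stream is re-created, resend everything still unacked."""
        return list(self.unacked.values())
```

The design point this illustrates: the ack flowing back on the half-duplex
return path is what bounds how much the journaling layer must retain.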
Since memory is limited, the worker thread will also implement a secondary
spooling store for the case where the journaling stream descriptor is
lost for a long period of time, or where it is simply a slow link (e.g. a
real-time off-site backup). In these cases we need a secondary spool
to absorb the journaled data and allow the filesystem to continue to
operate, instead of locking it up until the stream is recreated or
catches up. The idea here is to allow potentially huge secondary
spools to be created that can literally absorb many hours' worth of
filesystem activity, giving a system manager plenty of time to fix things
when they break, and ensuring that slow off-site links do not throttle
normal system activity down to the speed of the link. I consider that
extremely important: it makes the whole concept of a real-time off-site
backup feasible.
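The overflow behavior can be sketched as a two-tier queue: records live in
the bounded memory FIFO until it fills, then spill to an on-disk spool,
and the drain side preserves ordering by emptying memory before the
spool. This is a hypothetical user-space sketch (the sizes, the length
framing, and the temp-file spool are my illustration, not the design):

```python
import tempfile
from collections import deque

class SpoolingFifo:
    """Bounded memory FIFO that overflows, in order, to a disk spool."""

    def __init__(self, mem_limit):
        self.mem = deque()
        self.mem_bytes = 0
        self.mem_limit = mem_limit              # e.g. the 16MB MEMORY FIFO
        self.spool = tempfile.TemporaryFile()   # e.g. a 16GB secondary spool
        self.spool_read = 0
        self.spool_write = 0

    def put(self, record):
        # Ordering: once anything is spooled, new records must spool too,
        # otherwise a later record could overtake an earlier spooled one.
        if self.spool_write > self.spool_read or \
           self.mem_bytes + len(record) > self.mem_limit:
            self.spool.seek(self.spool_write)
            self.spool.write(len(record).to_bytes(4, "big") + record)
            self.spool_write = self.spool.tell()
        else:
            self.mem.append(record)
            self.mem_bytes += len(record)

    def get(self):
        # Memory holds the oldest records, so drain it first.
        if self.mem:
            rec = self.mem.popleft()
            self.mem_bytes -= len(rec)
            return rec
        if self.spool_read < self.spool_write:
            self.spool.seek(self.spool_read)
            n = int.from_bytes(self.spool.read(4), "big")
            rec = self.spool.read(n)
            self.spool_read = self.spool.tell()
            return rec
        return None  # fully drained; the stream has caught up
```

Note that `put()` never blocks: the filesystem keeps running at full speed
while the spool absorbs the backlog, which is exactly the property the
text argues makes real-time off-site backup feasible.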
That's the concept in a nutshell. In addition to all of that, the data
being journaled will have a number of options... e.g. the journaling
data stream could be a simple non-reversible stream (a 'replay' stream),
or a fully reversible stream (the ability to 'move' the regenerated
filesystem forwards or backwards in time simply by playing the journaling
stream forwards or backwards), etc. It is going to be a *VERY*
powerful mechanism that no other BSD (or even Linux) will have.
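The replay-vs-reversible distinction comes down to what each record
carries: a replay record needs only the new data (redo image), while a
reversible record also carries the prior contents (undo image), so the
regenerated filesystem can be walked in either direction. A toy sketch,
with the record layout entirely my own illustration:

```python
def apply_forward(state, rec):
    """Play one reversible record forward: install the redo image."""
    path, undo, redo = rec
    state[path] = redo

def apply_backward(state, rec):
    """Play the same record backward: restore the undo image."""
    path, undo, redo = rec
    if undo is None:
        state.pop(path, None)   # record was a creation; undoing removes it
    else:
        state[path] = undo

# A reversible stream for one hypothetical file, oldest record first:
stream = [
    ("/etc/motd", None, b"v1"),   # creation: undo image is 'absent'
    ("/etc/motd", b"v1", b"v2"),  # overwrite: undo image is old contents
]
```

Playing `stream` forward yields the final contents; playing it backward,
newest record first, returns the state to before the first record, which
is the 'move the filesystem backwards in time' capability described above.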
Eventually (not in two weeks) the journaling layer will make these acked
transaction ids available to any journal-aware VFS filesystem allowing
the filesystem to leverage the kernel's journaling layer for its own use
and/or to control the underlying filesystem's own management of commits
to physical storage. I also intend to use the journaling layer, with
suitable additional cache coherency protocols, to handle filesystem
synchronization in a clustered environment. In particular, it will give us
the ability to do high-level cache-coherent replication that is immune to
catastrophic corruption, rather than block-device-level replication, which
tends to propagate corrupting events. As you can see, I have *BIG* plans
for the journaling layer over the next few years.