The Clustering and Userland VFS transport protocol - summary
Matthew Dillon
dillon at apollo.backplane.com
Wed May 10 21:58:27 PDT 2006
I have decided to use the journaling protocol's stream-based data
structures, with some modification, as the basis for communication
between machines in a cluster as well as the basis for communications
between a userland VFS implementation and the kernel.
I've written up a fairly long summary of my intent, so people have
an idea where I'm going with all of this :-).
For anyone who hasn't read up on our journaling protocol, it is
a highly flexible and dynamic recursive transaction-oriented data
structure that can be represented over a data stream using a fixed-sized
memory FIFO to buffer the stream. Streams are always two-way. The
basic journaling protocol has a primary data stream in one direction
and an 'ack' stream in the other. What I intend to do is expand this
to allow transactions to occur in both directions independently,
creating a full duplex transport link between two entities.
In addition to making the link fully symmetric, some additional
modifications will have to be made to the journaling data structures
to make it work well for userland VFS or clustering links.
The basic problem is that the current data structure is just a bit too
flexible (see sys/journal.h). The size of the transaction is not
really known until the entire transaction has been parsed, and that
creates a serious problem for the receiver. The core of the problem
is that data embedded within the transaction can be huge... a single
transaction involving a 1GB write() would have to embed 1GB of data,
for example. While the data is broken up into bite-sized pieces by
the protocol, the receiver still has to buffer an unbounded amount of
data just to parse the transaction as a whole. The current 'jscan'
user utility malloc()s enough memory to hold the entire transaction,
and that clearly does not scale when a transaction can literally be
gigabytes long.
So I am going to make a few changes to the journaling stream
structures before I really get into writing the userland VFS. These
changes will allow the protocol to be used not only for the userland
VFS interface, but also for the data streams that will eventually
link machines within a cluster, and of course for the existing
journaling feature.
* Transaction records will be limited in size, allowing the entire
transaction sans data to be efficiently parsed directly out of a
memory-FIFO whose size is pre-negotiated.
* Data elements will be passed by reference and the data itself, if
too large, will then be passed as bite-sized cache elements in
separate transactions or passed out of band via some other negotiated
mechanism.
* The addition of a data element caching sub-protocol between the
originator and target.
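Just as a rough sketch of what the first two items might look like on the
wire (the structures, names, and sizes here are made up for illustration,
not the actual sys/journal.h definitions):

    /*
     * Hypothetical sketch only.  A transaction record is bounded in size
     * so it always fits in the pre-negotiated memory FIFO; bulk data is
     * referenced, not embedded.
     */
    #include <stdint.h>

    #define JTX_MAXRECSIZE  16384   /* negotiated upper bound, example value */

    struct jtx_record {
        uint64_t    transid;        /* sequential transaction id */
        uint32_t    reclen;         /* total record length <= JTX_MAXRECSIZE */
        uint16_t    opcode;         /* e.g. write, setattr, rename */
        uint16_t    ndataref;       /* number of data references that follow */
        /* followed by ndataref jtx_dataref structures, then small inline data */
    };

    struct jtx_dataref {
        uint64_t    cacheid;        /* globally unique cache ID (see below) */
        uint64_t    offset;         /* byte offset within the logical object */
        uint64_t    bytes;          /* length of the referenced data */
        uint32_t    flags;          /* e.g. inline, via-stream, out-of-band */
        uint32_t    reserved;
    };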
These three simple changes will yield some amazing capabilities. One
is that out-of-band access methods can be negotiated to take advantage
of available topology. For example, if the transmitter and
receiver happen to be on the same machine, a memory-mapped FIFO could
be negotiated to access the transaction stream instead of a pipe or
TCP socket.
Another huge benefit is that by passing bulk data by reference, the
actual transport of the bulk data itself can be handled out-of-band.
For example, if two machines both have access to common SAN storage,
the bulk data could simply be written to the common storage and
referenced via its common storage location instead of passed within
the transaction stream itself. Alternatively, small pieces of data
could simply be embedded directly within the transaction, and
medium-sized pieces (or data for which no other communications mechanism
is available) could be transported over the same stream via separate
transactions.
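In other words the originator's choice of transport for a given piece of
data can boil down to a simple size check. Purely as a sketch, with
made-up thresholds and names:

    /* Hypothetical policy sketch; thresholds and names are invented. */
    #include <stddef.h>

    enum jtx_datapath {
        JTX_DATA_INLINE,    /* small: embed directly in the transaction */
        JTX_DATA_STREAM,    /* medium, or no other channel: send as separate
                             * bite-sized transactions over the same stream */
        JTX_DATA_OOB        /* large: reference only, fetch out-of-band
                             * (common SAN storage, mmap'd FIFO, etc.) */
    };

    static enum jtx_datapath
    jtx_choose_datapath(size_t bytes, int have_oob_channel)
    {
        if (bytes <= 128)
            return (JTX_DATA_INLINE);
        if (!have_oob_channel || bytes <= 64 * 1024)
            return (JTX_DATA_STREAM);
        return (JTX_DATA_OOB);
    }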
A third big benefit is that data need not be transported at all in
many cases, or the transport can be delayed to improve performance.
Take a common operation such as this for example:
machine_C% cp file_on_machine_A file_on_machine_B
Or even more common:
machine_C% cp file_on_machine_A file_on_machine_A
In a clustered environment the execution context (what 'cp' is actually
running on) can be anywhere. But there is absolutely no reason for the
file data to physically pass through that machine if 'cp' itself does
not need to know what the file contains. If done properly, the actual
file data would be transported directly from machine A to machine B,
or stay strictly within machine A in the second example.
For a userland VFS, the kernel would be able to notify the VFS of
the data associated with a write but could otherwise just store the
data in the buffer cache until the VFS decides it actually needs to
retrieve it (to flush it to backing store). Conversely, if the
kernel needs data from the VFS, the VFS can simply pass a cache ID
back to the kernel and the kernel can then access the data via the
cache ID at its leisure... for example, as part of the uiomove() it does
rather than as part of the VOP_READ().
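Just to make the read side concrete, the reply to a read might look
something like the sketch below. The structure and helper names are
invented for illustration only; the point is that the data transfer
happens at copy time, not at VOP_READ() time:

    /*
     * Hypothetical sketch of a read reply from a userland VFS.  Instead of
     * returning the data in the VOP_READ() reply, the VFS returns a cache
     * ID; the kernel resolves the ID later, when it actually performs the
     * uiomove()-equivalent copy to the user.
     */
    #include <stdint.h>
    #include <stddef.h>

    struct vfs_read_reply {
        uint64_t    cacheid;    /* reference to the data, not the data itself */
        uint64_t    bytes;      /* number of bytes the cache ID covers */
    };

    /* Invented callback: how the kernel pulls the bytes when it needs them. */
    typedef int (*cache_fetch_fn)(uint64_t cacheid, void *buf, size_t bytes);

    static int
    deferred_read_copyout(const struct vfs_read_reply *reply,
                          cache_fetch_fn fetch, void *userbuf)
    {
        /* The fetch happens here, at copy time, not at VOP_READ() time. */
        return (fetch(reply->cacheid, userbuf, (size_t)reply->bytes));
    }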
This is what the third item, the caching sub-protocol, would be used
for. Machine A would transmit cache elements by reference to machine C
(NOT the actual data). Machine C would forward those elements BY
REFERENCE to machine B. Machine B would then be able to access the
elements directly from machine A if a more direct path is available.
Caching sub-protocol
The cache element management protocol is really simple in concept.
Simple implementations are possible, though not necessarily advisable.
Still, the fact that the protocol can be implemented as a degenerate
case means that it is extremely flexible.
Whenever the originator wants to send a bit of bulk data by reference,
it needs to keep track of the bulk data in question and assign a
unique cache ID to it. The cache ID is then sent over the data
stream (NOT the data). The target can use this globally unique
reference ID to obtain the cached data at any point in the future after
it receives the ID. When the target is 'finished' with the
ID it sends a de-ref transaction back to the originator, allowing
the originator to throw away the cached data.
Cache data elements are ref-counted. Every time the originator
generates a reference to a cached data element in the transaction stream
it bumps the ref count for that element. Every time the target
disposes of a cached data element it accumulates and sends back a
de-ref count for that element. The originator can throw away the
cached data element only when the ref count returns to 0.
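A minimal sketch of the originator-side bookkeeping, with made-up names
(the real thing would hang off the stream or mount structure):

    /*
     * Hypothetical originator-side cache element.  Every reference placed
     * in the outgoing transaction stream bumps refs; de-ref counts coming
     * back from the target decrement it.
     */
    #include <stdint.h>
    #include <stdlib.h>

    struct cache_elm {
        uint64_t    cacheid;    /* globally unique cache ID */
        int         refs;       /* outstanding references at the target */
        void        *data;      /* the bulk data (assumed malloc'd here) */
        size_t      bytes;
    };

    /* Called each time a reference to the element goes down the stream. */
    static void
    cache_elm_ref(struct cache_elm *elm)
    {
        ++elm->refs;
    }

    /* Called when the return stream de-refs the element 'count' times. */
    static void
    cache_elm_deref(struct cache_elm *elm, int count)
    {
        elm->refs -= count;
        if (elm->refs == 0) {
            free(elm->data);    /* originator may now throw the data away */
            free(elm);
        }
    }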
Since both the originating and target machines implement independent
caches, and since these caches are not infinite in size, the originating
machine needs to be able to force a deref... to 'flush' its cache to make
room for more data. It would do this by sending a ref-count flush
for the cache ID(s) in question to the target machine. The target
would be required to flush that number of refs from the cached data
elements specified, allowing the originating machine to deallocate the
cache element. If the target is not finished with the data in question
it must make a local copy or otherwise ensure that it has a local copy
of the cached data before sending the de-ref back to the originator,
and if the target doesn't have enough cache space itself then it must
flush the related transactional commands to get rid of its dependencies
on the data in question.
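Here is a rough sketch of what the target's flush handling might boil
down to, again with invented names and stub helpers standing in for the
real operations:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct tgt_cache_elm {
        uint64_t    cacheid;
        void        *data;          /* the cached data as seen by the target */
        size_t      bytes;
        int         data_is_owned;  /* non-zero once the target holds its own copy */
        int         pending_cmds;   /* commands still referencing this element */
    };

    /* Stubs standing in for the real return-stream and sync operations. */
    static void send_deref(uint64_t cacheid, int count) { (void)cacheid; (void)count; }
    static void sync_pending_commands(struct tgt_cache_elm *elm) { elm->pending_cmds = 0; }

    static void
    handle_forced_flush(struct tgt_cache_elm *elm, int count, int have_cache_space)
    {
        if (elm->pending_cmds != 0 && !elm->data_is_owned) {
            void *copy = have_cache_space ? malloc(elm->bytes) : NULL;

            if (copy != NULL) {
                /* Keep a private copy so the originator can free its own. */
                memcpy(copy, elm->data, elm->bytes);
                elm->data = copy;
                elm->data_is_owned = 1;
            } else {
                /* No room locally: flush the commands that depend on the data. */
                sync_pending_commands(elm);
            }
        }
        send_deref(elm->cacheid, count);    /* originator may now deallocate */
    }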
If you think about it, this is an extremely simple and straightforward
protocol that nevertheless is extremely powerful in its ability to
optimize the volume of data going over a data stream. Cache management
winds up being a function of both the originator and target, allowing
both to 'size' their caches according to their own limitations.
* The originator can choose to pass by reference, to pass by reference
and also pass the data associated with the data cache ID, or to flush
some or all of its cached data elements in order to make room for
more, albeit by sending a forced 'flush' command down the pipe to
the target.
* The target can choose to throw away anonymously received cache
data elements at any time if its own cache is not large enough to
hold them, since it can always re-request the elements at least until
it sends the de-ref back to the originator.
* The target may be forced to hold a cached data element that the
originator can no longer afford to keep in its own cache, but all
this means is that the target then is forced simply to synchronize
the transactional commands that it had received that reference the
cache. This works no differently than how dirty buffers in our buffer
cache work now.
* The pre-existing journaling protocol had an 'ACK' going along the
return path. This becomes more sophisticated in the new scheme.
In the new scheme, the return stream can contain cache management
commands, de-ref's, and other things.
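Just as a rough idea, the return-path record types might boil down to
something like this (the names are made up and the real set is not
settled yet):

    /* Hypothetical return-stream record types. */
    enum jtx_return_cmd {
        JTX_RET_ACK,        /* transaction acknowledged / committed */
        JTX_RET_DEREF,      /* accumulated de-ref count for a cache ID */
        JTX_RET_FETCH       /* request the data behind a cache ID */
    };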
The new scheme sans caching is actually going to be simpler. Because
transactions will 'fit' into the memory FIFO completely with the new
scheme, the transaction id and the sequence id will become one...
there would ONLY be a transaction id.
* The data stream can be broken and restarted without loss of
information. Both sides generate a sequential transaction id on
transmit and receive acks, allowing both sides to determine when
old data can be thrown away and allowing a connection to be
broken, restarted, and resynchronized.
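The per-direction bookkeeping for that is tiny. A sketch, with invented
field names:

    #include <stdint.h>

    /*
     * Hypothetical per-direction stream state.  Because every transaction
     * carries a sequential id and the peer acks what it has processed, a
     * broken connection can be resynchronized by retransmitting everything
     * after the peer's last acknowledged id.
     */
    struct stream_state {
        uint64_t    next_transid;   /* id to assign to the next transaction sent */
        uint64_t    peer_acked_id;  /* highest id the peer has acknowledged */
    };

    /* On reconnect: the first transaction id that must be retransmitted. */
    static uint64_t
    stream_resync_start(const struct stream_state *ss)
    {
        return (ss->peer_acked_id + 1);
    }

    /* Data at or below the acked id can be thrown away from the local FIFO. */
    static int
    stream_can_discard(const struct stream_state *ss, uint64_t transid)
    {
        return (transid <= ss->peer_acked_id);
    }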
How big should a cache ID be?
So how big should these cache IDs be? Well, there are numerous
ways to represent cache elements in a global system. One could
attempt to represent the actual disk volume and block offset, for
example, along with the machine ID and numerous other pieces of
information. That would need at least 128 bits and probably more.
But since I have already settled on a reference counted scheme for
cached data elements, our cache is going to act more like a computer's
L1/L2 cache. That is, it will have a tag structure (containing at
least the reference count) and be actively managed rather than
passively referenced. The actual data could be stored on-disk or
in-memory or whatever on the originating machine, but the cache ID
itself would represent the cache table entry, not the location of
the actual data.
For us, this means that our 'global' cache identifiers can be 32 or
64 bits. I'll probably use 64 bits. This would give us room for,
say, a 16 bit mount ID, a 24 bit machine ID and a 24 bit 'cache line' ID.
Such IDs would not have to address the data's ultimate storage location,
they would simply have to identify the machine and 'cache line'.
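Packing and unpacking such a 64-bit ID is trivial. A sketch using the
field split mentioned above (the split itself is only an example):

    #include <stdint.h>

    /*
     * Example 64-bit cache ID layout: 16 bit mount ID, 24 bit machine ID,
     * 24 bit 'cache line' ID.  The exact split is illustrative only.
     */
    #define CACHEID_MAKE(mount, machine, line)              \
        ((((uint64_t)(mount)   & 0xFFFFu)   << 48) |        \
         (((uint64_t)(machine) & 0xFFFFFFu) << 24) |        \
          ((uint64_t)(line)    & 0xFFFFFFu))

    #define CACHEID_MOUNT(id)   ((uint16_t)((id) >> 48))
    #define CACHEID_MACHINE(id) ((uint32_t)(((id) >> 24) & 0xFFFFFFu))
    #define CACHEID_LINE(id)    ((uint32_t)((id) & 0xFFFFFFu))

The important property is that the machine field alone is enough to route
a fetch request to the owner of the cache line; where that machine
actually keeps the data is its own business.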
In particular if we consider a userland VFS (one of the mounts within a
machine), we want the userland VFS to be able to cache a piece of data
and send the related cache ID out into the ether. That cache ID could
end up ANYWHERE within the cluster without an actual data transfer
occurring until the eventual target decides it needs to see the actual
data. If you think of the data as a buffer cache buffer, the actual
target might not need to see it until it actually decides to write
it out to its own backing store. This gives us an unbelievable amount
of flexibility in the transport of data within a cluster or even
just within a single machine. The feature will be especially useful
when we wrap demand-paged virtual memory around it.
Considerations in a clustered environment
Can such a caching scheme be used in a clustered environment? Well, that
is my goal so the answer is yes! If the cache ID embeds the machine ID
then each individual machine in the cluster can generate and manage
unique cache IDs without having to talk to other machines in the
cluster, and the data related to any given cache ID can always be fetched
using just the cache ID since the machine that owns that ID would be
parsed from the ID itself, thus allowing a cache ID to be forwarded to
any machine in the cluster and yet still represent the same data.
The problem we face in allowing this sort of global access is how to
track active cache elements so an originating machine can 'flush' its
cache efficiently or, even more to the point, how to allow all the copies
of the cached data to operate within a cache coherency management scheme.
If a bit of cached data is passed by reference from A to B, and B then
forwards that bit of data to C, A has a problem in that it knows how
many references there are to the cache item, but it does not know
which machines are holding those references. How then does A 'flush'
that cache element in a 1000-machine cluster? The clustering topology
is dynamic, so A cannot simply send a flush command to B and have B
forward it to C. It would be possible for C to inform A that it now
has a reference, and for A to cache this information, but this doesn't
really scale to situations where, say, 100 machines might be holding a
copy of a single cache ID. At some point the cluster could resort to
a broadcast.
In any case, it is a hard problem to solve efficiently, but at least
less efficient means exist (e.g. a broadcast to all machines in the
cluster via the spanning tree) to do the required operation, so finding
a solution is a matter of efficiency and not a show stopper.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>