The Clustering and Userland VFS transport protocol - summary

Matthew Dillon dillon at apollo.backplane.com
Wed May 10 21:58:27 PDT 2006


    I have decided to use the journaling protocol's stream-based data
    structures, with some modification, as the basis for communication
    between machines in a cluster as well as the basis for communications
    between a userland VFS implementation and the kernel.

    I've written up a fairly long summary of my intent, so people have
    an idea where I'm going with all of this :-).

    For anyone who hasn't read up on our journaling protocol, it is
    a highly flexible and dynamic recursive transaction-oriented data
    structure that can be represented over a data stream using a fixed-sized
    memory FIFO to buffer the stream.  Streams are always two-way.  The
    basic journaling protocol has a primary data stream in one direction
    and an 'ack' stream in the other.  What I intend to do is expand this
    to allow transactions to occur in both directions independently,
    creating a full duplex transport link between two entities.

    In addition to making the link fully symmetric, some additional
    modifications will have to be made to the journaling data structures
    to make it work well for userland VFS or clustering links.

    The basic problem is that the current data structure is just a bit too
    flexible (see sys/journal.h).  The size of the transaction is not
    really known until the entire transaction has been parsed, and that
    creates a serious problem for the receiver.  The core of the problem
    is that data embedded within the transaction can be huge... a single 
    transaction involving a 1GB write() would have to embed 1GB of data, 
    for example.  While the data is broken up into bite-sized pieces by
    the protocol, the receiver still has to process an unbounded amount
    of data just to parse the transaction.  The current 'jscan'
    user utility malloc()'s memory to hold the entire transaction
    and that clearly does not scale when a transaction can literally be
    gigabytes long.

    So I am going to make a few changes to the journaling stream
    structures before I really get into writing the userland VFS.  These
    changes will allow the protocol to be used not only for the userland
    VFS interface, but also for the data streams that will eventually
    link machines within a cluster, and of course for the existing
    journaling feature.

    * Transaction records will be limited in size, allowing the entire
      transaction sans data to be efficiently parsed directly out of a
      memory-FIFO whose size is pre-negotiated (a rough sketch of such a
      record follows this list).

    * Data elements will be passed by reference and the data itself, if
      too large, will then be passed as bite-sized cache elements in
      separate transactions or passed out of band via some other negotiated 
      mechanism.

    * The addition of a data element caching sub-protocol between the
      originator and target.
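
    As a rough illustration of the first two points, a size-bounded
    transaction record might carry a small bounded array of data
    references instead of the bulk data itself.  This is only a sketch
    under naming assumptions of my own, not anything that exists in
    sys/journal.h today:

        #include <stdint.h>

        /*
         * Hypothetical size-bounded transaction record.  Bulk data is
         * never embedded; it is referred to by a 64 bit cache ID and
         * fetched separately (in-stream in bite-sized pieces, or out
         * of band).
         */
        #define JSTREAM_MAX_RECSIZE  4096   /* assumed negotiated cap */

        struct jstream_dataref {
                uint64_t  cacheid;   /* globally unique cache element ID */
                uint64_t  bytes;     /* size of the referenced bulk data */
        };

        struct jstream_record {
                uint32_t  rectype;   /* e.g. write, setattr, cache ctrl */
                uint32_t  recsize;   /* total size <= JSTREAM_MAX_RECSIZE */
                uint64_t  transid;   /* sequential transaction id */
                uint32_t  ndatarefs; /* number of entries in datarefs[] */
                uint32_t  reserved;
                struct jstream_dataref datarefs[]; /* bounded by recsize */
        };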

    These three simple changes will yield some amazing capabilities.  One
    is that out of band access methods can be negotiated to take advantage
    of available topology.  For example, if the transmitter and
    receiver happen to be on the same machine, a memory-mapped FIFO could
    be negotiated to access the transaction stream instead of a pipe or
    TCP socket.

    Another huge benefit is that by passing bulk data by reference, the
    actual transport of the bulk data itself can be handled out-of-band.
    For example, if two machines both have access to common SAN storage,
    the bulk data could simply be written to the common storage and
    referenced via its common storage location instead of passed within
    the transaction stream itself.  Alternatively, small bits of data
    could simply be embedded within the transaction, while medium-sized
    pieces (or any data, when no other communications mechanism is
    available) could be transported over the same stream via separate
    transactions.
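
    To make that concrete, the choice of transport for a given piece
    of bulk data might boil down to something like the sketch below.
    The size threshold and the notion of a pre-negotiated out-of-band
    channel are assumptions for illustration only:

        #include <stddef.h>

        /* Hypothetical selection of a bulk data transport method. */
        enum bulk_method {
                BULK_EMBEDDED,      /* small: embed in the transaction */
                BULK_INSTREAM,      /* medium, or no other channel:
                                       separate bite-sized transactions
                                       on the same stream */
                BULK_OUT_OF_BAND    /* large: reference to shared SAN
                                       storage, a memory-mapped FIFO,
                                       etc. */
        };

        static enum bulk_method
        choose_bulk_method(size_t bytes, int have_oob_channel)
        {
                if (bytes <= 512)               /* made-up threshold */
                        return (BULK_EMBEDDED);
                if (have_oob_channel)
                        return (BULK_OUT_OF_BAND);
                return (BULK_INSTREAM);
        }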

    A third big benefit is that data need not be transported at all in
    many cases, or the transport can be delayed to improve performance.
    Take a common operation such as this for example:

	machine_C% cp file_on_machine_A file_on_machine_B

    Or even more common:

	machine_C% cp file_on_machine_A file_on_machine_A

    In a clustered environment the execution context (what 'cp' is actually
    running on) can be anywhere. But there is absolutely no reason for the
    file data to physically pass through that machine if 'cp' itself does
    not need to know what the file contains.  If done properly, the actual
    file data would be transported directly from machine A to machine B,
    or stay strictly within machine A in the second example.

    For a userland VFS, the kernel would be able to notify the VFS of
    the data associated with a write but could otherwise just store the
    data in the buffer cache until the VFS decides it actually needs to
    retrieve it (to flush it to backing store).  Conversely, if the
    kernel needs data from the VFS, the VFS can simply pass a cache ID
    back to the kernel and the kernel can then access the data via the
    cache ID at its leisure... as part of the uiomove() it does for example 
    rather than as part of the VOP_READ().
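
    A minimal sketch of that read direction (the VFS handing back a
    cache ID which the kernel resolves later), with every name invented
    purely for illustration:

        #include <stdint.h>
        #include <stddef.h>

        /* Assumed interfaces; how they are implemented is up to the
         * userland VFS and the kernel-side cache. */
        uint64_t vfs_read_cacheid(void *vp, uint64_t off, size_t len);
        void    *cache_resolve(uint64_t cacheid);  /* fetch the data */
        void     cache_deref(uint64_t cacheid);    /* de-ref to the VFS */

        static void
        kernel_read_example(void *vp, uint64_t off, size_t len)
        {
                uint64_t id;
                void *data;

                /* Cheap: the VFS hands back a cache ID, no data copy. */
                id = vfs_read_cacheid(vp, off, len);

                /* ...an arbitrary amount of time may pass... */

                /* The data is fetched only when it is actually needed,
                 * i.e. at the uiomove() step rather than at VOP_READ(). */
                data = cache_resolve(id);
                (void)data;          /* copy out to the caller here */
                cache_deref(id);     /* lets the VFS drop its copy */
        }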

    This is what the third item, the caching sub-protocol, would be used
    for.  Machine A would transmit cache elements by reference to machine C
    (NOT the actual data).  Machine C would forward those elements BY 
    REFERENCE to machine B.  Machine B would then be able to access the
    elements directly from machine A if a more direct path is available.

				Caching sub-protocol

    The cache element management protocol is really simple in concept.
    Simple implementations are possible, though not necessarily advisable.
    Still, the fact that the protocol can be implemented as a degenerate
    case means that it is extremely flexible.

    Whenever the originator wants to send a bit of bulk data by reference,
    it needs to keep track of the bulk data in question and assign a
    unique cache ID to it.  The cache ID is then sent over the data
    stream (NOT the data).  The target can use this globally unique
    reference ID to obtain the cached data at any point in the future after
    it receives the ID.   When the target is 'finished' with the
    ID it sends a de-ref transaction back to the originator, allowing
    the originator to throw away the cached data.

    Cache data elements are ref-counted.  Every time the originator 
    generates a reference to a cached data element in the transaction stream
    it bumps the ref count for that element.  Every time the target 
    disposes of a cached data element it accumulates and sends back a
    de-ref count for that element.  The originator can throw away the
    cached data element only when the ref count returns to 0.
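
    A minimal sketch of the originator-side bookkeeping, using names
    and fields that are assumptions rather than real structures:

        #include <stdint.h>
        #include <stdlib.h>

        /*
         * Hypothetical originator-side cache element.  It may only be
         * freed once every reference placed on the stream has been
         * de-ref'd by the target.
         */
        struct cache_elm {
                uint64_t  cacheid;   /* globally unique ID */
                int       refs;      /* references outstanding */
                void     *data;      /* the cached bulk data */
                size_t    bytes;
        };

        /* Called each time a reference to the element goes out on the
         * transaction stream. */
        static void
        cache_elm_ref(struct cache_elm *ce)
        {
                ++ce->refs;
        }

        /* Called when a (possibly accumulated) de-ref count comes back
         * from the target. */
        static void
        cache_elm_deref(struct cache_elm *ce, int count)
        {
                ce->refs -= count;
                if (ce->refs == 0) {
                        free(ce->data);   /* no outstanding references */
                        free(ce);
                }
        }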

    Since both the originating and target machines implement independent
    caches, and since these caches are not infinite in size, the originating
    machine needs to be able to force a deref... to 'flush' its cache to make
    room for more data.  It would do this by sending a ref-count flush
    for the cache ID(s) in question to the target machine.  The target 
    would be required to flush that number of refs from the cached data
    elements specified, allowing the originating machine to deallocate the
    cache element.  If the target is not finished with the data in question
    it must make a local copy or otherwise ensure that it has a local copy
    of the cached data before sending the de-ref back to the originator, 
    and if the target doesn't have enough cache space itself then it must
    flush the related transactional commands to get rid of its dependencies
    on the data in question.
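
    On the target side, handling such a forced flush might look roughly
    like the sketch below.  All names are invented and the helpers are
    stubbed out; the 'sync deps' stub stands in for flushing whatever
    transactional commands still depend on the data:

        #include <stdint.h>
        #include <stddef.h>

        struct tcache_elm {
                uint64_t  cacheid;
                int       still_needed;  /* data backs pending commands */
                size_t    bytes;
        };

        /* Stubs standing in for real cache/VFS machinery. */
        static int  tcache_has_room(size_t n) { (void)n; return (1); }
        static void tcache_copy_local(struct tcache_elm *te) { (void)te; }
        static void tcache_sync_deps(struct tcache_elm *te) { (void)te; }
        static void send_deref(uint64_t id, int n) { (void)id; (void)n; }

        /* The originator has asked us to give back 'count' refs. */
        static void
        target_handle_flush(struct tcache_elm *te, int count)
        {
                if (te->still_needed) {
                        /* Keep a private copy if we can, otherwise flush
                         * the commands that still reference the data. */
                        if (tcache_has_room(te->bytes))
                                tcache_copy_local(te);
                        else
                                tcache_sync_deps(te);
                }
                send_deref(te->cacheid, count); /* originator may free it */
        }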

    If you think about it, this is an extremely simple and straightforward
    protocol that nevertheless is extremely powerful in its ability to 
    optimize the volume of data going over a data stream.  Cache management
    winds up being a function of both the originator and target, allowing
    both to 'size' their caches according to their own limitations.

    * The originator can choose to pass by reference, to pass by reference
      and also pass the data associated with the data cache ID, or to flush
      some or all of its cached data elements in order to make room for
      more, the last by sending a forced 'flush' command down the pipe to
      the target.

    * The target can choose to throw away anonymously received cache
      data elements at any time if its own cache is not large enough to 
      hold them, since it can always re-request the elements at least until
      it sends the de-ref back to the originator.

    * The target may be forced to hold a cached data element that the
      originator can no longer afford to keep in its own cache, but all
      this means is that the target is then simply forced to synchronize
      the transactional commands it has received that reference the
      cache.  This works no differently than how dirty buffers in our
      buffer cache work now.

    * The pre-existing journaling protocol had an 'ACK' going along the
      return path.  This becomes more sophisticated in the new scheme,
      where the return stream can contain cache management commands,
      de-refs, and other things.

      The new scheme sans caching is actually going to be simpler.  Because
      transactions will 'fit' into the memory FIFO completely with the new
      scheme, the transaction id and the sequence id will become one... 
      there would ONLY be a transaction id.

    * The data stream can be broken and restarted without loss of 
      information.  Both sides generate a sequential transaction id on
      transmit and receive acks, allowing both sides to determine when
      old data can be thrown away and allowing a connection to be 
      broken, restarted, and resynchronized.
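
    A sketch of the per-direction state that last point implies, again
    with invented names:

        #include <stdint.h>

        /*
         * Hypothetical per-direction resynchronization state.  Each side
         * keeps the transactions it has sent but not yet seen acked; on
         * a reconnect the peer reports the last transid it received and
         * the sender replays everything after that.
         */
        struct stream_txstate {
                uint64_t  next_transid;   /* id of the next record sent */
                uint64_t  acked_transid;  /* highest id acked by the peer */
        };

        /* A locally buffered record may be discarded once acked. */
        static int
        tx_can_discard(const struct stream_txstate *st, uint64_t transid)
        {
                return (transid <= st->acked_transid);
        }

        /* After a reconnect, the first transid that must be replayed. */
        static uint64_t
        tx_resync_start(uint64_t peer_last_received)
        {
                return (peer_last_received + 1);
        }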

			    How big should a cache ID be?

    So how big should these cache IDs be?  Well, there are numerous
    ways to represent cache elements in a global system.  One could
    attempt to represent the actual disk volume and block offset, for
    example, along with the machine ID and numerous other pieces of
    information.  That would need at least 128 bits and probably more.

    But since I have already settled on a reference counted scheme for
    cached data elements, our cache is going to act more like a computer's
    L1/L2 cache.  That is, it will have a tag structure (containing at 
    least the reference count) and be actively managed rather than
    passively referenced.  The actual data could be stored on-disk or
    in-memory or whatever on the originating machine, but the cache ID
    itself would represent the cache table entry, not the location of
    the actual data.

    For us, this means that our 'global' cache identifiers can be 32 or
    64 bits.  I'll probably use 64 bits.  This would give us room for,
    say, a 16 bit mount ID, a 24 bit machine ID and a 24 bit 'cache line' ID.
    Such IDs would not have to address the data's ultimate storage location;
    they would simply have to identify the machine and 'cache line'. 
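
    As a sketch of how such a 64 bit ID might be packed (the field
    split is the one floated above; nothing here is a committed format):

        #include <stdint.h>

        /* 16 bit mount ID | 24 bit machine ID | 24 bit cache line ID */
        static uint64_t
        cacheid_make(uint32_t mount, uint32_t mach, uint32_t line)
        {
                return (((uint64_t)mount << 48) |   /* 16 bit mount ID   */
                        ((uint64_t)mach  << 24) |   /* 24 bit machine ID */
                        (uint64_t)line);            /* 24 bit cache line */
        }

        /* The owning machine can be parsed straight out of the ID, which
         * is what lets any node forward the ID and still locate the data. */
        static uint32_t
        cacheid_machine(uint64_t cacheid)
        {
                return ((uint32_t)(cacheid >> 24) & 0xffffffu);
        }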

    In particular if we consider a userland VFS (one of the mounts within a
    machine), we want the userland VFS to be able to cache a piece of data
    and send the related cache ID out into the ether.  That cache ID could
    end up ANYWHERE within the cluster without an actual data transfer
    occurring until the eventual target decides it needs to see the actual
    data.  If you think of the data as a buffer cache buffer, the actual
    target might not need to see it until it actually decides to write
    it out to its own backing store.  This gives us an unbelievable amount
    of flexibility in the transport of data within a cluster or even
    just within a single machine.  The feature will be especially useful
    when we wrap demand-paged virtual memory around it.

		    Considerations in a clustered environment

    Can such a caching scheme be used in a clustered environment?  Well, that
    is my goal so the answer is yes!  If the cache ID embeds the machine ID
    then each individual machine in the cluster can generate and manage
    unique cache IDs without having to talk to other machines in the
    cluster.  The data related to any given cache ID can always be fetched
    using just the cache ID, since the machine that owns it can be parsed
    from the ID itself; this allows a cache ID to be forwarded to any
    machine in the cluster and yet still represent the same data.

    The problem we face in allowing this sort of global access is how to
    track active cache elements so an originating machine can 'flush' its
    cache efficiently or, even more to the point, to allow all the copies of
    the cache to operate within a cache coherency management scheme.

    If a bit of cached data is passed by reference from A to B, and B then
    forwards that bit of data to C, A has a problem in that it knows how
    many references there are to the cache item, but it does not know 
    which machines are holding those references.  How then does A 'flush'
    that cache element in a 1000-machine cluster?  The clustering topology
    is dynamic, so A cannot simply send a flush command to B and have B
    forward it to C.  It would be possible for C to inform A that it now
    has a reference, and for A to cache this information, but this doesn't
    really scale to situations where, say, 100 machines might be holding a
    copy of a single cache ID.  At some point the cluster could resort to
    a broadcast.

    In any case, it is a hard problem to solve efficiently, but at least
    there are less efficient means (e.g. broadcast to all machines in the
    cluster via spanning tree) to do the required operation, so finding a
    solution is a matter of efficiency and not a show stopper.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>




