The Clustering and Userland VFS transport protocol - summary
Antonio Vargas
windenntw at gmail.com
Fri May 12 03:30:57 PDT 2006
On 5/11/06, Matthew Dillon <dillon at xxxxxxxxxxxxxxxxxxxx> wrote:
:Matthew Dillon wrote:
:> In a clustered environment the execution context (what 'cp' is actually
:> running on) can be anywhere. But there is absolutely no reason for the
:> file data to physically pass through that machine if 'cp' itself does
:> not need to know what the file contains. If done properly, the actual
:> file data would be transported directly from machine A to machine B,
:> or stay strictly within machine A in the second example.
:
:Are such operations going to be exposed through system calls? In other
:words, does this mean that userland utilities will need to be modified
:to fully support (efficiently) this type of copy by reference?
No. Only userland programs acting as data sources or data sinks
via the protocol (i.e. a userland VFS or cluster-related processes).
Something like 'cp' would just use read() and write() or mmap() and
write(). The key to making something like this work with 'cp' is
that the kernel would not instantiate the VM pages backing the
buffer being read into or the VM pages backing the memory map. So
it would be possible for 'cp' to read/mmap and write without ever
touching the actual data.
That's just an example. It would be fairly complex to actually make
it work with something like read(), but the mmap/write combination is
far more achievable since VM objects are already hierarchical and
would be fairly easy to 'back' with a cache line ID.
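To make that concrete, here is roughly what the mmap/write path looks
like from userland today. Nothing in it is protocol-specific; the
cache-line-ID backing would happen entirely inside the kernel, behind
the mapping. A minimal cp-style sketch (error handling trimmed to err()):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        struct stat st;
        void *base;
        int infd, outfd;

        if (argc != 3)
            errx(1, "usage: %s src dst", argv[0]);
        if ((infd = open(argv[1], O_RDONLY)) < 0 || fstat(infd, &st) < 0)
            err(1, "%s", argv[1]);
        if ((outfd = open(argv[2], O_WRONLY|O_CREAT|O_TRUNC, 0644)) < 0)
            err(1, "%s", argv[2]);
        if (st.st_size > 0) {
            /*
             * Map the source.  Under the scheme described above the
             * kernel could back this VM object with a cache line ID
             * instead of instantiating the pages locally.
             */
            base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED,
                infd, 0);
            if (base == MAP_FAILED)
                err(1, "mmap");
            /*
             * write() sources directly from the mapping.  A real cp
             * would loop on partial writes; omitted for brevity.
             */
            if (write(outfd, base, st.st_size) != st.st_size)
                err(1, "write");
            munmap(base, st.st_size);
        }
        close(infd);
        close(outfd);
        return (0);
    }

The point is that the program itself does not change at all; the
optimization lives entirely below the system call boundary.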
:What level of transactional support will be provided? For example, will
:the cp utility return before or after the data itself is made durable?
:Will it be possible for the cp utility to complete successfully, have
:the node containing the referenced cache data fail and thus the
:transaction fail after the fact?
That would be up to the utility. 'cp' doesn't guarantee that an
operation is made durable even now, since most of the data winds up
in the buffer cache. Userland would have to perform a sync or fsync
of some sort to make the data durable.
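As an illustration, a utility that wanted durability would do something
like the following after the copy. make_durable() is a hypothetical
helper, not anything that exists; flushing the directory as well covers
the newly created name:

    #include <err.h>
    #include <fcntl.h>
    #include <libgen.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: flush the destination's data, then its
     * directory, before reporting the copy as successful.
     */
    static void
    make_durable(int fd, const char *path)
    {
        char dir[1024];
        int dfd;

        if (fsync(fd) < 0)
            err(1, "fsync");
        strlcpy(dir, path, sizeof(dir));
        if ((dfd = open(dirname(dir), O_RDONLY)) < 0)
            err(1, "open %s", dir);
        if (fsync(dfd) < 0)
            err(1, "fsync dir");
        close(dfd);
    }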
:What are the error recovery/failure scenarios in the case that a node
:with the only copy of referenced cached data fails?
:
:Best of luck with your work, and thank you!
:- Jason
This falls into the category of 'complex issues that one has to deal
with to make a clustering system robust'. It is less of a problem
within a machine, even with a userland process (aka a userland VFS)
acting as the data source. If the data is unrecoverably lost then
whatever is trying to access it would probably end up having to seg-fault.
The key is to manage the state of the cache ID based on the needs of
the holder.
So, for example, let's say you have two userland VFS mounts and you
are copying data from one to the other via the kernel. UVFS(A) passes
a cache ID to the kernel, which forwards it to UVFS(B). Several things
can happen asynchronously (a rough message sketch follows the list):
* UVFS(A) can decide that it has to flush the data. It sends a cache
flush to the kernel which forwards it to UVFS(B), forcing UVFS(B)
to read the actual data from UVFS(A) and then de-ref UVFS(A)'s cache
ID semi-synchronously.
* UVFS(B) can decide to instantiate its own copy of the cached data.
It would send a cache ID read command to UVFS(A) to get the data and
then de-ref UVFS(A)'s cache ID.
* Either UVFS(A) or UVFS(B) could decide that these things need to be
done as part of the operation that originally requested the data,
making the data access effectively synchronous. A failure would
cause the original operation to fail (versus a cache ID getting
forwarded through many subsystems and the failure occurring at some
later time in a seemingly unrelated subsystem).
* If UVFS(A) crashes and burns, the data is not necessarily lost. For
example, if UVFS(A) seg-faults and generates a core, the data would
still be retrievable after restart. If the cache line IDs represent
data in backing store that UVFS(A) hasn't itself read yet, then the
data would also still be retrievable.
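To make those flows concrete, a rough sketch of what the cache-ID
messages might look like. Every name here is made up for illustration;
the real protocol is still being designed:

    #include <stdint.h>

    typedef uint64_t cache_id_t;

    /* Hypothetical operations matching the bullets above. */
    enum uvfs_cache_op {
        UVFS_CACHE_PASS,    /* A -> kernel -> B: here is a cache ID */
        UVFS_CACHE_FLUSH,   /* A must flush; B reads the data and
                             * then de-refs A's cache ID            */
        UVFS_CACHE_READ,    /* B pulls the actual data behind an ID */
        UVFS_CACHE_DEREF    /* B drops its reference to A's ID      */
    };

    struct uvfs_cache_msg {
        enum uvfs_cache_op  op;
        cache_id_t          id;      /* which cache line            */
        uint64_t            offset;  /* byte range within the line  */
        uint64_t            length;
        uint32_t            flags;   /* e.g. "complete synchronously
                                      * as part of the originating
                                      * operation"                  */
    };

The synchronous variant in the third bullet would then just be a flag
on the message rather than a separate operation.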
If one expands this to a clustered system, and one assumes that the
cache line data is not recoverable if a machine crashes, then the issue
becomes one of redundancy.
The data represented by a cache line ID as described in my original
posting *CAN* be cached by multiple machines in the cluster. The
capability is there, but the algorithms to use this feature effectively
are probably going to end up being fairly complex. They are as-yet
unresearched.
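Purely as an illustration of the bookkeeping such redundancy implies
(none of this exists), a node tracking which machines hold a replica
of a cache line could decide whether the data survives a crash:

    #include <stdint.h>

    struct cache_line_state {
        uint64_t    id;       /* the cache line ID                  */
        uint64_t    holders;  /* bit N set => node N holds the data */
        int         dirty;    /* modified relative to backing store */
    };

    /*
     * The data survives a crash of 'node' if another replica exists,
     * or if the line is clean and can be re-read from backing store.
     */
    static int
    survives_crash(const struct cache_line_state *cl, int node)
    {
        uint64_t others = cl->holders & ~((uint64_t)1 << node);

        return (others != 0 || cl->dirty == 0);
    }

The hard, as-yet-unresearched part is policy: when to replicate a
dirty line, to how many nodes, and who coordinates the de-refs.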
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>
Unresearched... Matt, didn't you have a distributed multi-master
database for Backplane? If we just consider each memory page a row in
a table, then the same algorithms used for the database could be used
for memory coherency; though the actual implementation may be more
complex due to page faults and related stuff.
--
Greetz, Antonio Vargas aka winden of network
http://wind.codepixel.com/
windNOenSPAMntw at xxxxxxxxx
thesameasabove at xxxxxxxxxxxxx
Every day, every year
you have to work
you have to study
you have to scene.