Reviving userland LWKT

Matthew Dillon dillon at
Wed Jul 12 16:36:16 PDT 2006

:In actuality that is very similar to ideas I've been throwing around
:regarding the messaging implementation. I'm a big fan of TLV-based
:protocols, and the idea just recently occurred to me also to have the
:lower-layered protocols interpret the leftmost X bits of the type field
:to decide how to translate the message for transmission while the
:detailed format of the message remains opaque. Fantastic!

    Yes, exactly!  And we need 64 bits for that (at least) in order to be
    able to embed the logical host id, subsystem id, object index, 
    possibly also a boot counter or timestamp of some sort to properly
    detect stale objects (for inter-machine communications or when the
    link gets lost and later reconnects), maybe a few bits to identify
    the structural type (vnode, VM object, descriptor, or something more
    opaque), etc.
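
    A rough sketch of how such a 64-bit linkid could be carved up.  The
    field names and widths below are assumptions chosen for illustration
    (the only constraint stated above is "at least 64 bits"); is_stale()
    shows how a boot counter would let a receiver reject references that
    predate a reboot or reconnect.

```python
# Hypothetical layout, most significant field first: (name, width in bits).
FIELDS = [
    ("host", 16),    # logical host id
    ("subsys", 8),   # subsystem id
    ("type", 4),     # structural type (vnode, VM object, descriptor, ...)
    ("boot", 12),    # boot counter used to detect stale objects
    ("index", 24),   # object index
]
assert sum(width for _, width in FIELDS) == 64


def pack_linkid(**values):
    """Combine the named fields into one 64-bit integer."""
    linkid = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        linkid = (linkid << width) | value
    return linkid


def unpack_linkid(linkid):
    """Split a 64-bit linkid back into its named fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = linkid & ((1 << width) - 1)
        linkid >>= width
    return out


def is_stale(linkid, current_boot):
    """A reference minted under a different boot counter is stale."""
    return unpack_linkid(linkid)["boot"] != current_boot
```

    Lower layers could route on the leading fields alone (host, subsys)
    without looking at the rest of the id.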

    The data isn't quite opaque, however.  It can't be.  I'll address
    that down below.

:It makes a lot of sense actually -- I will take a closer look at the
:details. I did not have VFS as a particular application in mind; I
:wanted the protocol to be general, but VFS is certainly applicable.
:My big question for you is why do you mention this idea as alternative
:to the idea of implementing LWKT? It seems like LWKT would be a
:necessary platform to build this kind of service on. What would the
:userland VFS drivers use for m:n threading and asynchronous requests?
: -Eric

    We can still implement LWKT, but LWKT can't be the lowest layer 
    because it isn't a generic transport.  An LWKT message might contain
    pointers to other structures, pointers to strings, etc... you can't
    just bcopy() it into a buffer and transmit it to another host.

    A stream or memory FIFO is a generic transport.  I think it is
    very important to define it at the lowest layer (e.g. the 
    streaming/memory-FIFO interface) because that is the layer where we
    can really tune the system for performance.  LWKT messages may be a
    good abstraction for userland or even for the kernel, but they need
    to be translated in the transport layer.  NOTE that, of course,
    if the transport layer is just passing the message between two 
    threads or something like that, then no translation would be 
    required.  But in order to make this generic there has to be
    a translation layer of some sort (even if it is a NOP in some cases).
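
    To illustrate the translation step, here is a minimal sketch.  The
    OpenMsg class, the function names and the byte layout are all
    hypothetical; the point is only that the local path hands over the
    object untouched (the NOP case) while the stream path has to copy
    whatever the message points at into the byte stream.

```python
import struct


class OpenMsg:
    """In-memory message.  'path' refers to another object, much like
    a pointer inside an LWKT message would."""

    def __init__(self, msgid, path):
        self.msgid = msgid
        self.path = path


def translate_local(msg):
    # Same address space: pass the message itself, no translation.
    return msg


def translate_stream(msg):
    # Different address space: the referenced string is copied inline.
    data = msg.path.encode("utf-8")
    return struct.pack(">IH", msg.msgid, len(data)) + data


def from_stream(buf):
    # Rebuild an in-memory message on the receiving side.
    msgid, length = struct.unpack_from(">IH", buf, 0)
    return OpenMsg(msgid, buf[6:6 + length].decode("utf-8"))
```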

    More on data opaqueness.  We can't *quite* make the data opaque.  My
    original posting noted some reserved bits:

    msg {
        linkid     (64 bits)    (specifies the communications end point)
        msgid      (32 bits)    (allows parallel commands to be issued)
        command    (16 bits)    (bit 15 indicates a response)
                                (this field is also the error code on response)
        length     (16 bits)
        item {
                itemid  (16 bits)       (bit 15 indicates item recursion)
                                        (bit 14 indicates ref'd data)
                itemlen (16 bits)
                data[]                  (recursive item if item recursion)
        }
    }

    bit 15 in the command, to indicate a command or response for the msgid,
    is pretty obvious.  Each msgid represents a single transaction, and
    clearly we need to be able to have multiple transactions running in
    parallel on any given object (linkid), hence they are separate fields.
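
    A sketch of the fixed header and the command/response pairing.  The
    byte order (big-endian), the meaning of 'length' (payload bytes
    following the header) and the CMD_OPEN value are assumptions; only
    the field widths and the meaning of bit 15 come from the layout
    above.

```python
import struct

# linkid (64), msgid (32), command (16), length (16).
# Big-endian and "length = payload bytes" are assumptions.
HDR = struct.Struct(">QIHH")
RESPONSE = 0x8000          # bit 15 of the command field
CMD_OPEN = 0x0001          # hypothetical command number


def pack_msg(linkid, msgid, command, payload=b""):
    return HDR.pack(linkid, msgid, command, len(payload)) + payload


def unpack_msg(buf):
    linkid, msgid, command, length = HDR.unpack_from(buf, 0)
    payload = buf[HDR.size:HDR.size + length]
    return linkid, msgid, command, payload


def make_response(request, error=0, payload=b""):
    """Answer a request: same linkid and msgid, bit 15 set, and the
    low 15 bits reused as the error code."""
    linkid, msgid, _command, _payload = unpack_msg(request)
    return pack_msg(linkid, msgid, RESPONSE | (error & 0x7FFF), payload)


def match_response(pending, response):
    """Pop the request that this response completes from `pending`,
    a dict keyed by (linkid, msgid)."""
    linkid, msgid, command, payload = unpack_msg(response)
    if not command & RESPONSE:
        raise ValueError("not a response")
    return pending.pop((linkid, msgid)), command & 0x7FFF, payload
```

    Keying the pending table on (linkid, msgid) is what lets several
    transactions run against the same object and complete in any order.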

    But let's look at bits 14 and 15 in the itemid.  The item { } can be a
    recursive structure.  If bit 15 is set in the itemid then the data[]
    consists of zero or more (recursive) item { }'s.  Otherwise it indicates
    relatively opaque data.  That part is fairly obvious too.

    But bit 14 is not so obvious.   This protocol is going to be used to
    pass all sorts of object references around.  An object reference
    is just a 'linkid', but the protocol needs to be able to identify
    which data elements are linkid's in order to properly keep track of
    them.  In particular, in order to track a reference count for them.
    (If bit 15 and bit 14 are both set, it indicates that there is at
    least one linkid somewhere in the recursive sub-tree.  If just bit
    14 is set, it indicates that the data[] represents a linkid.)
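
    The two flag bits can be sketched as an encoder.  Assumptions here:
    big-endian fields, itemlen counting only the bytes of data[], and
    the low 14 bits of itemid carrying the actual item number.  The
    container() helper propagates bit 14 upward whenever a child
    carries it, which is the "at least one linkid somewhere in the
    sub-tree" rule.

```python
import struct

RECURSE = 0x8000   # bit 15: data[] is a sequence of nested items
REF = 0x4000       # bit 14: data[] is a linkid (or, with bit 15,
                   #         the sub-tree contains at least one)
ITEM_HDR = struct.Struct(">HH")   # itemid, itemlen (byte order assumed)


def leaf(itemid, data):
    """Plain, relatively opaque data."""
    assert itemid < REF, "item number must fit in the low 14 bits"
    return ITEM_HDR.pack(itemid, len(data)) + data


def linkid_item(itemid, linkid):
    """An object reference: bit 14 set, data[] is the 64-bit linkid."""
    assert itemid < REF, "item number must fit in the low 14 bits"
    return ITEM_HDR.pack(itemid | REF, 8) + struct.pack(">Q", linkid)


def container(itemid, children):
    """A recursive item; children are already-encoded items."""
    assert itemid < REF, "item number must fit in the low 14 bits"
    flags = RECURSE
    for child in children:
        (child_id,) = struct.unpack_from(">H", child, 0)
        if child_id & REF:
            flags |= REF       # a reference lives somewhere below
    body = b"".join(children)
    return ITEM_HDR.pack(itemid | flags, len(body)) + body
```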

    Here's an example:

    client sends CMD=OPEN DATA="a/b/c"
    server responds LINKID(bit14set) DATA=<linkid_of_open_file>

    In this example the server returns an object reference to the client,
    a linkid representing the open file.  Clearly this has to be tracked
    so the server knows when it can destroy the object (vnode) represented
    by the linkid.

    Now normally you might think that, ok, well, this could be tracked in
    higher layers.  But it actually has to be tracked by the transport layer
    as well as higher layers for two reasons:

    (1) Because the transport layer, or some layer just above it (but below
    the API/VFS-interface layer/whatever)... that layer needs to deal with
    disconnects and reconnects.  It needs to deal with resynchronization as
    well.  In short, some level of robustness.

    (2) Because we are using a flexible recursive data structure, and
    because the client and server may be running different versions of
    a particular command, one end of the communications protocol might not
    be able to completely parse a message sent by the other end.  If
    a message cannot be completely parsed, the highest layer (i.e. the
    code implementing 'open' or 'read') might 'miss' an object reference
    that is passed to it.

    For example, let's say we have a UNIX box and an APPLE box talking to
    each other and the UNIX box sends a cmd=OPEN request and the APPLE
    box returns two link references in two item { } structures instead of
    one, say to represent two data forks for the file.  If the UNIX box
    doesn't understand two data forks it won't properly ref count the
    second linkid reference.  BUT since the linkid reference is defined by
    the low level protocol, the protocol *WILL* be able to properly keep
    track of the reference and will be able to properly dereference it or
    whatever if the higher protocol layer didn't pick up the object.
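
    The two-fork case can be sketched like this.  Everything here is a
    hypothetical model (big-endian encoding, itemlen covering data[]
    only, a handler that reports which linkids it kept), but it shows
    the property that matters: find_linkids() relies only on bits 14
    and 15, never on what an itemid means, so the low layer can release
    a reference the upper layer did not understand.

```python
import struct

RECURSE, REF = 0x8000, 0x4000


def find_linkids(buf):
    """Walk encoded items and return every linkid, in order, without
    interpreting any item number."""
    found = []
    off = 0
    while off + 4 <= len(buf):
        itemid, itemlen = struct.unpack_from(">HH", buf, off)
        data = buf[off + 4:off + 4 + itemlen]
        if itemid & RECURSE:
            if itemid & REF:            # skip sub-trees with no refs
                found.extend(find_linkids(data))
        elif itemid & REF:
            if len(data) != 8:
                raise ValueError("linkid item must carry 8 bytes")
            found.append(struct.unpack(">Q", data)[0])
        off += 4 + itemlen
    return found


def deliver(items, handler, refs):
    """Count every linkid in `items`, pass the items up, then drop the
    references the handler did not keep.  `handler(items)` returns the
    set of linkids it took; `refs` maps linkid -> reference count.
    Returns the linkids that must be released back to the sender."""
    seen = find_linkids(items)
    for linkid in seen:
        refs[linkid] = refs.get(linkid, 0) + 1
    taken = handler(items)
    dropped = [linkid for linkid in seen if linkid not in taken]
    for linkid in dropped:
        refs[linkid] -= 1
        if refs[linkid] == 0:
            del refs[linkid]
    return dropped
```

    With a response carrying linkids 100 and 200 and a handler that
    only understands the first fork, deliver() returns [200] and leaves
    refs holding a single reference on 100.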

    Another example... say we do a 'stat' command.  The server might return
    a recursive item { } structure containing items for each stat field
    (size, modes, owner, etc).  The response might contain item structures
    that the client does not recognize.  The client needs to simply be
    able to ignore the sub elements it does not recognize.
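
    A sketch of a client parsing such a 'stat' reply.  The item numbers,
    the integer encoding of each field and the byte order are
    assumptions.  Because every item carries its own itemlen, the
    client can step over anything it does not recognize.

```python
import struct

# Hypothetical item numbers known to this (older) client.
KNOWN = {1: "size", 2: "mode", 3: "uid"}
FLAGS = 0xC000     # bits 15 and 14 of itemid


def parse_stat(body):
    """Decode the fields this client knows about and skip the rest."""
    result = {}
    off = 0
    while off + 4 <= len(body):
        itemid, itemlen = struct.unpack_from(">HH", body, off)
        data = body[off + 4:off + 4 + itemlen]
        name = KNOWN.get(itemid & ~FLAGS)
        if name is not None and not itemid & FLAGS:
            result[name] = int.from_bytes(data, "big")
        # Unknown or unexpected items are skipped using itemlen alone.
        off += 4 + itemlen
    return result
```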

    So by making the data slightly non-opaque, just slightly, we can
    develop protocols which interoperate over many releases.  It is very
    important that 'old' machines be able to talk to 'new' machines and

					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>
