Reviving userland LWKT

Wed Jul 12 13:53:52 PDT 2006

:
:Hi all,
:
:I'm interested in bringing back to life the port of LWKT to userland
:(the old "libcaps") - I know a lot of this stuff hasn't been a high
:priority for quite some time, but it was a cool unique Dragonfly feature
:and I'd hate to have it go to bit-rot as the rest of the system
:continues to mature.
:
:So, anyone have any ideas, suggestions or requests going out? I realize
:the sysport/sendsys stuff has been removed and I am prepared to deal
:with that (I'm not sure it ever worked anyway, did it?) My plan of
:attack focuses on usermode<->usermode intraprocess and interprocess
:messaging first, leaving the whole issue of what to do with the kernel
:interfaces for last.
:
:Thanks,
:-Eric

    Instead of doing that, perhaps you would be interested in something
    even MORE interesting, but very similar -- userland VFS!

    We need a stream based command/response mechanism that basically
    works similarly to the current filesystem journaling protocol but
    allows bi-directional command initiation, parallel commands, etc etc.
    Initially I just want the protocol to operate over a pipe or socket,
    and not via shared memory, but the protocol also has to eventually
    be extensible to a shared memory FIFO implementation which creates
    certain limitations.

    We need this protocol for both userland VFS and for data links 
    between machines in a clustered environment.  The idea is fairly
    simple in concept, but there are a lot of fine details involved, in
    particular in dealing with link breakages and reconnects.

    Each command would be constructed as an encapsulated message using
    a recursive structure, like this:

    msg {
	linkid 	   (64 bits)	(specifies the communications end point)
	msgid	   (32 bits)	(allows parallel commands to be issued)
	command    (16 bits)	(bit 15 indicates a response)
	length	   (16 bits)
	item {
		itemid	(16 bits)	(bit 15 indicates item recursion)
					(bit 14 indicates ref'd data)
		itemlen	(16 bits)
		data[]			(recursive item if item recursion)
	}
    }
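
    As a rough C sketch of the layout above: the struct, macro and
    function names are made up for illustration, and treating 'length'
    as covering the header plus all items is an assumption, not
    something the proposal pins down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits taken from the layout above. */
#define MSG_CMD_RESPONSE  0x8000u  /* command bit 15: message is a response */
#define MSG_ITEM_RECURSE  0x8000u  /* itemid bit 15: data[] holds nested items */
#define MSG_ITEM_REFDATA  0x4000u  /* itemid bit 14: ref'd data */

/* Fixed header: 8 + 4 + 2 + 2 = 16 bytes, naturally aligned. */
struct msg_hdr {
    uint64_t linkid;   /* communications end point */
    uint32_t msgid;    /* lets several commands be in flight at once */
    uint16_t command;  /* bit 15 set on responses */
    uint16_t length;   /* logical length (ASSUMED to include this header) */
};

/* Item header; itemlen bytes of data[] (or nested items) follow it. */
struct msg_item {
    uint16_t itemid;
    uint16_t itemlen;
};

/* True if the header describes a response rather than a command. */
static int msg_is_response(const struct msg_hdr *h)
{
    return (h->command & MSG_CMD_RESPONSE) != 0;
}

/* Build a reply header: same linkid/msgid/command, response bit set. */
static struct msg_hdr msg_make_reply(const struct msg_hdr *cmd, uint16_t length)
{
    struct msg_hdr r = *cmd;

    assert(length >= sizeof(struct msg_hdr));
    r.command = (uint16_t)(cmd->command | MSG_CMD_RESPONSE);
    r.length = length;
    return r;
}
```

    Matching a reply to its command by (linkid, msgid) is what would
    let several commands be outstanding on one link at the same time.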

    Messages would be limited to 65535 bytes (so, basically, a message
    embedding a data buffer would be limited to a 32 KB data buffer). 
    Messages representing large I/O's (such as someone doing a 2GB read()
    through to a userland VFS) would transfer the potentially large amount
    of data using a separate linkid that represents a 'data cache object'.

    To support a future shared memory FIFO API, messages will require specific
    alignments so as not to create cases that either overflow a shared memory
    FIFO or cross the boundary from the end of the FIFO's buffer back to
    the beginning.  This part is fairly simple.  The 'length' field simply
    represents the logical length of the message (e.g. like 27 bytes), but
    the actual formatting of the message within the stream would always be
    8-byte aligned (e.g. nextoffset = currentoffset + ((msg->length + 7) & ~7)).
    Same with the recursive items.
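
    The alignment rule can be sketched in C roughly as follows.  The
    helper names and the 'fifo_size' parameter are hypothetical; only
    the round-up-to-8 arithmetic comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_ALIGN 8u

/* Round a logical message length up to the next multiple of 8. */
static size_t msg_aligned_len(uint16_t length)
{
    return ((size_t)length + (MSG_ALIGN - 1)) & ~(size_t)(MSG_ALIGN - 1);
}

/* Stream offset of the message that follows one starting at 'cur'. */
static size_t msg_next_offset(size_t cur, uint16_t length)
{
    assert((cur & (MSG_ALIGN - 1)) == 0);   /* messages start 8-byte aligned */
    return cur + msg_aligned_len(length);
}

/*
 * Would the padded message fit between 'cur' and the end of a FIFO
 * buffer of 'fifo_size' bytes without wrapping back to the start?
 */
static int msg_fits_before_wrap(size_t cur, uint16_t length, size_t fifo_size)
{
    return cur <= fifo_size && msg_aligned_len(length) <= fifo_size - cur;
}
```

    With these, a 27-byte message at offset 64 is followed by the next
    message at offset 96, and a writer can check for the wrap case before
    copying anything into the FIFO.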

    This would be a layered protocol, with the lowest layer simply 
    encapsulating the core messaging protocol.  A layer on top of that
    would maintain and/or route end-points via the linkid.  A layer on top
    of that would implement disconnection/reconnection/connection-failure
    handling.  Then command/response protocols would be layered on top
    of that to implement infrastructure elements such as 'cache' objects
    (an object that represents a chunk of data), VFS commands, and so on
    and so forth.

    In order to be able to pass new linkid's as data[] the core protocol
    would have to reserve a bit in the itemid to indicate that the item
    or item recursion contains linkid's (in order to maintain reference
    counts on active connections and cache objects and such).
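
    A minimal sketch of how a lower layer might find linkid-carrying
    items without understanding the commands themselves.  Everything
    specific here is an assumption made for illustration: bit 13 as the
    'contains linkids' flag, itemlen including the 4-byte item header,
    host byte order, and padding measured from the start of the
    enclosing data area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ITEM_RECURSE 0x8000u  /* bit 15: data[] holds nested items */
#define ITEM_REFDATA 0x4000u  /* bit 14: ref'd data */
#define ITEM_LINKIDS 0x2000u  /* ASSUMED bit 13: item carries linkids */
#define ITEM_HDR     4u       /* itemid + itemlen */

static size_t align8(size_t n)
{
    return (n + 7u) & ~(size_t)7u;
}

/*
 * Count the non-recursive items flagged as carrying linkids in a run
 * of items, descending into recursive items.  Returns -1 if an item
 * header is truncated or its length is out of range.
 */
static int count_linkid_items(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        uint16_t itemid, itemlen;

        if (len - off < ITEM_HDR)
            return -1;
        memcpy(&itemid, buf + off, sizeof(itemid));
        memcpy(&itemlen, buf + off + 2, sizeof(itemlen));
        if (itemlen < ITEM_HDR || itemlen > len - off)
            return -1;

        if (itemid & ITEM_RECURSE) {
            int n = count_linkid_items(buf + off + ITEM_HDR,
                                       (size_t)itemlen - ITEM_HDR);
            if (n < 0)
                return -1;
            count += n;
        } else if (itemid & ITEM_LINKIDS) {
            count++;
        }
        off += align8(itemlen);   /* items are padded to 8 bytes */
    }
    return count;
}
```

    A real implementation would adjust reference counts on the objects
    named by each linkid rather than just counting items, and would
    probably want a recursion depth limit.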

    --

    Ok, that may be somewhat confusing, but hopefully you can see what I
    am getting at.  This is a big ticket need for DragonFly... we need 
    it for a userland VFS implementation, and we need it for inter-machine
    links in a cluster.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>