syslink effort update

Thu Apr 26 19:45:30 PDT 2007

    Here's an update on the syslink work.  After much thought on how to
    best approach the problem of accessing machine resources over a remote
    link I finally realized that building a clustered operating system
    requires sharing far more than just VM objects, processes, and devices.

    For all intents and purposes it requires sharing almost every type of
    resource an operating system can have.  Here's a short list:

	VM spaces
	VM objects
	VM pages
	processes
	lwps
	vnodes
	inodes (e.g. for clustered FS support)
	sockets
	file descriptor tables
	file descriptors
	devices
	labeled disks (logical abstractions using the 'label' field)
	creds
	file buffers
	file BIOs
	... and probably many other things ...

    When I first contemplated doing this I came up with SYSLINK, a message
    based protocol that can devolve down into almost direct procedure calls
    when two localized resources talk to each other.

    I am still on track with the SYSLINK concept, but as I continued to
    develop it I hit a snag, and I think I finally have a solution to
    this snag.

    The snag is this: In order to transport requests across a machine
    boundary (that is, outside the domain of a direct memory access), it
    is necessary to assign a unique identifier to the resource.   The easiest
    way to think about this is to consider something like NFS.  Accessing
    a file over NFS requires a name lookup which translates into an
    identifier that represents the inode.  The NFS client can simply cache
    the identifier without having to know much about the complex resource
    the identifier represents, other than that it is a 'file'.

    In order to do this with SYSLINK I was up until today contemplating 
    reworking all the major system structures so they would use a
    'syslink compatible' API.  That would mean changing DEVOPS, VOPS,
    file descriptor access routines, and so on and so forth.  I didn't
    quite realize that there were over a dozen (maybe even two dozen)
    different structures that would need to be redone.

    Well, reworking two dozen structures is out of the question.  I'd like
    to get this done before I start looking like Rumpelstiltskin!  Hence
    the hair pulling.

    --

    Today I came up with something that IS possible to do in a more
    reasonable time frame.  I'm kinda kicking myself for not thinking of it
    sooner.

    Instead of reworking all the APIs I am instead going to rework JUST the
    reference counting methodology used in these resource structures.  Right
    now all the resource structures roll their own ref counting mechanisms.

    That's all going to be replaced with a common ref counting API and a
    little structure that includes a 64 bit unique sysid, red-black tree
    node, the ref count, and a pointer to a resource type structure (e.g.
    identifying it as a vnode, vm object, or whatever).
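
    A rough sketch of what such a common header and ref counting API
    could look like follows.  Every name in it (struct sysref,
    sysref_class, sysref_get, sysref_put, demo_vnode) is hypothetical,
    the tree linkage fields are only placeholders, and C11 atomics
    stand in for whatever bus-locked primitive the kernel would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct sysref;

/* Hypothetical resource type descriptor (vnode, vm object, ...). */
struct sysref_class {
    const char *name;                     /* human readable type name */
    void (*terminate)(struct sysref *sr); /* runs when the last ref drops */
};

/* Hypothetical common header embedded in every shared resource. */
struct sysref {
    uint64_t sysid;                       /* 64 bit unique id */
    /* Placeholder tree linkage; a BSD kernel would more likely embed an
     * RB_ENTRY() from <sys/tree.h> here. */
    struct sysref *rb_left, *rb_right, *rb_parent;
    int rb_red;
    atomic_int refcnt;                    /* the ref count */
    const struct sysref_class *srclass;   /* what kind of resource this is */
};

/* Initialize the header with a single reference held by the creator. */
void sysref_init(struct sysref *sr, const struct sysref_class *cls,
                 uint64_t sysid)
{
    sr->sysid = sysid;
    sr->rb_left = sr->rb_right = sr->rb_parent = NULL;
    sr->rb_red = 0;
    atomic_init(&sr->refcnt, 1);
    sr->srclass = cls;
}

/* Take an additional reference. */
void sysref_get(struct sysref *sr)
{
    atomic_fetch_add(&sr->refcnt, 1);
}

/* Drop a reference.  Returns 1 if it was the last one, in which case the
 * type's terminate hook has been called, else 0. */
int sysref_put(struct sysref *sr)
{
    int old = atomic_fetch_sub(&sr->refcnt, 1);

    assert(old > 0);
    if (old == 1) {
        sr->srclass->terminate(sr);
        return 1;
    }
    return 0;
}

/* Example resource: the header is the first member, so a pointer to the
 * header can be cast back to the containing structure. */
struct demo_vnode {
    struct sysref ref;
    int terminated;
};

void demo_vnode_terminate(struct sysref *sr)
{
    ((struct demo_vnode *)sr)->terminated = 1;
}

const struct sysref_class demo_vnode_class = {
    "vnode", demo_vnode_terminate
};
```

    The point of the sketch is only that each resource embeds the same
    small header, so one get/put pair can serve all resource types while
    the type structure supplies the per-type teardown.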

    When any of the above resources are allocated, they will be indexed in
    a Red-Black tree.  In other words it will be possible to identify every
    single resource in the system by traversing the red-black tree, which
    means it will be possible to look up ANY resource in the system by its
    sysid using a red-black tree lookup!

    I am going to implement a per-cpu Red-Black tree and use critical
    sections to control access to it.  All resources will be registered
    when they are allocated, and deregistered when they are released. 
    Use of a per-cpu RB tree will mean no lock contention on the RB tree
    itself and cross cpu releases will just use a passive IPI, which costs
    us almost nothing.  The ref count field will be updated with bus-locked
    instructions or protected by a spinlock, but I don't expect that to
    create a contention issue.
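
    The register / look up / deregister cycle can be sketched as below
    for a single cpu's tree.  This is an illustration under stated
    assumptions: a plain unbalanced binary search tree stands in for the
    red-black tree (no rebalancing is done), there are no critical
    sections or IPIs because this is ordinary userland C, and all names
    (sysreg_node, sysreg_register, sysreg_lookup, sysreg_unregister) are
    hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node, keyed by sysid. */
struct sysreg_node {
    uint64_t sysid;
    struct sysreg_node *left, *right;
    void *resource;                 /* the vnode, vm object, ... */
};

/* Return the pointer slot that holds sysid, or the empty slot where it
 * would be inserted. */
static struct sysreg_node **sysreg_slot(struct sysreg_node **root,
                                        uint64_t sysid)
{
    struct sysreg_node **link = root;

    while (*link != NULL && (*link)->sysid != sysid)
        link = (sysid < (*link)->sysid) ? &(*link)->left : &(*link)->right;
    return link;
}

/* Index a newly allocated resource.  Returns 0, or -1 on a duplicate id. */
int sysreg_register(struct sysreg_node **root, struct sysreg_node *n)
{
    struct sysreg_node **link;

    assert(n != NULL);
    link = sysreg_slot(root, n->sysid);
    if (*link != NULL)
        return -1;
    n->left = n->right = NULL;
    *link = n;
    return 0;
}

/* Find a resource by sysid.  Returns NULL if it is not (or no longer)
 * registered. */
void *sysreg_lookup(struct sysreg_node **root, uint64_t sysid)
{
    struct sysreg_node *n = *sysreg_slot(root, sysid);

    return n != NULL ? n->resource : NULL;
}

/* Remove a resource when it is released.  Returns 0, or -1 if unknown. */
int sysreg_unregister(struct sysreg_node **root, uint64_t sysid)
{
    struct sysreg_node **link = sysreg_slot(root, sysid);
    struct sysreg_node *n = *link;

    if (n == NULL)
        return -1;
    if (n->left == NULL) {
        *link = n->right;
    } else if (n->right == NULL) {
        *link = n->left;
    } else {
        /* Two children: splice in the in-order successor. */
        struct sysreg_node **slink = &n->right;
        struct sysreg_node *s;

        while ((*slink)->left != NULL)
            slink = &(*slink)->left;
        s = *slink;
        *slink = s->right;
        s->left = n->left;
        s->right = n->right;
        *link = s;
    }
    n->left = n->right = NULL;
    return 0;
}
```

    A lookup of an id that has been deregistered simply returns NULL,
    which is the property a remote accessor would rely on.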

    What does that mean for SYSLINK?  It means that all of a system's
    resources will now become addressable via a 64 bit id and thus will
    be suitably represented in any remote protocol.

    The 64 bit sysids will be unique, and the mechanism is very simple...
    each cpu just initializes a 64 bit sysid to a shifted timestamp on
    boot, and then increments it by <ncpus> to 'allocate' a sysid.  You
    can't get much simpler than that.  I don't think it would be possible
    to overflow a 64 bit counter, even incrementing by ncpus, without at
    least a hundred years of uptime and I'm just not worried about a
    hundred years of uptime for a single host.
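
    A minimal sketch of that allocator, with some assumptions called out:
    the shift amount (32 bits) and the fixed cpu count are arbitrary
    choices for the example, the function names are hypothetical, and
    adding the cpu number to each cpu's starting value is an assumption
    made here so that the per-cpu sequences cannot collide.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCPUS 4    /* assumption: fixed cpu count for the example */
#define DEMO_SHIFT 32   /* assumption: how far the timestamp is shifted */

/* One counter per cpu, so allocation needs no shared lock. */
static uint64_t demo_next_sysid[DEMO_NCPUS];

/* At boot each cpu seeds its counter from the shifted timestamp.  The
 * "+ cpu" offset is an assumption that keeps the sequences disjoint. */
void sysid_boot(int cpu, uint64_t boot_seconds)
{
    assert(cpu >= 0 && cpu < DEMO_NCPUS);
    demo_next_sysid[cpu] = (boot_seconds << DEMO_SHIFT) + (uint64_t)cpu;
}

/* 'Allocate' a sysid: hand out the current value, then step by ncpus. */
uint64_t sysid_alloc(int cpu)
{
    uint64_t id;

    assert(cpu >= 0 && cpu < DEMO_NCPUS);
    id = demo_next_sysid[cpu];
    demo_next_sysid[cpu] += DEMO_NCPUS;
    return id;
}
```

    With a starting value that is a multiple of the cpu count, as it is
    here, each id taken modulo the cpu count gives the cpu that allocated
    it, and ids from a later boot start above those of an earlier boot as
    long as fewer than 2^32 increments' worth of ids were handed out per
    elapsed second.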

    The uniqueness means that remote accesses will not go accessing the
    wrong resource because a sysid will never be reused, which means
    that the sysid can represent a stable resource from the point of view
    of any remote accessor, and the remote accessor can be told if/when
    it goes away.

    And that, folks, gives us the building blocks we need to represent
    resources in a cluster.

    This also means I don't have to rewrite the APIs.  Instead I can simply
    write new RPC APIs for accesses made via syslink ids and, poof, now all
    of a system's resources will become accessible remotely, with only
    modest effort.
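
    As a purely illustrative sketch of an RPC entry point keyed by id: a
    request names a sysid and a command, the id is resolved to a
    resource, and the resource's type handles the command.  Nothing here
    is the actual syslink protocol.  The names (demo_resource, demo_type,
    demo_dispatch, DEMO_CMD_GETSIZE) are hypothetical, a small linear
    table stands in for the registry, and the use of ESTALE for an
    unknown id is an assumption borrowed from NFS-style semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_resource;

/* Hypothetical per-type RPC handler table. */
struct demo_type {
    const char *name;
    int (*rpc)(struct demo_resource *r, int cmd, int64_t *result);
};

/* Hypothetical resource: an id, a type, and some state. */
struct demo_resource {
    uint64_t sysid;
    const struct demo_type *type;
    int64_t size;
};

enum { DEMO_CMD_GETSIZE = 1 };

/* RPC handler for the example 'file' type. */
static int demo_file_rpc(struct demo_resource *r, int cmd, int64_t *result)
{
    if (cmd != DEMO_CMD_GETSIZE)
        return EOPNOTSUPP;
    *result = r->size;
    return 0;
}

const struct demo_type demo_file_type = { "file", demo_file_rpc };

/* Linear table standing in for the sysid registry. */
#define DEMO_MAXRES 8
static struct demo_resource *demo_table[DEMO_MAXRES];
static size_t demo_count;

void demo_register(struct demo_resource *r)
{
    assert(demo_count < DEMO_MAXRES);
    demo_table[demo_count++] = r;
}

/* Resolve the sysid and hand the command to the resource's type.
 * Returns 0 on success or an errno value. */
int demo_dispatch(uint64_t sysid, int cmd, int64_t *result)
{
    size_t i;

    for (i = 0; i < demo_count; i++) {
        if (demo_table[i]->sysid == sysid)
            return demo_table[i]->type->rpc(demo_table[i], cmd, result);
    }
    return ESTALE;  /* id unknown or resource already gone */
}
```

    The existing in-kernel APIs are untouched in this picture; only the
    thin dispatch layer knows about ids.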

    So this will be the next step for me: implementing the global
    registration, reference counting, and allocation and disposal API.
    I'm gonna call it 'sysreg', and the first commits are going to occur
    in the next few days because I don't expect it to be very difficult
    to implement.  I'm pretty excited.

    * Implement sysreg
    * Start converting structure refcount & allocation APIs to the
      sysreg API.
    * Build a local syslink VFS and DEV interface
    * Build a remote VFS and DEV interface via TCP (like NFS)
    * Continue working on the things needed for clustering, like
      the syslink mesh, packetized messaging protocols, and so on ...


					-Matt
					Matthew Dillon 
					<dillon at backplane.com>