syslink effort update
Matthew Dillon
dillon at apollo.backplane.com
Thu Apr 26 19:45:30 PDT 2007
Here's an update on the syslink work. After much thought on how to
best approach the problem of accessing machine resources over a remote
link I finally realized that building a clustered operating system
requires sharing far more than just VM objects, processes, and devices.
For all intents and purposes it requires sharing almost every type of
resource an operating system can have. Here's a short list:
VM spaces
VM objects
VM pages
processes
lwps
vnodes
inodes (e.g. for clustered FS support)
sockets
file descriptor tables
file descriptors
devices
labeled disks (logical abstractions using the 'label' field)
creds
file buffers
file BIOs
... and probably many other things ...
When I first contemplated doing this I came up with SYSLINK, a message
based protocol that can devolve into almost direct procedure calls
when two localized resources talk to each other.
I am still on track with the SYSLINK concept, but as I continued to
develop it I hit a snag, and I think I finally have a solution to
this snag.
The snag is this: In order to transport requests across a machine
boundary (that is, outside the domain of a direct memory access), it
is necessary to assign a unique identifier to the resource. The easiest
way to think about this is to consider something like NFS. Accessing
a file over NFS requires a name lookup which translates into an
identifier that represents the inode. The NFS client can simply cache
the identifier without having to know much about the complex resource
the identifier represents, other than that it is a 'file'.
In order to do this with SYSLINK, until today I was contemplating
reworking all the major system structures so they would use a
'syslink compatible' API. That would mean changing DEVOPS, VOPS,
file descriptor access routines, and so on and so forth. I hadn't
quite realized that there were over a dozen (maybe even two dozen)
different structures that would need to be redone.
Well, reworking two dozen structures is out of the question. I'd like
to get this done before I start looking like Rumpelstiltskin! Hence
the hair pulling.
--
Today I came up with something that IS possible to do in a more
reasonable time frame. I'm kinda kicking myself for not thinking of it
sooner.
Instead of reworking all the APIs I am instead going to rework JUST the
reference counting methodology used in these resource structures. Right
now all the resource structures roll their own ref counting mechanisms.
That's all going to be replaced with a common ref counting API and a
little structure that includes a 64 bit unique sysid, red-black tree
node, the ref count, and a pointer to a resource type structure (e.g.
identifying it as a vnode, vm object, or whatever).
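To make the idea concrete, here is a rough C sketch of what such a common header could look like, embedded at the front of each resource structure. All the names here are illustrative guesses, not actual kernel definitions (and the RB linkage is abbreviated; a real node also carries color and parent):

```c
#include <stdint.h>

struct sysref_class;    /* forward decl: describes the resource type */

/*
 * Hypothetical common reference-counting header embedded in every
 * registered resource (vnode, vm_object, proc, ...).
 */
struct sysref {
    uint64_t                   sr_sysid;  /* unique 64 bit sysid        */
    struct sysref             *sr_left;   /* red-black tree linkage     */
    struct sysref             *sr_right;  /* (color/parent omitted)     */
    int                        sr_refcnt; /* common reference count     */
    const struct sysref_class *sr_class;  /* vnode, vm object, etc.     */
};

struct sysref_class {
    const char *sc_name;                  /* e.g. "vnode"               */
    void      (*sc_dtor)(struct sysref *);/* called on last release     */
};
```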
When any of the above resources are allocated, they will be indexed in
a Red-Black tree. In other words it will be possible to identify every
single resource in the system by traversing the red-black tree, which
means it will be possible to lookup ANY resource in the system by its
sysid using a red-black tree lookup!
I am going to implement a per-cpu Red-Black tree and use critical
sections to control access to it. All resources will be registered
when they are allocated, and deregistered when they are released.
Use of a per-cpu RB tree will mean no lock contention on the RB tree
itself and cross cpu releases will just use a passive IPI, which costs
us almost nothing. The ref count field will be manipulated with
bus-locked instructions or spinlocks, but I don't expect that to create
a contention issue.
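As a rough user-space illustration of the per-cpu registration and lookup idea (a plain linked list per cpu stands in for the red-black tree, and critical sections and IPIs are ignored; NCPUS and all the names are assumptions for the sketch):

```c
#include <stddef.h>
#include <stdint.h>

#define NCPUS 4    /* assumption for the sketch */

/* Stand-in for one entry in a per-cpu registration index. */
struct resnode {
    uint64_t        sysid;      /* unique 64 bit sysid         */
    void           *resource;   /* the resource it identifies  */
    struct resnode *next;
};

/* One index per cpu; no lock needed when each cpu owns its own. */
static struct resnode *percpu_index[NCPUS];

/* Register a resource on the allocating cpu's index. */
static void
res_register(int cpu, struct resnode *n)
{
    n->next = percpu_index[cpu];
    percpu_index[cpu] = n;
}

/*
 * Look up any resource in the system by sysid.  A real implementation
 * would do an O(log n) red-black tree lookup per cpu rather than a
 * linear scan.
 */
static void *
res_lookup(uint64_t sysid)
{
    for (int cpu = 0; cpu < NCPUS; ++cpu)
        for (struct resnode *n = percpu_index[cpu]; n; n = n->next)
            if (n->sysid == sysid)
                return n->resource;
    return NULL;
}
```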
What does that mean for SYSLINK? It means that all of a system's
resources will now become addressable via a 64 bit id and thus will
be suitably represented in any remote protocol.
The 64 bit sysids will be unique, which is a very simple mechanism...
each cpu just initializes a 64 bit sysid to a shifted timestamp on
boot, and then increments it by <ncpus> to 'allocate' a sysid. You
can't get much simpler than that. I don't think it would be possible
to overflow a 64 bit counter, even incrementing by ncpus, without at
least a hundred years of uptime, and I'm just not worried about a
hundred years of uptime for a single host.
The uniqueness means that remote accesses will not go accessing the
wrong resource because sysids will never be reused, which means
that the sysid can represent a stable resource from the point of view
of any remote accessor, and the remote accessor can be told if/when
it goes away.
And that, folks, gives us the building blocks we need to represent
resources in a cluster.
This also means I don't have to rewrite the APIs. Instead I can simply
write new RPC APIs for accesses made via syslink ids and, poof, now all
of a system's resources will become accessible remotely, with only
modest effort.
So this will be the next step for me. Implementing the global
registration, reference counting, and allocation and disposal API.
I'm gonna call it 'sysreg', and the first commits are going to occur
in the next few days because I don't expect it to be very difficult
to implement. I'm pretty excited.
* Implement sysreg
* Start converting structure refcount & allocation APIs to the
sysreg API.
* Build a local syslink VFS and DEV interface
* Build a remote VFS and DEV interface via TCP (like NFS)
* continue working on the things needed for clustering, like
the syslink mesh, packetized messaging protocols, and so on ...
-Matt
Matthew Dillon
<dillon at backplane.com>
More information about the Kernel
mailing list