Cache coherency, clustering, and Kernel virtualization
Matthew Dillon
dillon at apollo.backplane.com
Sat Sep 2 12:03:43 PDT 2006
As people may have noticed, I managed to get the first cut of the
cache coherency subsystem in place. Unfortunately, a great deal more
work is needed to get it working fully. I need to flesh out syslink and
work on the cross-machine cache coherency algorithms themselves. This
work is going to be very heavily integrated into the kernel, and it is
very complex... so much so that debugging it in an actual kernel (even
via VMware) would not be all that much fun. For that matter, it gets
even worse once I get to the point where I need to test communication
between living systems.
So for the last week or two I've been considering my options, and I
have finally come up with a plan that will not only make development
a whole lot easier, but also give us a nice feather in our cap for our
December release!
Consider what we want to accomplish. We want to be able to cut up
system resources and link them into 'clusters', with the whole mess
tied together on the internet. Originally I envisioned cutting up
memory, disk, and cpu resources and connecting them to a cluster
individually, but now I believe what we need to do is connect an
entire kernel to the cluster and basically operate as a single system
image.
Now consider the problem of tying an entire kernel into an internet-based
cluster. Does that sound like something that would be 'safe' to
integrate into your real kernel? NO WAY! It is virtually impossible
to 'secure' a kernel which is operating as a single system image in
a cluster of machines connected together via the internet.
So what do we do? Well, I finally figured it out. It may seem obvious
but there were some severe problems I had to work out before I could be
sure that it would work. The coding for it isn't even all that
difficult.
What we do is we make it so a DragonFly kernel can be compiled and run
as a userland application running under the real DragonFly kernel. As
a userland application the virtual kernel can be completely firewalled
off from the rest of the system. The virtual kernel can then be
associated with the 'cluster', and managing and controlling memory, cpu,
and disk resources is a whole lot easier when you have an entire kernel
as your funnel into the real system's resources. If you want to tie
into multiple clusters you just create multiple virtual kernels! More
to the point, the technology could be used to partition off major
services and EVEN USER LOGINS(!) on a large machine.
Sounds kinda like what IBM did with Linux on its mainframes, eh? But I
am going to do it with DragonFly and my expectation is that performance
within a virtual kernel will be within 20% of the performance of a
real kernel.
--
In order to be able to have a virtual kernel running as a userland
application the virtual kernel must be able to manipulate other VM
spaces. Manipulating other VM spaces means I have to develop new system
calls to control VM spaces. These VM spaces will represent the user
processes running under the virtual kernel. This is where I have been
stuck for the last week, trying to figure out how to be able to map
memory between the virtual kernel and user processes running under
the virtual kernel without blowing away the REAL kernel's memory with
millions of VM map and VM object structures.
I finally figured it out, and the answer is so simple that I am surprised
it took a week for me to figure it out. The answer is: You simply do
not attempt to represent the memory maps in the VM spaces being
controlled by the virtual kernel with real-kernel objects. Instead,
you map the memory into those VM spaces directly via the PMAP subsystem.
As some of you may know, the PMAP in a BSD kernel (unlike a Linux kernel)
is ephemeral... the real kernel can remove mappings at any time and
simply take a page fault to fill them in again. This means that the
real kernel can theoretically support hundreds of thousands of PMAPs
and thus allow us to operate pretty much as many virtual kernels and
as many virtual processes under those virtual kernels as we wish without
blowing up our real kernel.
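To illustrate the ephemeral-mapping idea, here is a toy Python model. It is not DragonFly code: the class name, its methods, and the `backing` table are all hypothetical, and a plain dictionary stands in for the hardware page tables. It shows only that a mapping cache can be thrown away at any moment and rebuilt on demand from whoever owns the authoritative mappings.

```python
class EphemeralPmap:
    """Toy model of a discardable page-mapping cache (illustrative only)."""

    def __init__(self, resolve_fault):
        # resolve_fault(vpage) -> physical page; a stand-in for the
        # authoritative owner of the mappings (hypothetical callback).
        self._resolve_fault = resolve_fault
        self._entries = {}
        self.faults = 0

    def translate(self, vpage):
        # A missing entry is not an error: take a "fault" and refill it.
        if vpage not in self._entries:
            self.faults += 1
            self._entries[vpage] = self._resolve_fault(vpage)
        return self._entries[vpage]

    def discard_all(self):
        # The owner may drop every cached mapping at any time.
        self._entries.clear()

    def resident(self):
        return len(self._entries)


# Assumed authoritative mappings: virtual page -> physical page.
backing = {0: 100, 1: 101, 2: 102}
pmap = EphemeralPmap(backing.__getitem__)

first = [pmap.translate(v) for v in (0, 1, 2)]
pmap.discard_all()                      # mappings vanish...
second = [pmap.translate(v) for v in (0, 1, 2)]  # ...and fault back in
```

Because nothing in the cache is authoritative, its memory cost is bounded by what is currently resident rather than by the total number of address spaces being tracked.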
The cost of this method is that when a virtual process running under a
virtual kernel takes a page fault, it must chain through the virtual
kernel and cannot short-cut directly to the real kernel to handle the
page fault. I do not think this is a big deal considering the number
of page table optimizations we already have.
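The fault path described above can be sketched the same way. Again this is a hypothetical Python model under stated assumptions, not the real interface: `RealKernel`, `VirtualKernel`, and their methods are invented for illustration. The real kernel holds only a discardable mapping cache for the virtual process, so every miss has to be forwarded to the virtual kernel, which owns the page tables.

```python
class VirtualKernel:
    """Owns the page tables for the processes it runs (toy model)."""

    def __init__(self):
        self.page_tables = {}   # (pid, vpage) -> page

    def handle_fault(self, pid, vpage):
        # The virtual kernel resolves the fault from its own tables.
        return self.page_tables[(pid, vpage)]


class RealKernel:
    """Keeps only a discardable mapping cache for virtual processes."""

    def __init__(self, vkernel):
        self.vkernel = vkernel
        self.pmap = {}          # (pid, vpage) -> page, may be dropped
        self.forwarded_faults = 0

    def access(self, pid, vpage):
        key = (pid, vpage)
        if key not in self.pmap:
            # No real-kernel VM object to consult: chain through the
            # virtual kernel, then cache the answer it gives back.
            self.forwarded_faults += 1
            self.pmap[key] = self.vkernel.handle_fault(pid, vpage)
        return self.pmap[key]


vk = VirtualKernel()
vk.page_tables[(1, 0x10)] = 0x9000
rk = RealKernel(vk)

a = rk.access(1, 0x10)   # miss: forwarded to the virtual kernel
b = rk.access(1, 0x10)   # hit: served from the cached mapping
rk.pmap.clear()          # real kernel discards its mappings
c = rk.access(1, 0x10)   # miss again: one more forwarded fault
```

In this model the extra cost shows up only on misses, which matches the trade-off stated above: the chain through the virtual kernel is paid per fault, not per access.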
This is going to be my goal for our December release... to have userland
kernels fully operational. Development of the syslink and cache coherency
technology will go a lot faster once we have virtual kernels.
-Matt