Cache coherency, clustering, and Kernel virtualization

Matthew Dillon dillon at apollo.backplane.com
Sat Sep 2 12:03:43 PDT 2006


    As people may have noticed, I managed to get the first cut of the
    cache coherency subsystem in place.  Unfortunately, a great deal more
    work is needed to get it working fully.   I need to flesh out syslink and
    work on the cross-machine cache coherency algorithms themselves.  This
    work is going to be very heavily integrated into the kernel, and it is
    very complex... so much so that debugging it in an actual kernel (even
    via VMware) would not be all that much fun.  For that matter, it gets
    even worse once I get to the point where I need to test communication
    between living systems.

    So for the last week or two I've been considering my options, and I
    have finally come up with a plan that will not only make development
    a whole lot easier, but also give us a nice feather in our cap for our
    December release!

    Consider what we want to accomplish.  We want to be able to cut up
    system resources and link them into 'clusters', with the whole mess
    tied together on the internet.  Originally I envisioned cutting up
    memory, disk, and cpu resources and connecting them to a cluster
    individually, but now I believe what we need to do is connect an
    entire kernel to the cluster and basically operate as a single system
    image.

    Now consider the problem of tying an entire kernel into an internet-based
    cluster.  Does that sound like something that would be 'safe' to
    integrate into your real kernel?  NO WAY!  It is virtually impossible
    to 'secure' a kernel which is operating as a single system image in
    a cluster of machines connected together via the internet.

    So what do we do?  Well, I finally figured it out.  It may seem obvious
    but there were some severe problems I had to work out before I could be
    sure that it would work.  The coding for it isn't even all that
    difficult.

    What we do is we make it so a DragonFly kernel can be compiled and run
    as a userland application running under the real DragonFly kernel.  As
    a userland application the virtual kernel can be completely firewalled
    off from the rest of the system.  The virtual kernel can then be
    associated with the 'cluster', and managing and controlling memory, cpu,
    and disk resources is a whole lot easier when you have an entire kernel
    as your funnel into the real system's resources.  If you want to tie
    into multiple clusters you just create multiple virtual kernels!  More
    to the point, the technology could be used to partition off major
    services and EVEN USER LOGINS(!) on a large machine.

    Sounds kinda like what IBM did with Linux on its mainframes, eh?  But I
    am going to do it with DragonFly and my expectation is that performance
    within a virtual kernel will be within 20% of the performance of a
    real kernel.

    --

    In order to be able to have a virtual kernel running as a userland 
    application the virtual kernel must be able to manipulate other VM
    spaces.  Manipulating other VM spaces means I have to develop new system
    calls to control VM spaces.  These VM spaces will represent the user
    processes running under the virtual kernel.  This is where I have been
    stuck for the last week, trying to figure out how to be able to map
    memory between the virtual kernel and user processes running under
    the virtual kernel without blowing away the REAL kernel's memory with
    millions of VM map and VM object structures.

    I finally figured it out, and the answer is so simple that I am surprised
    it took a week for me to figure it out.  The answer is:  You simply do
    not attempt to represent the memory maps in the VM spaces being
    controlled by the virtual kernel with real-kernel objects.  Instead,
    you map the memory into those VM spaces directly via the PMAP subsystem.

    As some of you may know, the PMAP in a BSD kernel (unlike a Linux kernel)
    is ephemeral... the real kernel can remove mappings at any time and
    simply take a page fault to fill them in again.  This means that the
    real kernel can theoretically support hundreds of thousands of PMAPs
    and thus allow us to operate pretty much as many virtual kernels and
    as many virtual processes under those virtual kernels as we wish without
    blowing up our real kernel.
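
    The "throw it away and refault" property can be illustrated with a
    toy model.  This is only a sketch in Python, not DragonFly code: the
    ToyPmap class, its backing table, and the fault counter are all
    hypothetical names made up for the illustration.  The point it shows
    is that discarding every mapping loses nothing, because each entry
    can be rebuilt on the next fault.

```python
class ToyPmap:
    """Toy model of an ephemeral page-mapping cache (not DragonFly code)."""

    def __init__(self, backing):
        # backing: authoritative virtual-page -> physical-page table,
        # standing in for whatever structure can rebuild a mapping.
        self.backing = backing
        self.entries = {}   # the disposable mappings
        self.faults = 0     # number of simulated page faults

    def translate(self, vpage):
        if vpage not in self.entries:
            # Simulated page fault: refill the entry from the backing table.
            self.faults += 1
            self.entries[vpage] = self.backing[vpage]
        return self.entries[vpage]

    def discard_all(self):
        # Model the kernel dropping every mapping to reclaim memory.
        self.entries.clear()


backing = {0: 100, 1: 101, 2: 102}
pmap = ToyPmap(backing)
first = [pmap.translate(v) for v in (0, 1, 2)]    # three faults
pmap.discard_all()
second = [pmap.translate(v) for v in (0, 1, 2)]   # three more faults
print(first == second, pmap.faults)               # prints: True 6
```

    The only cost of discarding is the extra faults, which is why a very
    large number of such maps can be kept around cheaply.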

    The cost of this method is that when a virtual process running under a
    virtual kernel takes a page fault, it must chain through the virtual
    kernel and cannot short-cut directly to the real kernel to handle the
    page fault.  I do not think this is a big deal considering the number
    of page table optimizations we already have.
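
    The fault path described above might be pictured roughly as follows.
    Again this is a hedged Python sketch under stated assumptions, not an
    actual interface: RealKernel, VirtualKernel, register(), pmap_enter()
    and handle_fault() are hypothetical names.  The real kernel holds
    only the disposable pmap for each virtual process, so a miss has to
    be handed to the virtual kernel, which owns the page tables and asks
    the real kernel to install the mapping.

```python
class RealKernel:
    """Holds only disposable pmaps for virtual processes (toy model)."""

    def __init__(self):
        self.pmaps = {}      # space id -> {vpage: ppage}
        self.handlers = {}   # space id -> owning virtual kernel

    def register(self, space_id, vkernel):
        self.pmaps[space_id] = {}
        self.handlers[space_id] = vkernel

    def pmap_enter(self, space_id, vpage, ppage):
        # Install a mapping on behalf of the virtual kernel.
        self.pmaps[space_id][vpage] = ppage

    def access(self, space_id, vpage):
        pmap = self.pmaps[space_id]
        if vpage not in pmap:
            # The real kernel has no object describing this memory, so
            # the fault is chained through the virtual kernel.
            self.handlers[space_id].handle_fault(self, space_id, vpage)
        return pmap[vpage]


class VirtualKernel:
    """Owns the page tables of the processes running under it (toy model)."""

    def __init__(self, page_tables):
        self.page_tables = page_tables   # space id -> {vpage: ppage}
        self.faults_handled = 0

    def handle_fault(self, real, space_id, vpage):
        self.faults_handled += 1
        ppage = self.page_tables[space_id][vpage]
        real.pmap_enter(space_id, vpage, ppage)


real = RealKernel()
vk = VirtualKernel({"proc1": {7: 42}})
real.register("proc1", vk)
a = real.access("proc1", 7)   # misses, chains through the virtual kernel
b = real.access("proc1", 7)   # hits the installed mapping
print(a, b, vk.faults_handled)   # prints: 42 42 1
```

    Only the first access pays for the extra hop; later accesses hit the
    installed mapping until the real kernel decides to drop it.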

    This is going to be my goal for our December release... to have userland
    kernels fully operational.  Development of the syslink and cache coherency
    technology will go a lot faster once we have virtual kernels.

						-Matt





