Cache coherency, clustering, and Kernel virtualization

Sat Sep 2 13:15:42 PDT 2006

Please excuse my newbness, but how does this differ from UML (User-Mode Linux)?

Thanks,
--TongKe

On Sat, Sep 02, 2006 at 11:49:36AM -0700, Matthew Dillon wrote:
>     As people may have noticed, I managed to get the first cut of the
>     cache coherency subsystem in place.  Unfortunately, a great deal more
>     work is needed to get it working fully.   I need to flesh out syslink and
>     work on the cross-machine cache coherency algorithms themselves.  This
>     work is going to be very heavily integrated into the kernel, and it is
>     very complex... so much so that debugging it in an actual kernel (even
>     via VMware) would not be all that much fun.  For that matter, it gets
>     even worse once I get to the point where I need to test communication
>     between living systems.
> 
>     So for the last week or two I've been considering my options, and I
>     have finally come up with a plan that will not only make development
>     a whole lot easier, but also give us a nice feather in our cap for our
>     December release!
> 
>     Consider what we want to accomplish.  We want to be able to cut up
>     system resources and link them into 'clusters', with the whole mess
>     tied together on the internet.  Originally I envisioned cutting up
>     memory, disk, and cpu resources and connecting them to a cluster
>     individually, but now I believe what we need to do is connect an
>     entire kernel to the cluster and basically operate as a single system
>     image.
> 
>     Now consider the problem of tying an entire kernel into an internet-based
>     cluster.  Does that sound like something that would be 'safe' to
>     integrate into your real kernel?  NO WAY!  It is virtually impossible
>     to 'secure' a kernel which is operating as a single system image in
>     a cluster of machines connected together via the internet.
> 
>     So what do we do?  Well, I finally figured it out.  It may seem obvious
>     but there were some severe problems I had to work out before I could be
>     sure that it would work.  The coding for it isn't even all that
>     difficult.
> 
>     What we do is we make it so a DragonFly kernel can be compiled and run
>     as a userland application running under the real DragonFly kernel.  As
>     a userland application the virtual kernel can be completely firewalled
>     off from the rest of the system.  The virtual kernel can then be
>     associated with the 'cluster', and managing and controlling memory, cpu,
>     and disk resources is a whole lot easier when you have an entire kernel
>     as your funnel into the real system's resources.  If you want to tie
>     into multiple clusters you just create multiple virtual kernels!  More
>     to the point, the technology could be used to partition off major
>     services and EVEN USER LOGINS(!) on a large machine.
> 
>     Sounds kinda like what IBM did with Linux on its mainframes, eh?  But I
>     am going to do it with DragonFly and my expectation is that performance
>     within a virtual kernel will be within 20% of the performance of a
>     real kernel.
> 
>     --
> 
>     In order to be able to have a virtual kernel running as a userland
>     application the virtual kernel must be able to manipulate other VM
>     spaces.  Manipulating other VM spaces means I have to develop new system
>     calls to control VM spaces.  These VM spaces will represent the user
>     processes running under the virtual kernel.  This is where I have been
>     stuck for the last week, trying to figure out how to be able to map
>     memory between the virtual kernel and user processes running under
>     the virtual kernel without blowing away the REAL kernel's memory with
>     millions of VM map and VM object structures.
> 
>     I finally figured it out, and the answer is so simple that I am surprised
>     it took a week for me to figure it out.  The answer is:  You simply do
>     not attempt to represent the memory maps in the VM spaces being
>     controlled by the virtual kernel with real-kernel objects.  Instead,
>     you map the memory into those VM spaces directly via the PMAP subsystem.
> 
>     As some of you may know, the PMAP in a BSD kernel (unlike a Linux kernel)
>     is ephemeral... the real kernel can remove mappings at any time and
>     simply take a page fault to fill them in again.  This means that the
>     real kernel can theoretically support hundreds of thousands of PMAPs
>     and thus allow us to operate pretty much as many virtual kernels and
>     as many virtual processes under those virtual kernels as we wish without
>     blowing up our real kernel.
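
To make the "ephemeral PMAP" point concrete, here is a toy userland sketch in
C.  It is not DragonFly code and does not use any real kernel interface: the
names toy_pmap, backing, toy_pmap_lookup and toy_pmap_discard are all made up
for illustration.  The only idea it tries to show is that the translation
table is a disposable cache, because every entry can be rebuilt on a fault
from state that lives somewhere else.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES   8      /* size of the toy address space, in pages */
#define NO_FRAME (-1)   /* marks "no translation" / "no backing"   */

/* Authoritative state (hypothetical): which frame backs each page.
 * This is the part that must be kept; it is never thrown away here. */
struct backing {
    int frame[NPAGES];
};

/* Ephemeral state (hypothetical): a cache of translations.  Any entry
 * may be dropped at any time, since it can be refilled from backing. */
struct toy_pmap {
    int  frame[NPAGES];
    long faults;        /* how many times a lookup had to refill */
};

/* Start with no cached translations at all. */
static void toy_pmap_init(struct toy_pmap *pm)
{
    assert(pm != NULL);
    for (size_t i = 0; i < NPAGES; i++)
        pm->frame[i] = NO_FRAME;
    pm->faults = 0;
}

/* Drop every cached translation, as if under memory pressure. */
static void toy_pmap_discard(struct toy_pmap *pm)
{
    assert(pm != NULL);
    for (size_t i = 0; i < NPAGES; i++)
        pm->frame[i] = NO_FRAME;
}

/* Translate a page.  On a miss, count a "fault" and refill the entry
 * from the backing state.  Returns NO_FRAME for out-of-range pages
 * or pages that have no backing frame. */
static int toy_pmap_lookup(struct toy_pmap *pm, const struct backing *b,
                           size_t page)
{
    assert(pm != NULL && b != NULL);
    if (page >= NPAGES)
        return NO_FRAME;
    if (pm->frame[page] == NO_FRAME) {
        pm->faults++;
        pm->frame[page] = b->frame[page];
    }
    return pm->frame[page];
}
```

After toy_pmap_discard() the next lookup of a page simply costs one more
fault and returns the same frame as before, which is why the cached table
itself does not need to be kept around for every address space.
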
> 
>     The cost of this method is that when a virtual process running under a
>     virtual kernel takes a page fault, it must chain through the virtual
>     kernel and cannot short-cut directly to the real kernel to handle the
>     page fault.  I do not think this is a big deal considering the number
>     of page table optimizations we already have.
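
The fault path described above can be sketched as a toy model in C.  Again
this is only an illustration under my own assumptions, not the actual design:
struct vkernel, struct real_kernel, vkernel_handle_fault, real_kernel_fault
and vproc_access are hypothetical names, and a real implementation would
involve signals or system calls rather than plain function calls.  The sketch
only shows the extra hop: the real kernel cannot resolve the fault itself, so
it hands it to the virtual kernel, which answers from its own page table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VP_NPAGES   4
#define VP_UNMAPPED (-1)

/* Hypothetical virtual kernel: owns the page table of its process. */
struct vkernel {
    int  page_table[VP_NPAGES];
    long faults_handled;
};

/* Hypothetical real kernel: holds only disposable translations for
 * the virtual process, with no object describing what backs them. */
struct real_kernel {
    int  installed[VP_NPAGES];
    long faults_forwarded;
};

/* The virtual kernel resolves the fault from its own page table. */
static bool vkernel_handle_fault(struct vkernel *vk, size_t page,
                                 int *frame_out)
{
    vk->faults_handled++;
    if (page >= VP_NPAGES || vk->page_table[page] == VP_UNMAPPED)
        return false;
    *frame_out = vk->page_table[page];
    return true;
}

/* The real kernel sees the fault first but cannot short-cut it:
 * it forwards to the virtual kernel, then installs the answer. */
static bool real_kernel_fault(struct real_kernel *rk, struct vkernel *vk,
                              size_t page)
{
    int frame;

    rk->faults_forwarded++;
    if (!vkernel_handle_fault(vk, page, &frame))
        return false;
    rk->installed[page] = frame;
    return true;
}

/* A memory access by the virtual process.  Returns the frame, or
 * VP_UNMAPPED if the virtual kernel has no mapping for the page. */
static int vproc_access(struct real_kernel *rk, struct vkernel *vk,
                        size_t page)
{
    assert(rk != NULL && vk != NULL);
    if (page >= VP_NPAGES)
        return VP_UNMAPPED;
    if (rk->installed[page] == VP_UNMAPPED &&
        !real_kernel_fault(rk, vk, page))
        return VP_UNMAPPED;
    return rk->installed[page];
}
```

In this model only the first touch of a page pays for the round trip through
the virtual kernel; later accesses hit the installed translation until the
real kernel decides to drop it.
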
> 
>     This is going to be my goal for our December release... to have userland
>     kernels fully operational.  Development of the syslink and cache coherency
>     technology will go a lot faster once we have virtual kernels.
> 
> 						-Matt
>