[GSOC] Implement hardware nested page table support for vkernels

Fri Sep 20 10:36:32 PDT 2013

Hi all,

This last week I worked on testing the vkernel on various Intel
platforms and fixed some of the bugs this revealed (e.g. the "reboot"
command wasn't working: the execve used to reinit the vkernel was executed
under VMM, but at the same time execve was cleaning up resources,
including the VMM metadata -> failure).

I've implemented support for the "int 3" instruction, which the vkernel
uses to enter "db>" (you can test it by pressing Ctrl+\).

Dillon helped me improve the pmap_inval code by not doing any cpusync when
the PMAP is owned by only one process (which is the common case for
normal processes). Dillon also added a heuristic, vm_fault_page_quick, to
vm_fault_page in order to avoid coarse-grained locking when it's not
needed. vm_fault_page was used intensively by ept_copyin/ept_copyout and
the umtx_* calls, causing a lot of token collisions. We now have a low IPI
rate and few token collisions.
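
As a rough illustration of the pmap_inval change described above, here is a
toy model in Python. It is not the DragonFly code: ToyPmap, owners and the
log strings are made-up names for this sketch, and the only point is that
the cross-CPU synchronization is skipped when a single process owns the
pmap.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPmap:
    """Hypothetical stand-in for a pmap; not the real kernel structure."""
    # Ids of the processes that share this pmap (assumption for the sketch).
    owners: set = field(default_factory=set)


def needs_cpusync(pmap):
    """Return True only when more than one process shares the pmap."""
    return len(pmap.owners) > 1


def invalidate(pmap, log):
    """Record which steps a toy invalidation would take for this pmap."""
    if needs_cpusync(pmap):
        # Shared pmap: other CPUs may hold stale entries, so synchronize.
        log.append("cpusync")
    # The local invalidation happens in both cases.
    log.append("local-invalidate")
    return log


single_owner = invalidate(ToyPmap(owners={1}), [])
shared = invalidate(ToyPmap(owners={1, 2}), [])
print(single_owner)
print(shared)
```

In this model the single-owner case never records a "cpusync" step, which
is the common case the paragraph above refers to.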

There is an issue when the host is paging out memory. In this case the
vkernel fails in random places, but this also happens with the NON-VMM
vkernel, so it's an old issue. I didn't investigate it and left it out of
the test cases (as Matt suggested).

Let's talk about the results:
1) Dfly Haswell blades - vkernel with 2 GB of RAM and 8 vCPU - make
nativekernel -j 4
   a) VMM - an average of 355 secs (from 6 runs)
   b) NON-VMM - an average of 390 secs (from 6 runs)
   As one can see, the performance improvement is roughly 10%, because
with VMM we don't have as much lock contention from the vmspace_* calls.
But this is the best performance obtained. After rebooting the machine
the results vary, but they are still better than NON-VMM.

2) Core i5 (2500) - vkernel with 1 GB of RAM and 8 vCPU - make nativekernel
-j 4
   Here the results were in the 490s-500s range for both VMM and
NON-VMM. There were no visible improvements.
   I gave it only 1 GB of RAM to avoid paging out (the host had only 3 GB
of RAM).

3) Xeon (E5 2609) - vkernel with 2 GB of RAM and 8 vCPU - make nativekernel
-j 4
   Here the results were around 800s for both VMM and NON-VMM.
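
To put the Haswell numbers from 1) in perspective, here is a small
calculation in Python (the helper names are made up for this sketch; only
the 355 and 390 second averages come from the results above). Depending on
whether one measures the reduction in build time or the speedup, the gain
comes out slightly under or close to the 10% quoted.

```python
VMM_SECS = 355.0      # average of 6 runs with VMM (from the report)
NON_VMM_SECS = 390.0  # average of 6 runs without VMM (from the report)


def time_reduction_pct(baseline, improved):
    """Percentage of the baseline build time that was saved."""
    return (baseline - improved) / baseline * 100.0


def speedup_pct(baseline, improved):
    """How much more work per second the improved run gets done."""
    return (baseline / improved - 1.0) * 100.0


reduction = time_reduction_pct(NON_VMM_SECS, VMM_SECS)
speedup = speedup_pct(NON_VMM_SECS, VMM_SECS)
print("build time reduction: %.1f%%" % reduction)  # 9.0%
print("speedup:              %.1f%%" % speedup)    # 9.9%
```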

First of all, even though we use the same vkernel configuration, the
differences between the test machines come from the generation of the
processors (mine are much older than the Dfly blades). In particular, the
Xeon performs worse because its frequency is only 2.4 GHz (compared to
3.X GHz for the Core i5).

By default, if VMX support is present in the hardware, the vkernel64 will
use it. I preserved compatibility with the NON-VMM vkernel by adding the
sysctl hw.vmm.enable option (0 - disabled, 1 - enabled). sysctl hw.vmm
also shows various values describing the available VMX configuration, but
hw.vmm.enable is the only option you can control.
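
For scripted test runs, switching between the two modes could look like the
following Python sketch. The helper names are hypothetical; the only things
taken from the text above are the hw.vmm.enable name and its 0/1 values.
The sketch assumes the standard "sysctl name=value" command form and that
the caller has the privileges needed to change the value.

```python
import subprocess


def vmm_enable_argv(enabled):
    """Build the sysctl command line that turns VMM use on or off."""
    return ["sysctl", "hw.vmm.enable=%d" % (1 if enabled else 0)]


def set_vmm_enable(enabled):
    """Run the command; raises CalledProcessError if sysctl fails."""
    subprocess.run(vmm_enable_argv(enabled), check=True)


# Only show the commands here; call set_vmm_enable() to actually apply one.
print(" ".join(vmm_enable_argv(False)))
print(" ".join(vmm_enable_argv(True)))
```

Running "sysctl hw.vmm" afterwards lists the current values, as mentioned
above.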

Even though we didn't obtain much better performance, the project brings
the following important benefits:
- the memory allocated to the vkernel is the ONLY memory used by the
vkernel. In the NON-VMM vkernel we have one pagetable in the host for each
pagetable in the guest (shadow pagetables), and these shadow pagetables are
not counted in the memory configured for the vkernel
- we have "guest-user" virtualization support, which lays the groundwork
for future work. We have the EPT support, but we still need a mechanism to
switch between vkernel user-space processes without calling the host. Right
now the threads are switched by calling vmspace_ctl (the only vmspace_*
call that is issued in VMM mode).

There are two big next steps in this project: implementing AMD SVM support
and implementing "guest-kernel" virtualization for the vkernel (the vkernel
alone controls its environment: it does the thread switching by itself,
gets page faults directly rather than via signals, and so on).

My opinion is that vkernel "guest-kernel" virtualization is more important
and needs to be tackled before AMD SVM. The basic steps would be creating a
new platform (e.g. vkernl64hvm), importing the initialization code from
pc64 (global descriptor table, interrupt descriptor table, etc.),
importing the thread switch code, importing the pmap code and dealing with
the threads that must communicate with the host (like the storage cothread,
which reads the image directly from the host). Anyway, this is only a
rough outline (just for your curiosity).

For AMD SVM, Venkatesh did some of the initialization setup, and we think
that adding support won't be much trouble (I guess much easier than
Intel's twisted VMX implementation).

This is my last report for the GSoC project. Matt will soon merge my work
into the master tree. If you find any bugs, you can reply here, e-mail me
directly or speak up on the IRC channel.

I want to thank Venkatesh and Matthew for guiding me through this project.
It was a little hard, but a great project (AlexH's proposal - thanks to him
for guiding me to choose this one :) ).

Thanks,
Mihai