<div dir="ltr">Hello,<div><br></div><div class="gmail_extra"><br><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Also, Dillon did a little refactoring of my code, changing the way I cached some attributes so that they are modified only when needed. I haven't integrated the code into my branch yet (I took only some pieces of it). I will do this after I solve the bug.</div>


<div><br></div><div>After solving the bug, the next big steps are to start configuring EPT and to see what I can reuse from the "old" pmap code. </div></div></div></div></blockquote><div>After losing a lot of debugging time (mine, Matthew's and Venkatesh's), and with the help of Matthew's ideas, I guess I've managed to solve the bug. The behaviour of the host machine was very strange: after launching a vkernel with 6 threads, the host became less responsive, and the vkernel threads would get the CPU and never release it (their CPUTIME only kept growing). First, Matthew pointed out that there was a very large number of page faults in the system and that maybe I wasn't resolving them correctly (missing some attributes when calling the page-fault code). I rechecked this and it was all fine. I also checked my VMM path for page-fault handling, and there weren't many page faults causing VMEXITs (page faults generated by the vkernel itself). Today Matthew saw that there were a lot of page faults (1 million/sec), the vast majority of them in copyin(). copyin() kept faulting at the same address for a certain amount of time, and then the fault miraculously got resolved and execution went on. We considered a possibly invalid TLB (last week I also ran some tests that invalidated the TLB at each page fault), but this was not the case. At each page fault I checked whether the virtual address was already in the page table with the right attributes, and it was (basically, the VA was mapped). This made me think about the CR3 content: maybe CR3 was pointing to another page table. I double-checked the CR3 when entering guest mode (and had also done this many times last week). The problem was with the CR3 of the host: vmspace_ctl was writing a new CR3, but I was loading that CR3 into the VMCS structure only when certain conditions occurred (moving to another CPU, for example). 
When the VMX module restored the CR3, it restored a bad one, pointing to the wrong page table. This is why, in some cases, CR3 pointed to the wrong page table, making the kernel page-fault like crazy.</div>
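<div><br></div><div>To make the failure mode concrete, here is a minimal user-space sketch of it, not the actual VMX module code. All struct and function names are hypothetical and only model the behaviour described above: the host CR3 stored in the VMCS is what gets restored on VMEXIT, so if it is refreshed only on CPU migration it can lag behind the CR3 written by vmspace_ctl. The "fixed" variant shows one plausible way to keep the two in sync; it is an assumption, not necessarily what the commit does.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model. None of these names exist in the kernel. */
struct vmcs_model {
    uint64_t host_cr3;   /* value restored into CR3 on VMEXIT */
};

struct host_thread_model {
    uint64_t cr3;        /* page table the host currently expects to run on */
    int      last_cpu;   /* CPU on which the VMCS was last refreshed */
};

/* Models vmspace_ctl installing a new page table for the host side. */
static void model_vmspace_switch(struct host_thread_model *td, uint64_t new_cr3)
{
    assert(td != NULL);
    td->cr3 = new_cr3;
}

/* Buggy entry path: the VMCS host CR3 is refreshed only after a CPU migration. */
static void vmenter_buggy(struct vmcs_model *vmcs, struct host_thread_model *td, int cpu)
{
    if (td->last_cpu != cpu) {
        vmcs->host_cr3 = td->cr3;
        td->last_cpu = cpu;
    }
}

/* Assumed fix: also refresh the VMCS whenever the cached CR3 differs. */
static void vmenter_fixed(struct vmcs_model *vmcs, struct host_thread_model *td, int cpu)
{
    td->last_cpu = cpu;
    if (vmcs->host_cr3 != td->cr3)
        vmcs->host_cr3 = td->cr3;
}

/* On VMEXIT the CPU loads CR3 from the VMCS; report whether it is the expected one. */
static bool vmexit_cr3_is_valid(const struct vmcs_model *vmcs, const struct host_thread_model *td)
{
    return vmcs->host_cr3 == td->cr3;
}
```

<div>In this model, calling model_vmspace_switch() and then vmenter_buggy() on the same CPU leaves the VMCS with the old CR3, so the following VMEXIT resumes the host on the wrong page table, which matches the symptom of copyin() faulting repeatedly on an address that is in fact mapped.</div>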

<div><br></div><div>I said "I guess" at the beginning because I haven't tested it much.</div><div><br></div><div>So theoretically the vkernel now runs in VMX non-root operation mode with SMP, and I will start enabling EPT and adding code paths for working with it (I will probably start with a little test program, not with the vkernel itself).</div>

<div><br></div><div>I've made a commit with the code refactored by Dillon, the bug fix described above, and a fix for another bug Dillon found. The FS base was cached across context switches, and the VMX module was loading it from the VMCS structure as 0 every time, causing invalid accesses. The issue was solved by writing mdcpu->gd_user_fs at each vmenter.</div>
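<div><br></div><div>The FS base fix can be sketched the same way. This is only an illustrative user-space model: apart from gd_user_fs, every name is hypothetical, and I am assuming the value is written into the host FS base field of the VMCS, which is where the CPU takes the FS base from on VMEXIT.</div>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the per-CPU data; only gd_user_fs is named in the email. */
struct globaldata_model {
    uint64_t gd_user_fs;      /* FS base cached across context switches */
};

/* Hypothetical model of the VMCS field assumed to hold the host FS base. */
struct vmcs_fs_model {
    uint64_t host_fs_base;    /* stays 0 if nobody ever writes it */
};

/* On VMEXIT the FS base is taken from the VMCS, whatever it contains. */
static uint64_t vmexit_fs_base(const struct vmcs_fs_model *vmcs)
{
    return vmcs->host_fs_base;
}

/* Fix as described: write the cached FS base at each vmenter. */
static void vmenter_sync_fs(struct vmcs_fs_model *vmcs, const struct globaldata_model *gd)
{
    assert(vmcs != NULL && gd != NULL);
    vmcs->host_fs_base = gd->gd_user_fs;
}
```

<div>Without the call to vmenter_sync_fs() the model returns an FS base of 0 after every exit, so any FS-relative access that follows would hit an invalid address. Syncing on every entry costs one extra write but cannot go stale after a context switch.</div>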

<div><br></div><div>Thanks,</div><div>Mihai</div></div></div></div>