<div dir="ltr">Master has received some major VM work, so please take care if you decide to update or upgrade your system; it may lose a little stability.  A full buildworld and buildkernel is needed due to internal structural changes.  The work is also not entirely complete: two or three memory conservation routines have not been put back in yet.  That said, the work looks pretty solid under brute-force testing.<div><br></div><div>The new work basically rewrites the handling of leaf PTEs in the pmap subsystem.  Each vm_page entered into the MMU's pmap used to be tracked with a 'pv_entry' structure.  The new work gets rid of these tracking structures for leaf pages.  This saves memory, helps deal with certain degenerate situations when many processes share lots of memory, and significantly improves concurrent page fault performance because we no longer have to do any list manipulation on a per-page basis.</div><div><br></div><div>Replacing the old system is a new one that uses vm_map_backing structures, which hang off of each vm_map_entry: essentially one structure for each 'whole mmap() operation', with some replication for copy-on-write shadowing.  So, instead of having a structure for each individual page in each individual pmap, we now have a single structure that covers potentially many pages.  The new tracking structures are locked, but the number of lock operations is reduced by a factor of at least 100.</div><div><br></div><div>The committed work is currently undergoing stability testing, and there will be follow-up commits to fix things like minor memory leaks, so expect those to be incoming.</div><div><br></div><div>Work still to do:</div><div><br></div><div>* I need to optimize vm_fault_collapse() to retain backing vnodes.  Currently any shadow object chain deeper than 5 causes the entry to fault all pages to the front object and then disconnect the backing objects.  
But this includes the terminal vnode object, which I don't actually want to include.</div><div><br></div><div>* I need to put page table pruning back in (right now empty page table pages are just left in the pmap until exit() to avoid racing the pmap's pmap_page_*() code).</div><div><br></div><div>* I need to implement a new algorithm to locate and destroy completely shadowed anonymous pages.</div><div><br></div><div>None of this is critical for the majority of use cases, though.  The vm_object shadowing code does limit the depth, so completely shadowed objects won't just build up forever.</div><div><br></div><div>--</div><div><br></div><div>These changes significantly improve page fault performance, particularly under heavy concurrent loads.</div><div><br></div><div>* Kernel overhead during the 'synth everything' bulk build is now under 15% system time, down from over 20%, measured as system time / (system time + user time).  Tested on the Threadripper (32-core/64-thread).</div><div><br></div><div>* The heavy use of shared mmap()s across processes no longer multiplies the pv_entry use, saving a lot of memory.  This can be particularly important for postgres.</div><div><br></div><div>* Concurrent page faults now have essentially no SMP lock contention and only four cache-line bounces for atomic ops per fault (something we may now also be able to address, using the new work as a basis).</div><div><br></div><div>* Zero-fill fault rate appears to max out the CPU chip's internal data buses, though there is still room for improvement.  I top out at 6.4M zfod/sec (around 25 GBytes/sec worth of zero-fill faults) on the Threadripper and I can't seem to get it to go higher.  Note that the executing kernel code obviously adds a little more dynamic RAM overhead on top of that, but still...</div><div><br></div><div>* Heavy concurrent exec rate on the TR (all 64 threads) for a shared dynamic binary increases from around 6000/sec to 45000/sec.  
This is actually important, because bulk builds</div><div><br></div><div>* Heavy concurrent exec rate on the TR for independent static binaries now caps out at around 450000 execs per second, which is an insanely high number.</div><div><br></div><div>* Single-threaded page fault rate is still a bit wonky but hit 500K-700K faults/sec (2-3 GBytes/sec).</div><div><br></div><div>--</div><div><br></div><div>Small system comparison using a Ryzen 2400G (4-core/8-thread), release vs master (this includes other work that has gone into master since the last release, too):</div><div><br></div><div>* Single-threaded exec rate (shared dynamic binary) - 3180/sec to 3650/sec</div><div><br></div><div>* Single-threaded exec rate (independent static binary) - 10307/sec to 12443/sec</div><div><br></div><div>* Concurrent exec rate (shared dynamic binary x 8) - 15160/sec to 19600/sec</div><div><br></div><div>* Concurrent exec rate (independent static binary x 8) - 60800/sec to 78900/sec</div><div><br></div><div>* Single-threaded zero-fill fault rate - 550K zfod/sec to 604K zfod/sec</div><div><br></div><div>* Concurrent zero-fill fault rate (8 threads) - 1.2M zfod/sec to 1.7M zfod/sec</div><div><br></div><div>* make -j 16 buildkernel test (tmpfs /usr/src, tmpfs /usr/obj):</div><div><br></div><div>    4.4% improvement in overall time on the first run (6.2% improvement on subsequent runs).  system% drops from 15.6% to 11.2% of total CPU seconds.  This is a kernel overhead reduction of 31%.  
Note that the increased time on release is probably due to inefficient buffer cache recycling.</div><div><br></div><div>    1309.445u 242.506s 3:53.54 664.5%   (release)</div><div>    1315.890u 258.165s 4:00.97 653.2%   (release, run 2)    </div><div>    1318.458u 259.394s 4:00.51 656.0%   (release, run 3)<br></div><div><br></div><div>    1329.099u 167.351s 3:46.05 661.9%   (master)</div><div>    1335.791u 169.270s 3:46.13 665.5%   (master, run 2)</div><div>    1334.925u 169.779s 3:46.92 663.0%   (master, run 3)</div><div><br></div><div><div><div>-Matt</div></div></div></div>
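<div><br></div><div>For anyone who wants to check the system% figures against the timing lines above, here is a minimal Python sketch.  It uses the same ratio quoted earlier, system time / (system time + user time), applied to the first release run and the first master run.  The helper names are made up for illustration, and treating the 31% figure as the drop in raw system seconds between those two runs is an assumption that happens to match the numbers.</div>

```python
# User and system seconds copied from the first "release" and first
# "master" timing lines above.
RUNS = {
    "release": {"user": 1309.445, "system": 242.506},
    "master": {"user": 1329.099, "system": 167.351},
}


def system_fraction(user, system):
    """Kernel overhead as system time / (system time + user time)."""
    return system / (system + user)


def relative_reduction(before, after):
    """Fractional drop going from `before` to `after`."""
    return (before - after) / before


release_frac = system_fraction(**RUNS["release"])
master_frac = system_fraction(**RUNS["master"])

# Assumption: the quoted reduction compares raw system seconds.
sys_reduction = relative_reduction(
    RUNS["release"]["system"], RUNS["master"]["system"]
)

print(f"release system%: {release_frac:.1%}")
print(f"master system%:  {master_frac:.1%}")
print(f"system time reduction: {sys_reduction:.0%}")
```

<div>With these inputs the script reports 15.6% and 11.2% for system%, and a 31% reduction in system seconds.</div>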