Major kernel VM pmap work in master

Tue May 21 12:24:05 PDT 2019

Master has received some major VM work, so please take care if you decide to
update or upgrade your system; it may lose a little stability.  A full
buildworld and buildkernel is needed due to internal structural changes.
The work is also not entirely complete: there are two or three memory
conservation routines that have not been put back in yet.  That said, the
work looks pretty solid under brute force testing.

The new work going in basically rewrites the handling of leaf PTEs in the
pmap subsystem.  Each vm_page entered into the MMU's pmap used to be
tracked with a 'pv_entry' structure.  The new work gets rid of these
tracking structures for leaf pages.  This saves memory, helps deal with
certain degenerate situations when many processes share lots of memory, and
significantly improves concurrent page fault performance because we no
longer have to do any list manipulation on a per-page basis.

Replacing this old system is a new system where we use vm_map_backing
structures which hang off of vm_map_entry's... essentially one structure
for each 'whole mmap() operation', with some replication for copy-on-write
shadowing.  So, instead of having a structure for each individual page in
each individual pmap, we now have a single structure that covers
potentially many pages.  The new tracking structures are locked, but the
number of lock operations is reduced by a factor of at least 100.
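
To give a feel for the difference in bookkeeping, here is a toy counting
model.  It is only a sketch, not kernel code: the function names, the
100-process workload, and the 4 KiB page size are assumptions made up for
illustration.  The only things taken from the description above are "one
pv_entry per page per pmap" versus "roughly one vm_map_backing per mmap()
operation, with some replication for copy-on-write shadowing".

```python
# Toy model only: counts bookkeeping structures, not real kernel objects.

PAGE_SIZE = 4096  # assumed base page size


def pv_entry_count(num_processes, pages_per_mapping, mappings_per_process=1):
    """Old scheme: one pv_entry per page, per pmap the page is entered into."""
    return num_processes * mappings_per_process * pages_per_mapping


def vm_map_backing_count(num_processes, mappings_per_process=1, shadow_copies=0):
    """New scheme: about one vm_map_backing per mmap()ed range, plus some
    replication for copy-on-write shadowing (modelled as shadow_copies)."""
    return num_processes * mappings_per_process * (1 + shadow_copies)


if __name__ == "__main__":
    # Hypothetical workload: 100 processes sharing one 1 GiB mapping.
    pages = (1 << 30) // PAGE_SIZE
    old = pv_entry_count(100, pages)
    new = vm_map_backing_count(100)
    print("pv_entry structures:      ", old)
    print("vm_map_backing structures:", new)
    print("ratio:                    ", old // new)
```

In this model the per-page cost disappears entirely, which is why heavily
shared mappings benefit the most.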

Currently the committed work is undergoing stability testing and there will
be follow-up commits to fix things like minor memory leaks and so forth, so
expect those to be incoming.

Work still to do:

* I need to optimize vm_fault_collapse() to retain backing vnodes.
Currently any shadow object chain deeper than 5 causes the entry to fault
all pages to the front object and then disconnect the backing objects.  But
this includes the terminal vnode object which I don't actually want to
include.

* I need to put page table pruning back in (right now empty page table
pages are just left in the pmap until exit() to avoid racing the pmap's
pmap_page_*() code).

* I need to implement a new algorithm to locate and destroy completely
shadowed anonymous pages.

None of this is critical for the majority of use cases, though.  The
vm_object shadowing code does limit the depth so completely shadowed
objects won't just build up forever.
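
The shadow-chain items above can be pictured with a small toy model.  This
is a sketch under stated assumptions, not the actual vm_fault_collapse()
logic: objects are modelled as plain dicts mapping page index to contents,
depth is taken to mean the number of objects in the chain, and the
keep_terminal flag is a hypothetical stand-in for the "retain backing
vnodes" behaviour that is still on the to-do list.  Only the depth limit of
5 comes from the text.

```python
# Toy model of a shadow object chain; chain[0] is the front object and
# chain[-1] stands in for the terminal (vnode) object.

MAX_SHADOW_DEPTH = 5  # limit quoted in the post


def lookup(chain, index):
    """Return the page visible at index: the first hit, front to back."""
    for obj in chain:
        if index in obj:
            return obj[index]
    return None


def needs_collapse(chain):
    """Assumption: 'deeper than 5' means more than 5 objects in the chain."""
    return len(chain) > MAX_SHADOW_DEPTH


def collapse(chain, keep_terminal=False):
    """Pull every visible page into the front object and drop the backing
    objects.  With keep_terminal=True (hypothetical), the last object is
    left attached instead of being copied."""
    front = chain[0]
    backing = chain[1:-1] if keep_terminal else chain[1:]
    for obj in backing:                      # front to back, so the
        for index, page in obj.items():      # shallowest copy wins
            front.setdefault(index, page)
    if keep_terminal and len(chain) > 1:
        return [front, chain[-1]]
    return [front]


def shadowed_pages(chain):
    """List (depth, index) for pages hidden by a shallower object."""
    seen, hidden = set(), []
    for depth, obj in enumerate(chain):
        hidden.extend((depth, i) for i in obj if i in seen)
        seen.update(obj)
    return hidden


if __name__ == "__main__":
    chain = [{0: "a0"}, {1: "b1"}, {0: "c0"}, {}, {}, {3: "f3"},
             {0: "v0", 4: "v4"}]
    print("shadowed:", shadowed_pages(chain))
    if needs_collapse(chain):
        chain = collapse(chain, keep_terminal=True)
    print("objects after collapse:", len(chain))
```

Note that this only handles a single linear chain.  In the real system a
backing object can be shared by several chains, so a page is only
"completely shadowed" when every chain that reaches it hides it.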

--

These changes significantly improve page fault performance, particularly
under heavy concurrent loads.

* Kernel overhead during the 'synth everything' bulk build is now under 15%
system time; it used to be over 20%.  Overhead here is system time /
(system time + user time).  Tested on the Threadripper (32-core/64-thread).

* The heavy use of shared mmap()s across processes no longer multiplies the
pv_entry use, saving a lot of memory.  This can be particularly important
for postgres.

* Concurrent page faults now have essentially no SMP lock contention and
only four cache-line bounces for atomic ops per fault (something that we
may now also be able to address, using the new work as a basis).

* Zero-fill fault rate appears to max out the CPU chip's internal data
busses, though there is still room for improvement.  I top out at 6.4M
zfod/sec (around 25 GBytes/sec worth of zero-fill faults) on the
Threadripper and I can't seem to get it to go higher.  Note that obviously
there is a little more dynamic RAM overhead than that from the executing
kernel code, but still...

* Heavy concurrent exec rate on the TR (all 64 threads) for a shared
dynamic binary increases from around 6000/sec to 45000/sec.  This is
actually important, because bulk builds

* Heavy concurrent exec rate on the TR for independent static binaries now
caps out at around 450000 execs per second, which is an insanely high
number.

* Single-threaded page fault rate is still a bit wonky but hit 500K-700K
faults/sec (2-3 GBytes/sec).
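
The GBytes/sec figures above follow from the fault rates if each fault
touches one base page.  A quick check, assuming 4096-byte pages (the page
size is not stated above, so treat it as an assumption):

```python
PAGE_SIZE = 4096  # assumed base page size in bytes


def fault_bandwidth_gb(faults_per_sec, page_size=PAGE_SIZE):
    """Bytes covered by page faults per second, in units of 10^9 bytes."""
    return faults_per_sec * page_size / 1e9


if __name__ == "__main__":
    rates = [
        ("concurrent zero-fill", 6.4e6),
        ("single-threaded, low", 500e3),
        ("single-threaded, high", 700e3),
    ]
    for label, rate in rates:
        print(f"{label:22s} {fault_bandwidth_gb(rate):6.2f} GB/s")
```

6.4M faults/sec works out to about 26 x 10^9 bytes/sec (a bit over 24 GiB/sec),
consistent with "around 25 GBytes/sec", and 500K-700K faults/sec lands in the
2-3 GBytes/sec range.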

--

Small system comparison using a Ryzen 2400G (4-core/8-thread), release vs
master (this includes other work that has gone into master since the last
release, too):

* Single threaded exec rate (shared dynamic binary) - 3180/sec to 3650/sec

* Single threaded exec rate (independent static binary) - 10307/sec to
12443/sec

* Concurrent exec rate (shared dynamic binary x 8) - 15160/sec to 19600/sec

* Concurrent exec rate (independent static binary x 8) - 60800/sec to
78900/sec

* Single threaded zero-fill fault rate - 550K zfod/sec to 604K zfod/sec

* Concurrent zero-fill fault rate (8 threads) - 1.2M zfod/sec to 1.7M
zfod/sec

* make -j 16 buildkernel test (tmpfs /usr/src, tmpfs /usr/obj):

    4.4% improvement in overall time on the first run (6.2% improvement on
    subsequent runs).  system% drops from 15.6% to 11.2% of total cpu
    seconds, which is a kernel overhead reduction of 31%.  Note that the
    increased time on release is probably due to inefficient buffer cache
    recycling.

    1309.445u 242.506s 3:53.54 664.5%   (release)
    1315.890u 258.165s 4:00.97 653.2%   (release, run 2)
    1318.458u 259.394s 4:00.51 656.0%   (release, run 3)

    1329.099u 167.351s 3:46.05 661.9%   (master)
    1335.791u 169.270s 3:46.13 665.5%   (master, run 2)
    1334.925u 169.779s 3:46.92 663.0%   (master, run 3)
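
For anyone who wants to rerun the arithmetic on their own numbers, here is
a small sketch that parses csh-style time lines like the ones above and
computes system% and the reduction in system seconds.  The helper names are
made up for this example; the input lines are copied from the first two
runs listed above.

```python
import re

RUNS = {
    "release":       "1309.445u 242.506s 3:53.54 664.5%",
    "release run 2": "1315.890u 258.165s 4:00.97 653.2%",
    "master":        "1329.099u 167.351s 3:46.05 661.9%",
    "master run 2":  "1335.791u 169.270s 3:46.13 665.5%",
}


def parse_time(line):
    """Return (user_sec, system_sec, wall_sec) from 'Xu Ys M:SS.ss P%'."""
    m = re.match(r"\s*([\d.]+)u\s+([\d.]+)s\s+(\d+):([\d.]+)", line)
    if m is None:
        raise ValueError("unrecognized time line: %r" % line)
    user, system = float(m.group(1)), float(m.group(2))
    wall = int(m.group(3)) * 60 + float(m.group(4))
    return user, system, wall


def system_fraction(user, system):
    """system time / (system time + user time)."""
    return system / (system + user)


def reduction(before, after):
    """Fractional reduction going from before to after."""
    return 1.0 - after / before


if __name__ == "__main__":
    rel = parse_time(RUNS["release"])
    mas = parse_time(RUNS["master"])
    print("release system%%: %.1f" % (100 * system_fraction(rel[0], rel[1])))
    print("master  system%%: %.1f" % (100 * system_fraction(mas[0], mas[1])))
    print("system seconds reduction: %.1f%%" % (100 * reduction(rel[1], mas[1])))
    rel2 = parse_time(RUNS["release run 2"])
    mas2 = parse_time(RUNS["master run 2"])
    print("run 2 wall-clock improvement: %.1f%%"
          % (100 * reduction(rel2[2], mas2[2])))
```

The first-run lines give 15.6% and 11.2% for system%, and the drop in
system seconds is about 31%.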

-Matt