Matthew Dillon dillon at backplane.com
Fri Jun 14 10:03:54 PDT 2019

                                           June 9 2019
                    5.4 RELEASE vs UPCOMING 5.6 RELEASE

Here is a set of simple semi-scientific tests headlining performance
improvements in the upcoming 5.6 release over the 5.4 release.  These
improvements were primarily obtained by rewriting (again) major chunks of
the VM system and the PMAP system.

Prior work was able to move many exclusive locks to shared locks.  This new
work is able to do away with many locks entirely, and reduces the amount of
cache-line ping-ponging occurring between cpu cores when taking faults on
shared VM objects.

These tests were done on a little Haswell 2/4 box and on a Xeon 16/32
dual-socket box.  They demonstrate the following:

    * The massive VM rework modestly reduces per-thread VM fault overheads
      and significantly reduces VM fault overheads on shared VM pages.

      Thus we see a MASSIVE improvement in the concurrent self-exec tests
      when any part of the binary is shared or if it is a dynamic binary
      (uses shared libraries).

      We see a modest improvement for ad-hoc concurrent compile tests.

      We see a small improvement in the buildkernel test on the haswell
      and a more significant improvement on the xeon, which roughly matches
      expectations.  Buildkernel bottlenecks in the linker and a few other
      places (even with NO_MODULES=TRUE).  What is important to note here
      is the huge reduction in system time.  System time dropped by 40%.

    * The zero-fill fault rate has significantly improved.  It's a bit hard
      to test because I am butting up against bandwidth limitations in the
      hardware, but the improvement is a very real 17% (haswell) and
      14% (xeon), respectively.

    * Scheduler fixes in 5.6 improve concurrency and reduce cache-line
      ping-ponging.  Note, however, that the scheduler heuristic in 5.4
      was a bit broken so this mostly restores scheduler performance from
      5.2.  This only affects the DOCOMP test (see note 2 below).

                            Other observations (not shown here)

    * The VM rework got rid of all pv_entry structures for terminal PTEs.
      This can save an enormous amount of ram in certain limited situations
      such as a postgres server with many service processes sharing a
      huge, shared-memory cache.

    * There is a huge reduction in system overheads in some tests.  In fact,
      in most tests, but keep in mind that most tests are already cpu-bound
      in user-mode so the overall real-time improvement in those tests is
      more modest.

    * In synth-based bulk runs I am observing a drop in system overhead
      from 15-20% to 10-15%, and the bulk build does appear to take
      commensurately less time (around 5%).

      That said, certain aspects of the synth bulk run are much, much faster
      now.  The port scans used to be able to run around 5%/sec on our
      (and that was already considered fast!).  Now the port scans run
      around 10%/sec.
      This is because the insane concurrent exec load involved with doing
      a port scan is directly impacted by this work.

    SELF-EXEC           This tests a concurrent exec loop sequencing across
                        N CPUs.  It is a simple program which exec's itself
                        and otherwise does nothing.

                        We test (1) A statically linked binary that copies
                        itself to $NAME.$N so each cpu is exec()ing a
                        separate copy, (2) A statically linked binary that
                        does not do the copy step so multiple CPUs are
                        exec()ing the same binary.  (3) A dynamic binary
                        that copies itself (but not the shared libraries
                        it links against), meaning that the shared libraries
                        cause shared faults, and (4) A dynamic binary that
                        is fully shared, along with the libraries, so all vnode
                        faults are shared faults.
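
The core of the self-exec loop can be sketched roughly as follows.  This is
not the doexec.c source used for the numbers in this post, only a minimal
Python illustration of a program that replaces itself with a fresh copy of
itself a fixed number of times.  The script name selfexec.py, the counter
passed in argv, and the helper run_self_exec_chain() are all hypothetical,
and the sketch runs a single chain rather than one chain per CPU.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for the self-exec program: exec itself again until
# the counter in argv[1] reaches zero, then print a marker and exit.
CHILD_SOURCE = textwrap.dedent("""
    import os, sys
    remaining = int(sys.argv[1])
    if remaining > 0:
        # Replace this process image with a new instance of the same script.
        os.execv(sys.executable,
                 [sys.executable, sys.argv[0], str(remaining - 1)])
    print("done")
""")


def run_self_exec_chain(execs: int) -> str:
    """Run one chain of `execs` self-execs and return what the last one prints."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "selfexec.py")
        with open(script, "w") as fh:
            fh.write(CHILD_SOURCE)
        result = subprocess.run(
            [sys.executable, script, str(execs)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
```

Copying the script to a separate file per CPU before starting each chain
would correspond to the "copies itself" variants, while pointing every chain
at the same file corresponds to the shared variants.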

    FAULTZF             This tests N concurrent processes doing zero-fill
                        VM faults in a private per-process mmap().  Each
                        process is doing a mmap()/force-faults/munmap()
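
One pass of the mmap()/force-faults/munmap() cycle might look like the sketch
below.  It is an illustration only, not the faultzf source: the function name
zero_fill_pass() and the choice to write one byte per page are assumptions,
and a real test would run this in a loop in N processes at once.

```python
import mmap


def zero_fill_pass(size_bytes: int) -> int:
    """Map private anonymous memory, touch every page once, then unmap.

    Returns the number of pages touched.
    """
    page = mmap.PAGESIZE
    region = mmap.mmap(
        -1, size_bytes,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    try:
        touched = 0
        for offset in range(0, size_bytes, page):
            # Anonymous memory reads back as zero; the first write to each
            # page is what makes the kernel supply a zero-filled page.
            assert region[offset] == 0
            region[offset] = 1
            touched += 1
        return touched
    finally:
        region.close()  # releases the mapping
```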

    DOCOMP              This does N concurrent compiles of a small .c
                        waits for them to complete, and then loops.  The
                        compiler is locked to gcc-4.7 (same compiler for
                        release vs master).

                        This is a good concurrent fork/exec/exit/wait test
                        with a smattering of file ops and VM faults in the
                        mix.  It tests some of the scheduler heuristics,
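
The shape of that loop (start N jobs, wait for all of them, repeat) can be
sketched as below.  This is not the docomp source.  The JOB command is a
hypothetical placeholder that just starts and exits a Python interpreter; the
real test runs the compiler on a small C file instead.

```python
import subprocess
import sys

# Hypothetical stand-in for a single compiler invocation.
JOB = [sys.executable, "-c", "pass"]


def run_rounds(jobs: int, rounds: int) -> int:
    """Start `jobs` processes at once, wait for all of them, and repeat.

    Returns how many processes exited successfully.
    """
    completed = 0
    for _ in range(rounds):
        # Launch the whole batch before waiting on any of them.
        children = [subprocess.Popen(JOB) for _ in range(jobs)]
        for child in children:
            if child.wait() == 0:
                completed += 1
    return completed
```

Dividing the returned count by the elapsed wall-clock time gives an exec rate
comparable in spirit to the one reported for this test.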

    NATIVEKERNEL        This does a high-concurrency buildkernel test that
                        does not include modules, simulating a typical high-
                        concurrency single-project compile workload.  The
                        compiler is locked to gcc-8.x.  Again, the same

                     Improvement in 5.6 over 5.4

                        HASWELL         XEON
                        2/4             16/32
                        ------          ------
SELF-EXEC S/DI          +23%            +18%
SELF-EXEC S/SH          +28%            +71%            <---- YES, THIS IS
SELF-EXEC D/DI          +23%            +242% (note 1)  <---- YES, THIS IS
SELF-EXEC D/SH          +24%            +234% (note 1)  <---- YES, THIS IS
FAULTZF                 +17%            +14%
DOCOMP                  +22%            +42%  (note 2)
NATIVEKERNEL            +5.1%           +8.1% (note 3)

                        note 1: These multi-core improvements are the
                                real thing, due to the VM work in 5.6.

                        note 2: +42% on Xeon mostly due to broken scheduler
                                heuristics in 5.4 that are fixed in 5.6.  It
                                is more like +17% if we discount the

                        note 3: Note the huge reduction in system time,
                                but overall improvement is smaller due to
                                make bottlenecking in the linker and the
                                fact that the compile is already mostly
                                user-bound.  Still, 5.6 improves what
                                concurrency there is fairly significantly.
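
As a cross-check, the percentages in the table can be reproduced from the raw
samples listed further down.  The sketch below assumes each figure is the
ratio of the mean 5.6 rate to the mean 5.4 rate, which is a guess at the
method rather than something stated in this post.  The sample values are the
Haswell static, separate-copy self-exec runs.

```python
from statistics import mean


def improvement_percent(old_rates, new_rates):
    """Percent gain of the mean new rate over the mean old rate."""
    return (mean(new_rates) / mean(old_rates) - 1.0) * 100.0


# Haswell, static binary, separate copy per cpu (execs/sec).
haswell_54 = [22474.74, 22679.16, 22711.79, 22688.36, 22690.91, 22683.06]
haswell_56 = [28171.27, 28087.63, 27816.94, 27925.36, 27437.11, 27909.60]

print(f"{improvement_percent(haswell_54, haswell_56):+.0f}%")  # prints +23%
```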



    cc src/doexec.c -o /tmp/doexec -static -O2
    /tmp/doexec 4       (haswell)
    /tmp/doexec 64      (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    22474.74/sec        28171.27/sec            HASWELL
    22679.16/sec        28087.63/sec
    22711.79/sec        27816.94/sec
    22688.36/sec        27925.36/sec
    22690.91/sec        27437.11/sec
    22683.06/sec        27909.60/sec

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    124849.28/sec       147981.37/sec           XEON
    124866.95/sec       147749.73/sec
    124703.78/sec       148358.26/sec
    124787.40/sec       148329.21/sec
    124846.69/sec       147842.95/sec
    124963.14/sec       147737.34/sec


    cc src/doexecsh.c -o /tmp/doexecsh -O2 -static
    /tmp/doexecsh 4     (haswell)
    /tmp/doexecsh 64    (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    21229.12/sec        27145.65/sec            HASWELL
    21241.22/sec        27147.10/sec
    21318.51/sec        27143.30/sec
    21290.39/sec        27143.84/sec
    21289.14/sec        27139.78/sec
    21251.96/sec        27138.93/sec
    21267.60/sec        27147.17/sec

    80975.02/sec        139732.58/sec           XEON
    80874.21/sec        139366.47/sec
    81029.80/sec        139963.00/sec
    80929.62/sec        139797.41/sec
    81071.90/sec        139151.49/sec
    81135.47/sec        137817.00/sec


    cc src/doexec.c -o /tmp/doexec -O2
    /tmp/doexec 4       (haswell)
    /tmp/doexec 64      (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    6216.53/sec         7723.94/sec             HASWELL
    6229.60/sec         7736.56/sec
    6241.70/sec         7735.23/sec
    6236.36/sec         7424.74/sec
    6225.52/sec         7718.90/sec
    6270.69/sec         7721.84/sec
    6242.00/sec         7754.94/sec

    8513.02/sec         27817.08/sec            XEON
    8112.19/sec         27819.83/sec
    8030.58/sec         27815.80/sec
    7933.86/sec         27852.83/sec
    8057.95/sec         27960.89/sec
    8200.34/sec         27835.59/sec


    cc src/doexecsh.c -o /tmp/doexecsh -O2
    /tmp/doexecsh 4

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    6336.83/sec         7850.05/sec             HASWELL
    6319.74/sec         7754.97/sec
    6260.55/sec         7834.43/sec
    6284.35/sec         7848.13/sec
    6315.98/sec         7845.17/sec
    6307.38/sec         7859.41/sec

    8317.51/sec         28002.97/sec            XEON
    8180.47/sec         28001.50/sec
    8367.12/sec         27950.22/sec
    8157.12/sec         28021.92/sec
    8622.51/sec         27848.14/sec
    8575.74/sec         27949.75/sec


    bin/faultzf 4       (Haswell 4-banger)
    2.001u 43.635s 0:11.79 387.0%   1+65k 2+0io 0pf+0w (5.4)
    2.155u 43.082s 0:11.98 377.5%   2+66k 0+0io 0pf+0w (5.4)
    2.120u 43.838s 0:11.68 393.4%   1+65k 0+0io 0pf+0w (5.4)
    (roughly 4.3 GBytes/sec)

    bin/faultzf 4       (Haswell 4-banger)
    2.246u 35.872s 0:10.15 375.4%   2+66k 0+0io 0pf+0w (5.6)
    1.791u 36.971s 0:10.02 386.8%   2+66k 0+0io 0pf+0w (5.6)
    2.162u 36.264s 0:10.15 378.5%   2+66k 0+0io 0pf+0w (5.6)
    (roughly 5.0 GBytes/sec)

    bin/faultzf 32      (Dual-socket Xeon)
    24.195u 525.055s 0:18.86 2912.1%        1+65k 0+0io 0pf+0w  (5.4)
    23.712u 524.950s 0:18.21 3012.9%        2+66k 0+0io 0pf+0w  (5.4)
    23.908u 525.896s 0:18.93 2904.3%        1+65k 0+0io 0pf+0w  (5.4)
    (roughly 23 GBytes/sec)

    bin/faultzf 32      (Dual-socket Xeon)
    22.920u 396.517s 0:16.70 2511.5%        1+65k 0+0io 0pf+0w  (5.6)
    24.705u 401.053s 0:16.49 2581.8%        2+66k 0+0io 0pf+0w  (5.6)
    23.858u 405.876s 0:16.15 2660.8%        1+65k 0+0io 0pf+0w  (5.6)
    (roughly 26 GBytes/sec)


    bin/docomp 8
    OBSERVED EXEC RATE  (Haswell 4-banger)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    556                 668                     HASWELL
    578                 662
    527                 675
    537                 663
    548                 687
    551                 679

    bin/docomp 64
    OBSERVED EXEC RATE  (Dual-socket Xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    2073                2871                    XEON
    2025                2777
    2017                2980
    2024                2763
    1852                2821
    2002                2839


    setenv WORLDCCVER gcc80
    setenv CCVER gcc80
    cpdup /usr/src /tmp/src1
    cd /tmp/src1

    HASWELL 2/4
    time make -j 8 nativekernel NO_MODULES=TRUE >& /tmp/bk.out

    563.022u 86.834s 3:55.15 276.3% 10077+756k 40454+106478io 8124pf+0w  (5.4)
    562.748u 88.052s 4:07.72 262.7% 10049+754k 40176+109428io 8030pf+0w  (5.4)
    563.022u 87.575s 4:00.15 270.9% 10053+754k 40174+113226io 8030pf+0w  (5.4)

    556.712u 53.553s 3:47.34 268.4% 10211+767k 40294+105584io 8196pf+0w  (5.6)
    555.731u 54.090s 3:49.96 265.1% 10236+768k 40122+105520io 8028pf+0w  (5.6)
    553.969u 54.987s 3:49.69 265.1% 10240+769k 41300+104052io 8028pf+0w  (5.6)

    XEON 16/32
    time make -j 64 nativekernel NO_MODULES=TRUE >& /tmp/bk.out

    754.497u 104.502s 1:18.57 1093.2%       10074+755k 21418+8io 146pf+0w  (5.4)
    756.155u 105.958s 1:18.40 1099.6%       10065+754k 21418+8io 146pf+0w  (5.4)
    755.878u 107.940s 1:18.38 1102.0%       10037+753k 21418+8io 146pf+0w  (5.4)
    757.779u 107.833s 1:18.66 1100.4%       10049+753k 21418+8io 146pf+0w  (5.4)

    760.121u 67.709s 1:12.53 1141.3%        10232+767k 21388+8io 146pf+0w  (5.6)
    760.652u 66.611s 1:12.63 1139.0%        10239+767k 21388+8io 146pf+0w  (5.6)
    762.902u 66.742s 1:12.72 1140.8%        10249+768k 22508+8io 212pf+0w  (5.6)
    758.254u 67.169s 1:12.41 1139.9%        10240+767k 21388+8io 146pf+0w  (5.6)
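
The system-time reduction can be read straight off the time output above.
The sketch below parses the first three fields of each line (user seconds,
system seconds, elapsed time) and compares mean system time.  It assumes the
first group in each pair is the 5.4 run and the second is the 5.6 run, which
fits the higher system time in the first group; the helper names are
hypothetical.

```python
from statistics import mean


def parse_csh_time(line: str) -> tuple[float, float, float]:
    """Return (user, system, elapsed) in seconds from a csh `time` line."""
    fields = line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    elapsed = 0.0
    for part in fields[2].split(":"):  # handles m:ss.ss and h:mm:ss
        elapsed = elapsed * 60 + float(part)
    return user, system, elapsed


def system_time_reduction(old_lines, new_lines) -> float:
    """Percent drop in mean system time from old_lines to new_lines."""
    old = mean(parse_csh_time(line)[1] for line in old_lines)
    new = mean(parse_csh_time(line)[1] for line in new_lines)
    return (1.0 - new / old) * 100.0


# Haswell nativekernel runs, first three fields only.
haswell_old = ["563.022u 86.834s 3:55.15", "562.748u 88.052s 4:07.72",
               "563.022u 87.575s 4:00.15"]
haswell_new = ["556.712u 53.553s 3:47.34", "555.731u 54.090s 3:49.96",
               "553.969u 54.987s 3:49.69"]
```

With these inputs system_time_reduction(haswell_old, haswell_new) comes out
at about 38%.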

