                                           June 9 2019
                    5.4 RELEASE vs UPCOMING 5.6 RELEASE

Here is a set of simple semi-scientific tests headlining performance
improvements in the upcoming 5.6 release over the 5.4 release.  These
improvements were primarily obtained by rewriting (again) major chunks of
the VM system and the PMAP system.

Prior work was able to move many exclusive locks to shared locks.  This new
work is able to do away with many locks entirely, and reduces the amount of
cache-line ping-ponging occurring between cpu cores when taking faults on
shared VM objects.

These tests were done on a little Haswell 2/4 box and on a Xeon 16/32
dual-socket box.  They demonstrate the following:

    * The massive VM rework modestly reduces per-thread VM fault overheads
      and significantly reduces VM fault overheads on shared VM pages.

      Thus we see a MASSIVE improvement in the concurrent self-exec tests
      when any part of the binary is shared or if it is a dynamic binary
      (uses shared libraries).

      We see a modest improvement for ad-hoc concurrent compile tests.

      We see a small improvement in the buildkernel test on the Haswell
      and a more significant improvement on the Xeon, which roughly matches
      expectations.  Buildkernel bottlenecks in the linker and a few other
      places (even with NO_MODULES=TRUE).  What is important to note here
      is the huge reduction in system time.  System time dropped by 40%.

    * The zero-fill fault rate has significantly improved.  It's a bit hard
      to test because I am butting up against bandwidth limitations in the
      hardware, but the improvement is a very real 17% (Haswell) and
      14% (Xeon), respectively.

    * Scheduler fixes in 5.6 improve concurrency and reduce cache-line
      ping-ponging.  Note, however, that the scheduler heuristic in 5.4
      was a bit broken, so this mostly restores scheduler performance from
      5.2.  This only affects the DOCOMP test (see note 2 below).

                    Other observations (not shown here)

    * The VM rework got rid of all pv_entry structures for terminal PTEs.
      This can save an enormous amount of ram in certain limited situations,
      such as a postgres server with many service processes sharing a single,
      huge, shared-memory cache.

    * There is a huge reduction in system overheads in some tests.  In fact,
      in most tests, but keep in mind that most tests are already cpu-bound
      in user-mode, so the overall real-time improvement in those tests is
      more modest.

    * In synth-based bulk runs I am observing a drop in system overhead
      from 15-20% to 10-15%, and the bulk build does appear to take
      commensurately less time (around 5%).

      That said, certain aspects of the synth bulk run are much, much faster
      now.  The port scans used to run at around 5%/sec on our threadripper
      (and that was already considered fast!).  Now the port scans run at
      around 10%/sec.  This is because the insane concurrent exec load
      involved in doing the port scan is directly impacted by this work.


SELF-EXEC TESTS         This tests a concurrent exec loop sequencing across
                        N CPUs.  It is a simple program which exec's itself
                        and otherwise does nothing (a sketch is shown after
                        the variant list below).

                        We test (1) a statically linked binary that copies
                        itself to $NAME.$N so each cpu is exec()ing a
                        separate copy, (2) a statically linked binary that
                        does not do the copy step, so multiple CPUs are
                        exec()ing the same binary, (3) a dynamic binary
                        that copies itself (but not the shared libraries
                        it links against), meaning that the shared libraries
                        cause shared faults, and (4) a dynamic binary that is
                        fully shared, along with the libraries, so all vnode
                        faults are shared faults.
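
                        As an illustration only, a self-exec loop of this
                        shape might look like the sketch below.  This is
                        not the actual src/doexec.c; it omits the copy-to-
                        $NAME.$N step and the exec-rate accounting that the
                        real tests perform.

    /*
     * Hypothetical self-exec loop sketch.  The parent forks one worker
     * per cpu; each worker then exec's its own binary forever.  The
     * extra "x" argument marks an already-forked worker.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            int n;

            if (argc == 3) {
                    /* worker: just exec ourselves again */
                    execl(argv[0], argv[0], argv[1], "x", (char *)NULL);
                    perror("execl");
                    exit(1);
            }
            if (argc != 2) {
                    fprintf(stderr, "usage: %s ncpus\n", argv[0]);
                    exit(1);
            }
            for (n = strtol(argv[1], NULL, 10); n > 1; --n) {
                    if (fork() == 0)
                            break;          /* child becomes a worker */
            }
            execl(argv[0], argv[0], argv[1], "x", (char *)NULL);
            perror("execl");
            exit(1);
    }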

FAULTZF                 This tests N concurrent processes doing zero-fill
                        VM faults in a private per-process mmap().  Each
                        process is doing a mmap()/force-faults/munmap() loop
                        (a sketch of this loop follows these test
                        descriptions).

DOCOMP                  This does N concurrent compiles of a small .c
                        program, waits for them to complete, and then loops.
                        The compiler is locked to gcc-4.7 (same compiler for
                        release vs master).

                        This is a good concurrent fork/exec/exit/wait test
                        with a smattering of file ops and VM faults in the
                        mix.  It tests some of the scheduler heuristics, too.
                        (A sketch of this loop appears after the DOCOMP
                        results below.)

NATIVEKERNEL            This does a high-concurrency buildkernel test that
                        does not include modules, simulating a typical high-
                        concurrency single-project compile workload.  The
                        compiler is locked to gcc-8.x.  Again, the same
                        compiler.
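
                        As noted under FAULTZF above, each process runs a
                        mmap()/force-faults/munmap() loop.  A minimal sketch
                        follows; it is illustrative rather than the actual
                        bin/faultzf source.  The 256MB map size is an
                        assumption, and the real test bounds its run and
                        reports a fault rate, whereas this sketch loops
                        forever.

    /*
     * Hypothetical zero-fill fault loop sketch.  Each of N processes
     * repeatedly maps private anonymous memory, touches one byte per
     * page (forcing a zero-fill VM fault for each page), and unmaps.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define MAPSIZE (256L * 1024 * 1024)    /* assumed map size */
    #define PGSIZE  4096

    int
    main(int argc, char **argv)
    {
            char *base;
            long off;
            int n;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s nprocs\n", argv[0]);
                    exit(1);
            }
            for (n = strtol(argv[1], NULL, 10); n > 1; --n) {
                    if (fork() == 0)
                            break;
            }
            for (;;) {
                    base = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
                    if (base == MAP_FAILED) {
                            perror("mmap");
                            exit(1);
                    }
                    for (off = 0; off < MAPSIZE; off += PGSIZE)
                            base[off] = 1;  /* zero-fill fault */
                    munmap(base, MAPSIZE);
            }
    }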

                --------------------------------------
                     Improvement in 5.6 over 5.4

                        HASWELL         XEON
                        2/4             16/32
                        ------          ------
SELF-EXEC S/DI          +23%            +18%
SELF-EXEC S/SH          +28%            +71%           <---- YES, THIS IS REAL
SELF-EXEC D/DI          +23%            +242% (note 1) <---- YES, THIS IS REAL
SELF-EXEC D/SH          +24%            +234% (note 1) <---- YES, THIS IS REAL
FAULTZF                 +17%            +14%
DOCOMP                  +22%            +42%  (note 2)
NATIVEKERNEL            +5.1%           +8.1% (note 3)

                        note 1: These multi-core improvements are the
                                real thing, due to the VM work in 5.6.

                        note 2: The +42% on the Xeon is mostly due to broken
                                scheduler heuristics in 5.4 that are fixed
                                in 5.6.  It is more like +17% if we discount
                                the scheduler fixes.

                        note 3: Note the huge reduction in system time,
                                but the overall improvement is smaller due
                                to make bottlenecking in the linker and the
                                fact that the compile is already mostly
                                user-bound.  Still, 5.6 improves what
                                concurrency there is fairly significantly.

                --------------------------------------


SELF-EXEC TEST, STATIC BINARY, DISCRETE BINARIES (e.g. /tmp/doexec.$N)

    cc src/doexec.c -o /tmp/doexec -static -O2
    /tmp/doexec 4       (haswell)
    /tmp/doexec 64      (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    22474.74/sec        28171.27/sec            HASWELL
    22679.16/sec        28087.63/sec
    22711.79/sec        27816.94/sec
    22688.36/sec        27925.36/sec
    22690.91/sec        27437.11/sec
    22683.06/sec        27909.60/sec

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    124849.28/sec       147981.37/sec           XEON
    124866.95/sec       147749.73/sec
    124703.78/sec       148358.26/sec
    124787.40/sec       148329.21/sec
    124846.69/sec       147842.95/sec
    124963.14/sec       147737.34/sec


SELF-EXEC TEST, STATIC BINARY, SAME BINARY

    cc src/doexecsh.c -o /tmp/doexecsh -O2 -static
    /tmp/doexecsh 4     (haswell)
    /tmp/doexecsh 64    (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    21229.12/sec        27145.65/sec            HASWELL
    21241.22/sec        27147.10/sec
    21318.51/sec        27143.30/sec
    21290.39/sec        27143.84/sec
    21289.14/sec        27139.78/sec
    21251.96/sec        27138.93/sec
    21267.60/sec        27147.17/sec

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    80975.02/sec        139732.58/sec           XEON
    80874.21/sec        139366.47/sec
    81029.80/sec        139963.00/sec
    80929.62/sec        139797.41/sec
    81071.90/sec        139151.49/sec
    81135.47/sec        137817.00/sec
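
    (As a cross-check on the summary table: averaging the HASWELL runs
    above gives roughly 22655/sec for 5.4 and 27891/sec for 5.6, and
    27891 / 22655 = 1.231, i.e. the +23% listed for SELF-EXEC S/DI.
    The other table entries appear to derive from their raw runs in the
    same way.)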

SELF-EXEC TEST, SHARED BINARY, DISCRETE BINARIES (e.g. /tmp/doexec.$N)

    cc src/doexec.c -o /tmp/doexec -O2
    /tmp/doexec 4       (haswell)
    /tmp/doexec 64      (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    6216.53/sec         7723.94/sec             HASWELL
    6229.60/sec         7736.56/sec
    6241.70/sec         7735.23/sec
    6236.36/sec         7424.74/sec
    6225.52/sec         7718.90/sec
    6270.69/sec         7721.84/sec
    6242.00/sec         7754.94/sec

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    8513.02/sec         27817.08/sec            XEON
    8112.19/sec         27819.83/sec
    8030.58/sec         27815.80/sec
    7933.86/sec         27852.83/sec
    8057.95/sec         27960.89/sec
    8200.34/sec         27835.59/sec


SELF-EXEC TEST, SHARED BINARY, SAME BINARY

    cc src/doexecsh.c -o /tmp/doexecsh -O2
    /tmp/doexecsh 4     (haswell)
    /tmp/doexecsh 64    (xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    6336.83/sec         7850.05/sec             HASWELL
    6319.74/sec         7754.97/sec
    6260.55/sec         7834.43/sec
    6284.35/sec         7848.13/sec
    6315.98/sec         7845.17/sec
    6307.38/sec         7859.41/sec

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    8317.51/sec         28002.97/sec            XEON
    8180.47/sec         28001.50/sec
    8367.12/sec         27950.22/sec
    8157.12/sec         28021.92/sec
    8622.51/sec         27848.14/sec
    8575.74/sec         27949.75/sec


FAULTZF

    bin/faultzf 4       (Haswell 4-banger)
    2.001u 43.635s 0:11.79 387.0%   1+65k 2+0io 0pf+0w (5.4)
    2.155u 43.082s 0:11.98 377.5%   2+66k 0+0io 0pf+0w (5.4)
    2.120u 43.838s 0:11.68 393.4%   1+65k 0+0io 0pf+0w (5.4)
    (roughly 4.3 GBytes/sec)

    bin/faultzf 4       (Haswell 4-banger)
    2.246u 35.872s 0:10.15 375.4%   2+66k 0+0io 0pf+0w (5.6)
    1.791u 36.971s 0:10.02 386.8%   2+66k 0+0io 0pf+0w (5.6)
    2.162u 36.264s 0:10.15 378.5%   2+66k 0+0io 0pf+0w (5.6)
    (roughly 5.0 GBytes/sec)

    bin/faultzf 32      (Dual-socket Xeon)
    24.195u 525.055s 0:18.86 2912.1%        1+65k 0+0io 0pf+0w  (5.4)
    23.712u 524.950s 0:18.21 3012.9%        2+66k 0+0io 0pf+0w  (5.4)
    23.908u 525.896s 0:18.93 2904.3%        1+65k 0+0io 0pf+0w  (5.4)
    (roughly 23 GBytes/sec)

    bin/faultzf 32      (Dual-socket Xeon)
    22.920u 396.517s 0:16.70 2511.5%        1+65k 0+0io 0pf+0w  (5.6)
    24.705u 401.053s 0:16.49 2581.8%        2+66k 0+0io 0pf+0w  (5.6)
    23.858u 405.876s 0:16.15 2660.8%        1+65k 0+0io 0pf+0w  (5.6)
    (roughly 26 GBytes/sec)


DOCOMP

    bin/docomp 8
    OBSERVED EXEC RATE  (Haswell 4-banger)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    556                 668                     HASWELL
    578                 662
    527                 675
    537                 663
    548                 687
    551                 679

    bin/docomp 64
    OBSERVED EXEC RATE  (Dual-socket Xeon)

    DFLY-5.4            DFLY-5.6
    ------------        ------------
    2073                2871                    XEON
    2025                2777
    2017                2980
    2024                2763
    1852                2821
    2002                2839
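
    For reference, the DOCOMP loop described earlier is roughly this shape.
    This is an illustrative sketch, not the actual bin/docomp; the gcc47
    compiler name, the /tmp/small.c source file, and the omission of the
    exec-rate reporting are all assumptions.

    /*
     * Hypothetical DOCOMP-style loop sketch.  Fires off N concurrent
     * compiles of a small .c file, waits for all of them to exit, and
     * then repeats.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int
    main(int argc, char **argv)
    {
            int i, n;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s nprocs\n", argv[0]);
                    exit(1);
            }
            n = strtol(argv[1], NULL, 10);
            for (;;) {
                    for (i = 0; i < n; ++i) {
                            if (fork() == 0) {
                                    char obj[64];

                                    snprintf(obj, sizeof(obj),
                                        "/tmp/small.o.%d", i);
                                    execlp("gcc47", "gcc47", "-c",
                                        "/tmp/small.c", "-o", obj,
                                        (char *)NULL);
                                    _exit(1);
                            }
                    }
                    while (wait(NULL) > 0)  /* reap all N compiles */
                            ;
            }
    }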

NATIVEKERNEL

    setenv WORLDCCVER gcc80
    setenv CCVER gcc80
    cpdup /usr/src /tmp/src1
    cd /tmp/src1

    HASWELL 2/4
    time make -j 8 nativekernel NO_MODULES=TRUE >& /tmp/bk.out

    563.022u 86.834s 3:55.15 276.3% 10077+756k 40454+106478io 8124pf+0w (5.4)
    562.748u 88.052s 4:07.72 262.7% 10049+754k 40176+109428io 8030pf+0w (5.4)
    563.022u 87.575s 4:00.15 270.9% 10053+754k 40174+113226io 8030pf+0w (5.4)

    556.712u 53.553s 3:47.34 268.4% 10211+767k 40294+105584io 8196pf+0w (5.6)
    555.731u 54.090s 3:49.96 265.1% 10236+768k 40122+105520io 8028pf+0w (5.6)
    553.969u 54.987s 3:49.69 265.1% 10240+769k 41300+104052io 8028pf+0w (5.6)


    XEON 16/32
    time make -j 64 nativekernel NO_MODULES=TRUE >& /tmp/bk.out

    754.497u 104.502s 1:18.57 1093.2%       10074+755k 21418+8io 146pf+0w (5.4)
    756.155u 105.958s 1:18.40 1099.6%       10065+754k 21418+8io 146pf+0w (5.4)
    755.878u 107.940s 1:18.38 1102.0%       10037+753k 21418+8io 146pf+0w (5.4)
    757.779u 107.833s 1:18.66 1100.4%       10049+753k 21418+8io 146pf+0w (5.4)

    760.121u 67.709s 1:12.53 1141.3%        10232+767k 21388+8io 146pf+0w (5.6)
    760.652u 66.611s 1:12.63 1139.0%        10239+767k 21388+8io 146pf+0w (5.6)
    762.902u 66.742s 1:12.72 1140.8%        10249+768k 22508+8io 212pf+0w (5.6)
    758.254u 67.169s 1:12.41 1139.9%        10240+767k 21388+8io 146pf+0w (5.6)

-Matt