VM work will be in the upcoming 5.6 release.
Matthew Dillon
dillon at backplane.com
Fri Jun 14 10:03:54 PDT 2019
June 9 2019
5.4 RELEASE vs UPCOMING 5.6 RELEASE
Here is a set of simple semi-scientific tests headlining performance
improvements in the upcoming 5.6 release over the 5.4 release. These
improvements were primarily obtained by rewriting (again) major chunks of
the VM system and the PMAP system.
Prior work was able to move many exclusive locks to shared locks. This new
work is able to do away with many locks entirely, and reduces the amount of
cache-line ping-ponging occurring between cpu cores when taking faults on
shared VM objects.
These tests were done on a little Haswell 2/4 box and on a Xeon 16/32
dual-socket box. They demonstrate the following:
* The massive VM rework modestly reduces per-thread VM fault overheads
and significantly reduces VM fault overheads on shared VM pages.
Thus we see a MASSIVE improvement in the concurrent self-exec tests
when any part of the binary is shared or if it is a dynamic binary
(uses shared libraries).
We see a modest improvement for ad-hoc concurrent compile tests.
We see a small improvement in the buildkernel test on the haswell
and a more significant improvement on the xeon, which roughly matches
expectations. Buildkernel bottlenecks in the linker and a few other
places (even with NO_MODULES=TRUE). What is important to note here
is the huge reduction in system time. System time dropped by 40%.
* The zero-fill fault rate has significantly improved. It's a bit hard
to test because I am butting up against bandwidth limitations in the
hardware, but the improvement is a very real 17% (haswell) and
14% (xeon), respectively.
* Scheduler fixes in 5.6 improve concurrency and reduce cache-line
ping-ponging. Note, however, that the scheduler heuristic in 5.4
was a bit broken, so this mostly restores scheduler performance from
5.2. This only affects the DOCOMP test (see note 2 below).
Other observations (not shown here)
* The VM rework got rid of all pv_entry structures for terminal PTEs.
This can save an enormous amount of ram in certain limited situations
such as a postgres server with many service processes sharing a single,
huge, shared-memory cache.
* There is a huge reduction in system overheads in some tests (in fact,
in most tests), but keep in mind that most tests are already cpu-bound
in user-mode, so the overall real-time improvement in those tests is
more modest.
* In synth-based bulk runs I am observing a drop in system overhead
from 15-20% to 10-15%, and the bulk build does appear to take
commensurately less time (around 5%).
That said, certain aspects of the synth bulk run are much, much faster
now. The port scans used to be able to run around 5%/sec on our
threadripper (and that was already considered fast!). Now the port
scans run around 10%/sec. This is because the insane concurrent exec
load involved with doing the port scan is directly impacted by this
work.
SELF-EXEC TESTS
This tests a concurrent exec loop sequencing across N CPUs. It is a
simple program which exec's itself and otherwise does nothing.
We test (1) a statically linked binary that copies itself to $NAME.$N
so each cpu is exec()ing a separate copy, (2) a statically linked
binary that does not do the copy step, so multiple CPUs are exec()ing
the same binary, (3) a dynamic binary that copies itself (but not the
shared libraries it links against), meaning that the shared libraries
cause shared faults, and (4) a dynamic binary that is fully shared,
along with the libraries, so all vnode faults are shared faults.
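The test source is not included in this post, but the core of it is just
an exec-self loop. A minimal sketch (illustrative only, not the actual
src/doexec.c, which takes the cpu count as an argument, forks that many
workers, and reports the exec rate):

    /*
     * Sketch of one self-exec worker (hypothetical, not src/doexec.c).
     * Each invocation exec()s itself again and otherwise does nothing.
     * For the discrete-binary variant argv[0] would be /tmp/doexec.$N.
     */
    #include <err.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            char *args[2];

            (void)argc;
            args[0] = argv[0];
            args[1] = NULL;
            execv(argv[0], args);
            err(1, "execv");        /* only reached if the exec fails */
    }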
FAULTZF
This tests N concurrent processes doing zero-fill
VM faults in a private per-process mmap(). Each
process is doing a mmap()/force-faults/munmap()
loop.
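The faultzf source is likewise not included; one worker boils down to a
mmap/touch/munmap loop along these lines (a sketch only, not the actual
bin/faultzf; the mapping size here is arbitrary):

    /*
     * Sketch of one zero-fill fault worker (hypothetical, not bin/faultzf).
     * Map anonymous memory privately, touch one byte per page to force a
     * zero-fill fault, unmap, and repeat.  The real test runs N of these
     * concurrently.
     */
    #include <err.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAPSIZE (256UL * 1024 * 1024)   /* arbitrary size for the sketch */

    int
    main(void)
    {
            long pgsize = sysconf(_SC_PAGESIZE);
            size_t off;
            char *base;

            for (;;) {
                    base = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
                    if (base == MAP_FAILED)
                            err(1, "mmap");
                    for (off = 0; off < MAPSIZE; off += pgsize)
                            base[off] = 1;  /* each write zero-fill faults a page */
                    munmap(base, MAPSIZE);
            }
    }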
DOCOMP
This does N concurrent compiles of a small .c program, waits for them
to complete, and then loops. The compiler is locked to gcc-4.7 (same
compiler for release vs master).
This is a good concurrent fork/exec/exit/wait test with a smattering
of file ops and VM faults in the mix. It tests some of the scheduler
heuristics, too.
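As with the others, the docomp source is not part of this post; the
shape of the loop is roughly the following (a sketch only, not the
actual bin/docomp; the compiler name and paths are illustrative):

    /*
     * Sketch of the docomp loop (hypothetical, not bin/docomp).
     * Fork N children, each exec'ing the compiler on a small .c file,
     * wait for all of them, and loop.
     */
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            int n = (argc > 1) ? atoi(argv[1]) : 4;
            int i;
            pid_t pid;

            for (;;) {
                    for (i = 0; i < n; ++i) {
                            if ((pid = fork()) < 0)
                                    err(1, "fork");
                            if (pid == 0) {
                                    char obj[64];

                                    snprintf(obj, sizeof(obj),
                                        "/tmp/small.%d.o", i);
                                    execlp("gcc47", "gcc47", "-c",
                                        "/tmp/small.c", "-o", obj,
                                        (char *)NULL);
                                    err(1, "execlp");
                            }
                    }
                    while (wait(NULL) > 0)  /* reap all N compiles */
                            ;
            }
    }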
NATIVEKERNEL
This does a high-concurrency buildkernel test that does not include
modules, simulating a typical high-concurrency single-project compile
workload. The compiler is locked to gcc-8.x. Again, the same compiler
for both releases.
--------------------------------------
Improvement in 5.6 over 5.4

                        HASWELL     XEON
                          2/4       16/32
                        -------    -------
SELF-EXEC S/DI           +23%       +18%
SELF-EXEC S/SH           +28%       +71%           <---- YES, THIS IS REAL
SELF-EXEC D/DI           +23%      +242% (note 1)  <---- YES, THIS IS REAL
SELF-EXEC D/SH           +24%      +234% (note 1)  <---- YES, THIS IS REAL
FAULTZF                  +17%       +14%
DOCOMP                   +22%       +42% (note 2)
NATIVEKERNEL            +5.1%      +8.1% (note 3)

note 1: These multi-core improvements are the real thing, due to the
        VM work in 5.6.
note 2: +42% on the Xeon is mostly due to broken scheduler heuristics
        in 5.4 that are fixed in 5.6. It is more like +17% if we
        discount the scheduler fixes.
note 3: Note the huge reduction in system time, but the overall
        improvement is smaller due to make bottlenecking in the linker
        and the fact that the compile is already mostly user-bound.
        Still, 5.6 improves what concurrency there is fairly
        significantly.
--------------------------------------
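(The percentages above appear to correspond to the averages of the runs
shown below; e.g. for the static/discrete haswell case the 5.6 runs
average roughly 27.9k execs/sec versus roughly 22.7k execs/sec for 5.4,
i.e. about +23%.)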
SELF-EXEC TEST, STATIC BINARY, DISCRETE BINARIES (e.g. /tmp/doexec.$N)
cc src/doexec.c -o /tmp/doexec -static -O2
/tmp/doexec 4 (haswell)
/tmp/doexec 64 (xeon)
DFLY-5.4 DFLY-5.6
------------ -----------
22474.74/sec 28171.27/sec HASWELL
22679.16/sec 28087.63/sec
22711.79/sec 27816.94/sec
22688.36/sec 27925.36/sec
22690.91/sec 27437.11/sec
22683.06/sec 27909.60/sec
DFLY-5.4 DFLY-5.6
------------ -----------
124849.28/sec 147981.37/sec XEON
124866.95/sec 147749.73/sec
124703.78/sec 148358.26/sec
124787.40/sec 148329.21/sec
124846.69/sec 147842.95/sec
124963.14/sec 147737.34/sec
SELF-EXEC TEST, STATIC BINARY, SAME BINARY
cc src/doexecsh.c -o /tmp/doexecsh -O2 -static
/tmp/doexecsh 4 (haswell)
/tmp/doexecsh 64 (xeon)
DFLY-5.4 DFLY-5.6
------------ -----------
21229.12/sec 27145.65/sec HASWELL
21241.22/sec 27147.10/sec
21318.51/sec 27143.30/sec
21290.39/sec 27143.84/sec
21289.14/sec 27139.78/sec
21251.96/sec 27138.93/sec
21267.60/sec 27147.17/sec
80975.02/sec 139732.58/sec XEON
80874.21/sec 139366.47/sec
81029.80/sec 139963.00/sec
80929.62/sec 139797.41/sec
81071.90/sec 139151.49/sec
81135.47/sec 137817.00/sec
SELF-EXEC TEST, SHARED BINARY, DISCRETE BINARIES (e.g. /tmp/doexec.$N)
cc src/doexec.c -o /tmp/doexec -O2
/tmp/doexec 4 (haswell)
/tmp/doexec 64 (xeon)
DFLY-5.4 DFLY-5.6
------------ -----------
6216.53/sec 7723.94/sec HASWELL
6229.60/sec 7736.56/sec
6241.70/sec 7735.23/sec
6236.36/sec 7424.74/sec
6225.52/sec 7718.90/sec
6270.69/sec 7721.84/sec
6242.00/sec 7754.94/sec
8513.02/sec 27817.08/sec XEON
8112.19/sec 27819.83/sec
8030.58/sec 27815.80/sec
7933.86/sec 27852.83/sec
8057.95/sec 27960.89/sec
8200.34/sec 27835.59/sec
SELF-EXEC TEST, SHARED BINARY, SAME BINARY
cc src/doexecsh.c -o /tmp/doexecsh -O2
/tmp/doexecsh 4 (haswell)
/tmp/doexecsh 64 (xeon)
DFLY-5.4 DFLY-5.6
------------ -----------
6336.83/sec 7850.05/sec HASWELL
6319.74/sec 7754.97/sec
6260.55/sec 7834.43/sec
6284.35/sec 7848.13/sec
6315.98/sec 7845.17/sec
6307.38/sec 7859.41/sec
8317.51/sec 28002.97/sec XEON
8180.47/sec 28001.50/sec
8367.12/sec 27950.22/sec
8157.12/sec 28021.92/sec
8622.51/sec 27848.14/sec
8575.74/sec 27949.75/sec
FAULTZF
bin/faultzf 4 (Haswell 4-banger)
2.001u 43.635s 0:11.79 387.0% 1+65k 2+0io 0pf+0w (5.4)
2.155u 43.082s 0:11.98 377.5% 2+66k 0+0io 0pf+0w (5.4)
2.120u 43.838s 0:11.68 393.4% 1+65k 0+0io 0pf+0w (5.4)
(roughly 4.3 GBytes/sec)
bin/faultzf 4 (Haswell 4-banger)
2.246u 35.872s 0:10.15 375.4% 2+66k 0+0io 0pf+0w (5.6)
1.791u 36.971s 0:10.02 386.8% 2+66k 0+0io 0pf+0w (5.6)
2.162u 36.264s 0:10.15 378.5% 2+66k 0+0io 0pf+0w (5.6)
(roughly 5.0 GBytes/sec)
bin/faultzf 32 (Dual-socket Xeon)
24.195u 525.055s 0:18.86 2912.1% 1+65k 0+0io 0pf+0w (5.4)
23.712u 524.950s 0:18.21 3012.9% 2+66k 0+0io 0pf+0w (5.4)
23.908u 525.896s 0:18.93 2904.3% 1+65k 0+0io 0pf+0w (5.4)
(roughly 23 GBytes/sec)
bin/faultzf 32 (Dual-socket Xeon)
22.920u 396.517s 0:16.70 2511.5% 1+65k 0+0io 0pf+0w (5.6)
24.705u 401.053s 0:16.49 2581.8% 2+66k 0+0io 0pf+0w (5.6)
23.858u 405.876s 0:16.15 2660.8% 1+65k 0+0io 0pf+0w (5.6)
(roughly 26 GBytes/sec)
DOCOMP
bin/docomp 8
OBSERVED EXEC RATE (Haswell 4-banger)
DFLY-5.4 DFLY-5.6
------------ -----------
556 668 HASWELL
578 662
527 675
537 663
548 687
551 679
bin/docomp 64
OBSERVED EXEC RATE (Dual-socket Xeon)
DFLY-5.4 DFLY-5.6
------------ -----------
2073 2871 XEON
2025 2777
2017 2980
2024 2763
1852 2821
2002 2839
NATIVEKERNEL
setenv WORLDCCVER gcc80
setenv CCVER gcc80
cpdup /usr/src /tmp/src1
cd /tmp/src1
HASWELL 2/4
time make -j 8 nativekernel NO_MODULES=TRUE >& /tmp/bk.out
563.022u 86.834s 3:55.15 276.3% 10077+756k 40454+106478io 8124pf+0w (5.4)
562.748u 88.052s 4:07.72 262.7% 10049+754k 40176+109428io 8030pf+0w (5.4)
563.022u 87.575s 4:00.15 270.9% 10053+754k 40174+113226io 8030pf+0w (5.4)
556.712u 53.553s 3:47.34 268.4% 10211+767k 40294+105584io 8196pf+0w (5.6)
555.731u 54.090s 3:49.96 265.1% 10236+768k 40122+105520io 8028pf+0w (5.6)
553.969u 54.987s 3:49.69 265.1% 10240+769k 41300+104052io 8028pf+0w (5.6)
XEON 16/32
time make -j 64 nativekernel NO_MODULES=TRUE >& /tmp/bk.out
754.497u 104.502s 1:18.57 1093.2% 10074+755k 21418+8io 146pf+0w (5.4)
756.155u 105.958s 1:18.40 1099.6% 10065+754k 21418+8io 146pf+0w (5.4)
755.878u 107.940s 1:18.38 1102.0% 10037+753k 21418+8io 146pf+0w (5.4)
757.779u 107.833s 1:18.66 1100.4% 10049+753k 21418+8io 146pf+0w (5.4)
760.121u 67.709s 1:12.53 1141.3% 10232+767k 21388+8io 146pf+0w (5.6)
760.652u 66.611s 1:12.63 1139.0% 10239+767k 21388+8io 146pf+0w (5.6)
762.902u 66.742s 1:12.72 1140.8% 10249+768k 22508+8io 212pf+0w (5.6)
758.254u 67.169s 1:12.41 1139.9% 10240+767k 21388+8io 146pf+0w (5.6)
-Matt