Update on status of master - NUMA and cache-line locality

Matthew Dillon dillon at backplane.com
Fri Feb 24 22:46:49 PST 2017


Hello everyone!  We haven't decided when 4.8-release will roll, but we are
discussing it now.  As it currently stands, the master branch has gotten
quite stable after the tsunami of work I put into it over the last two
months.  While the kernel core and base system are in good shape, both DRM
and the state of DPorts are still on the table.  We need to coordinate with
the related maintainers and schedule the release roll.

The work that has gone into master the last two months has focused on the
minutiae of the effect of cache line bounces in various subsystems.  These
are not contested locks per se; instead, they reflect performance on
multi-socket systems when the same global might be hit by cpus on different
sockets.  These so-called cache-line bounces are cpu-cache-inefficient and
can have overheads in the 200ns+ range, severely limiting the rate of
uncontended lock operations that can be performed on said memory location.
In addition to that work, the vkernel support (with VMM disabled) has also
gone through some significant optimizations to reduce overheads in the real
kernel that stem from the vkernel's main memory resource being a single VM
object in the real kernel.  With VMM disabled, the vkernel is quite stable
now, and performance has improved very significantly.

--

Going back to the cache line bouncing work now.  The biggest improvement
has come from making the real kernel more NUMA aware and removing cache
line bounces from the pcpu vm_page allocation paths.  These paths were
already fairly conflict-free due to fanning the memory queues out.  The
main thrust of the work was to take the next step and partition the queues
so that the set of go-to queues each cpu tries to use is completely
independent of the set used by any other cpu.  Adding a little NUMA
awareness on top of that further localized the memory allocator.
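
Very roughly, the shape of that partitioning looks something like the
sketch below (hypothetical names and constants, not the actual vm_page
code): each cpu owns a private slice of the fanned-out free queues and,
within that slice, prefers queues backed by its own NUMA domain before
falling back to anything else.

/*
 * Hypothetical sketch of the idea, not the actual DragonFly
 * implementation.  The free-page queues are fanned out into NQUEUES
 * buckets; each cpu owns a disjoint slice of them, so in the common
 * case two cpus never touch the same queue header.  Within its slice
 * a cpu prefers queues backed by its own NUMA domain.
 */
#include <stdbool.h>

#define NQUEUES         1024            /* hypothetical fan-out */
#define NCPUS_DEMO      32              /* hypothetical cpu count */
#define QUEUES_PER_CPU  (NQUEUES / NCPUS_DEMO)

static int queue_domain[NQUEUES];       /* NUMA domain backing each queue */
static int queue_depth[NQUEUES];        /* free pages left in each queue */

static bool
queue_empty(int q)
{
        return (queue_depth[q] == 0);
}

static int
pick_free_queue(int mycpu, int mydomain)
{
        int base = mycpu * QUEUES_PER_CPU;      /* this cpu's private slice */
        int i, q;

        /* First pass: queues in our slice backed by local-domain memory. */
        for (i = 0; i < QUEUES_PER_CPU; ++i) {
                q = base + i;
                if (queue_domain[q] == mydomain && !queue_empty(q))
                        return (q);
        }

        /* Second pass: any non-empty queue in our slice (remote domain). */
        for (i = 0; i < QUEUES_PER_CPU; ++i) {
                q = base + i;
                if (!queue_empty(q))
                        return (q);
        }

        /*
         * Out of pages in our slice; a real allocator would now dip
         * into other cpus' slices, which is where cache-line bounces
         * can reappear.
         */
        return (-1);
}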

It should be noted that NUMA awareness generally only gives minor
improvements as CPU caches absorb a lot of the overhead already.  That
said, being NUMA-aware offloads the QPI links between cpu sockets and there
are certainly going to be machine workloads that can take advantage of
that.  In my tests the IPC improvements were there, but not really all that
significant in real workloads.  Local vs Remote (inter-socket) memory
accesses went from 50:50 to around 90:10 with the changes for a
concurrent-build workload.

The result of all of this is that the dual-Xeon box (16-core/32-thread)
I've been running tests on now has an upper bound of around 13M VM
faults/sec for existing pages, and up to 6M zero-fill faults/sec.  It can
exec distinct static executables at a rate of 135,000 execs/sec, and can
allocate and retire new memory at a rate of around 25 GBytes/sec.  Distinct
path lookup performance using stat() is now on the order of 20M operations
per second (single element, scaling down linearly for multiple distinct
elements) and over 1M paths per second (two elements with one shared, which
tells you something about the effect cache-line bounces have on an
unavoidable shared lock).  For these focused tests the improvement has been
massive, in some cases well over 30%.
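
For anyone curious, the flavor of that last test is roughly the following
(a hedged sketch, not the actual benchmark program; the thread count and
file names are invented): a pile of threads calling stat() in a loop,
either each on its own single-element path, or on two-element paths that
all pass through one shared directory whose lock becomes the cache line
everybody bounces on:

/*
 * Illustrative micro-benchmark sketch, not the actual test program.
 * Build with: cc -O2 -pthread stat_bench.c
 * Run with no argument for distinct single-element paths ("f0", "f1",
 * ...), or with any argument for two-element paths sharing a common
 * first element ("shared/f0", "shared/f1", ...).  The files must
 * already exist under the current directory.
 */
#include <sys/stat.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS        32              /* hypothetical thread count */
#define ITERATIONS      1000000

static int use_shared_element;

static void *
worker(void *arg)
{
        long id = (long)arg;
        char path[64];
        struct stat st;
        long i;

        if (use_shared_element)
                snprintf(path, sizeof(path), "shared/f%ld", id);
        else
                snprintf(path, sizeof(path), "f%ld", id);

        for (i = 0; i < ITERATIONS; ++i)
                (void)stat(path, &st);
        return (NULL);
}

int
main(int argc, char **argv)
{
        pthread_t td[NTHREADS];
        long i;

        use_shared_element = (argc > 1);
        for (i = 0; i < NTHREADS; ++i)
                pthread_create(&td[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; ++i)
                pthread_join(td[i], NULL);

        /* divide NTHREADS * ITERATIONS by wall time to get lookups/sec */
        return (0);
}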

It looks like heavily concurrent network performance has also improved.
Sepherosa is working on a paper in that regard; it was already very good
before, but now it's even better :-).  Now that we have NUMA awareness we
are starting to look at hardware locality to socket and memory for
drivers.  Network drivers will reap the benefits first, most likely.

In terms of real-world workloads this amounts to somewhere around a 5%
improvement relative to October or so (~5 months worth of work) for the
high-concurrency synth test.  The reason for this is that workloads of
course have a very large user-time component and the improvements were
mostly in system time.  For example, during a typical synth build the
whole system averages around 80% user and 20% system, so even a big
reduction in system overhead does not necessarily move the needle much for
the real-world workload.  In contrast, the vkernel, which is far more
heavily dependent on the host kernel system overhead, reaped more
significant workload improvements.  I didn't run conclusive tests, but as
the number of vkernel cpus rose the improvement scaled non-linearly,
reaching well over 40% in some tests.
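
As a back-of-the-envelope check on that 5% figure, using only the 80/20
split mentioned above and an assumed 25% reduction in system overhead (the
25% is my illustrative assumption, not a measured number):

#include <stdio.h>

int
main(void)
{
        double user = 0.80;             /* ~80% user time during a synth build */
        double sys  = 0.20;             /* ~20% system time */
        double new_sys = sys * 0.75;    /* assume system overhead cut by 25% */
        double speedup = (user + sys) / (user + new_sys);

        /* prints roughly: 25% less system time -> 5.3% faster overall */
        printf("25%% less system time -> %.1f%% faster overall\n",
            (speedup - 1.0) * 100.0);
        return (0);
}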

-Matt