Performance work on master, state of master w/SMP changes

Mon Jan 9 17:38:54 PST 2017

At this moment master should be stable.  Some instability developed over
the last two weeks but should now be fixed.  However, a lot of work has
gone in recently so people using master should be sure to have a working
older kernel stuffed away in /boot/kernel.bak.

The work of the last few weeks focused on four things: (1) avoiding
unnecessary global page table invalidations, (2) fixing the remaining SMP
conflicts that could be easily avoided, (3) reducing or eliminating global
memory ping-ponging between cpu sockets, and (4) implementing a degree of
NUMA awareness in the kernel.

(1) The main source of global page table invalidations is the kernel
buffer cache.  The cycling of buffers through a full buffer cache requires
tearing down page mappings and instantiating new mappings.  Instead of
trying to implement unmapped buffers or partial on-the-fly invalidations
(which can make debugging problems difficult to impossible), the approach I
took in DragonFly was to actually repurpose the underlying VM pages when
I/O bandwidth exceeded a certain value, usually a gigabyte per second.

This killed two birds with one stone.  The repurposing meant that the
related buffer cache buffer could avoid remapping pages entirely in many
cases, and reusing pages in situations with extreme I/O bandwidths
de-stressed the VM paging system which would otherwise have to cycle all of
those pages through the active, inactive, cache, and free queues.

These changes greatly reduced global IPI rates, reduced synchronous stalls
across all cpus, and improved filesystem performance on systems with
extremely high-bandwidth I/O paths (such as NVMe).

(2) Remaining SMP conflicts are easily measured using systat -pv 1.  The
main conflicts we had left were mostly confined to the VNODE management
subsystem and the VM PAGE subsystem.  The vnode management subsystem was
largely fixed by not ping-ponging a vnode between the active and inactive
vnode states.  Instead, the vnode becomes passively inactive but remains on
the active queue in the critical path, and is cleaned up later.

SMP conflicts in the VM page subsystem were multi-fold due to various
structures not being cache-line aware, the use of a pool of spinlocks for
vm_page_t locking, and due to the lack of NUMA awareness.  I got rid of the
pool and made the structures more cache aware.  I cover the rest in (4).
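
To illustrate why a lock pool causes conflicts between unrelated pages,
here is a minimal userland sketch.  It is not the DragonFly code: the pool
size, the modulo hash, and every name in it (LOCK_POOL_SIZE,
pool_lock_slot, page_with_lock, and so on) are assumptions made up for the
example.  It only shows that pooled locks can be shared by two different
pages, while a lock stored in the page itself cannot.

```c
#include <assert.h>
#include <stddef.h>

#define LOCK_POOL_SIZE 8    /* hypothetical pool size, kept tiny on purpose */

/* Pooled scheme: a page borrows whichever pool lock its index hashes to. */
static size_t
pool_lock_slot(size_t page_index)
{
    return page_index % LOCK_POOL_SIZE;
}

/* Two different pages contend whenever they hash to the same slot. */
static int
pool_pages_share_lock(size_t a, size_t b)
{
    assert(a != b);
    return pool_lock_slot(a) == pool_lock_slot(b);
}

/* Per-page scheme: the lock lives in the page structure itself. */
struct page_with_lock {
    int lock;               /* stand-in for a real spinlock */
    int flags;
};

/* Distinct pages have distinct lock addresses, so this is 0 for a != b. */
static int
embedded_pages_share_lock(const struct page_with_lock *a,
                          const struct page_with_lock *b)
{
    return &a->lock == &b->lock;
}
```

With this toy hash, pages 3 and 11 land on the same pool lock even though
they have nothing to do with each other, which is the kind of collision
that shows up when many cpus fault on different pages at once.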

(3) Global memory ping-ponging occurs in lockless operations, atomic or
otherwise.  For example, if a system call increments a global counter with
a simple ++FubarCounter on a multi-socket system, the cpu hardware must
bounce the cache line containing FubarCounter between all the cores,
creating stall conditions and high latencies which constrict best-case
performance.  Latencies as high as a microsecond or two can occur and we
lose some scaling as the number of cores increases.

Global memory ping-ponging is easy to detect using PC sampling, which
'systat -pv 1' in a wide window run as root will do automatically.  These
memory stalls are significant enough that the PC sampling interrupt will
more often than not catch the exact instructions causing the problem.
Using this method I was able to track down a *HUGE* number of statistics
counters, tracking variables, and other misc globals and generally either
make them per-cpu or flat-out remove them.  There are now no longer *any*
global memory ping-pongs in the non-shared VM code path.
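
As a rough sketch of what "make them per-cpu" means for a counter like
FubarCounter, the fragment below gives each cpu its own slot on its own
cache line and only sums the slots when the total is wanted.  This is a
userland illustration, not the kernel code: the 64-byte line size, NCPUS,
and the fubar_* names are assumptions, and a real kernel would get the cpu
index from its own per-cpu data rather than take it as an argument.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define NCPUS      32   /* hypothetical cpu count */

/* One slot per cpu; the alignment pads each slot to a full cache line. */
struct percpu_counter {
    _Alignas(CACHE_LINE) uint64_t count;
};

static struct percpu_counter fubar_counter[NCPUS];

/* Hot path: only the calling cpu's cache line is written. */
static void
fubar_count(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    ++fubar_counter[cpu].count;
}

/* Cold path: walk all slots when someone actually asks for the total. */
static uint64_t
fubar_total(void)
{
    uint64_t total = 0;

    for (int i = 0; i < NCPUS; ++i)
        total += fubar_counter[i].count;
    return total;
}
```

The trade-off is that reading the total becomes more expensive and only
approximately current while other cpus are still counting, which is
usually fine for statistics.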

As an example of what we get from this, the dual Xeon system was topping
out at 1.5-2M zero-fill page faults a second running a test program on all
32 threads before the changes.  After the changes (and all the other work),
the same system is now pushing 5.6 MILLION zero-fill page faults/sec across
32 threads.  Our four-socket Opteron system also saw major improvements and
can now achieve something like 4.7M zero-fill page faults a second across
the 48 cores.
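
For readers who want to try something similar, the core of a zero-fill
fault test can be sketched as below.  This is not the test program used
for the numbers above; it is a single-threaded illustration with a made-up
function name (touch_zero_fill), and a real benchmark would run one such
loop per thread and time it.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc; harmless elsewhere */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map npages of anonymous memory and write one byte into each page.
 * The first write to a fresh anonymous page normally takes a zero-fill
 * fault.  Returns the number of pages touched, or 0 if the mapping fails.
 */
static size_t
touch_zero_fill(size_t npages)
{
    long pgsize = sysconf(_SC_PAGESIZE);
    assert(pgsize > 0);

    size_t len = npages * (size_t)pgsize;
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    if (map == MAP_FAILED)
        return 0;

    volatile char *base = map;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += (size_t)pgsize) {
        base[off] = 1;      /* one byte per page is enough to fault it in */
        ++touched;
    }
    munmap(map, len);
    return touched;
}
```

Unmapping and remapping the region between passes keeps every pass on
fresh pages, so each touch faults again.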

The dual-Xeon system running a fault-only test, non-zero-fill (accessing
one byte per page to force the fault from a distinct file mapping per
thread), on 32 threads, is now able to achieve an aggregate of 17M faults
per second, an almost unheard of number.  And the SMP collision column is
completely blank... zero collisions during the test.

(4) NUMA awareness is relatively difficult to make work in a kernel because
user processes need to be able to shift around between cpu cores in order
for the scheduler to efficiently maximize performance.  However there is a
major advantage to having a NUMA-aware kernel allocator for short-run
programs, such as we find happening during a bulk build.  The NUMA
awareness also greatly reduces SMP conflicts in the VM page queues
(DragonFly uses the page-coloring indices as part of its NUMA
calculation).  It is possible for the kernel memory allocator running on
each individual thread to be 100% non-conflicting with any other thread,
not even sharing lock structures (let alone having lock collisions).

Adding NUMA awareness to the system thus reduces memory stalls in the VM
system, improving best-case performance further.

It should be noted that for most normal (non-specialized) workloads, such
as generic services running on the system, even if they are saturating
available cpu resources, NUMA doesn't actually add a whole lot.  The
IPC will not improve by much on a Xeon because the L2 and L3 caches are big
and work very well.  But in situations where a machine's memory is being
stressed to the limit, NUMA can help.  And it can also help for PCIe
accesses, which is something we will be wanting to address at some point.

Intel's PCM utilities (pkg install intel-pcm; kldload cpuctl), such as
pcm.x, make it possible to measure cross-socket activity.  Prior to the NUMA
code going in, the system ran at approximately 1:1 local to remote.  Half
the memory accesses were local, half were remote (inter-socket).  After the
NUMA changes the ratio increased to between 5:1 and 10:1 on the Xeon system
during a synth bulk build.  So some significant improvement there, albeit
perhaps not as big an improvement in actual performance as one might expect.

--

In the grand scheme of things each of these optimizations provides a small
incremental improvement.  A full synth run on 30-November yielded:

30-Nov-2016
    Initial queue size: 23476
        packages built: 23334
               ignored: 8
               skipped: 111
                failed: 23
    Duration: 22:28:12

A similar synth run that just finished today yielded:

09-Jan-2017
    Initial queue size: 24701
        packages built: 24385
               ignored: 78
               skipped: 226
                failed: 12
    Duration: 20:44:20

And it should be noted the new synth run not only ran in 7% less time, it
also built over 1000 more packages than the first.  I can't calculate
an exact percentage improvement in performance, but my guess is the work
since 30 November improved the bulk build times by over 11% on a
per-package basis.  That's pretty significant considering how well the
system was optimized for SMP and multi-socket from work earlier in the
year, before this most recent round.
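
The rough arithmetic behind those two percentages can be reproduced with
the short sketch below.  It assumes "per-package" means wall-clock seconds
divided by the "packages built" count from each run, which is only one
reasonable reading since the two runs did not build the same set of
packages.  The function names are made up for the example.

```c
#include <assert.h>

/* Convert an H:MM:SS duration to seconds. */
static long
to_seconds(long h, long m, long s)
{
    assert(h >= 0 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return h * 3600 + m * 60 + s;
}

/* Percentage by which 'after' is smaller than 'before'. */
static double
percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Wall-clock reduction between the two synth runs, in percent. */
static double
wall_clock_gain(void)
{
    return percent_reduction((double)to_seconds(22, 28, 12),
                             (double)to_seconds(20, 44, 20));
}

/* Reduction in seconds per built package, in percent. */
static double
per_package_gain(void)
{
    double before = (double)to_seconds(22, 28, 12) / 23334;
    double after = (double)to_seconds(20, 44, 20) / 24385;

    return percent_reduction(before, after);
}
```

Under these assumptions the wall-clock reduction comes out near 7.7% and
the per-package reduction near 11.7%, in line with the figures quoted
above.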

-Matt