Performance results / VM related SMP locking work - prelim VM patch

Fri Oct 7 11:02:11 PDT 2011

    Justin suggested I post this on users at .  I've been working on VM system
    concurrency in the kernel and have a patch set in development.  This
    patch is not stable enough yet for commit so it isn't even in master
    yet (and certainly won't make it into the upcoming release), but it's
    progressed far enough that I am getting some good performance numbers
    out of it which I would like to share.

    Test Machine specs:

    monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
	     48-cores, 64G ram, running 64-bit DFly 2.11 master

    test29:  CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
	     4-cores, 8G ram, running 64-bit DFly 2.11 master

    Monster is a 48-core opteron (4 cpu sockets).  Test29 is a quad-core
    Phenom II 820 (Deneb).  On a per-core basis Test29 is about 1.47x
    faster, and of course it is significantly faster dealing with contention
    since it is a single-chip cpu vs the 4-socket monster.  The many-core
    monster is a very good environment for contention testing.

    --

    The tests below do not test paging to swap.  There is plenty of memory
    to cache the source trees, object trees, and the worst-case run-time
    memory footprint.  These are strictly kernel/cpu contention tests
    running in a heavily fork/exec'd environment (aka buildworld -j N).

    Because even parallel (-j) buildworlds have a lot of bottlenecks there
    just isn't a whole lot of difference between -j 40 and -j 8 on monster.
    Usually the CC's run in parallel but then it bottlenecks at the LD line
    (as well as waiting for the 'last' cc to finish, which happens a lot
    when the buildworld is working on GCC).  The slower opteron cores
    become very obvious during these bottleneck moments.  Only the libraries
    like libc or a kernel NO_MODULES build has enough source files to
    actually fan-out to all available cpus.

    That said, buildworlds exercise more edge cases in the kernel than
    a kernel NO_MODULES build, so I prefer using buildworlds for general
    testing.  I have some -j 4 tests on test29 and monster for a buildkernel
    at the end.

				    RESULTS

    To better utilize available cores on monster the main VM contention
    test runs FOUR buildworld -j 40's in parallel instead of one.

    The realtime numbers are what matter here (the 'real' column).
    Note that the 4x numbers are taken from just one of the builds,
    but all four finish at the same time.

monster buildworld -j 40 timings 1x prepatch:
     2975.43 real      4409.48 user     12971.42 sys
     2999.20 real      4423.16 user     13014.44 sys
monster buildworld -j 40 timings 1x postpatch: 		(13.7% improvement)
     2587.42 real      4328.87 user      8551.91 sys

monster buildworld -j 40 timings 4x prepatch:
     8302.17 real      4629.97 user     17617.84 sys
     8308.01 real      4716.70 user     22330.26 sys
monster buildworld -j 40 timings 4x postpatch: 		(30.2% improvement)
     5799.53 real      5254.76 user     23651.73 sys
     5800.49 real      5314.23 user     23499.59 sys

test29 (quad cpu) buildworld -j 8 timings 1x prepatch:
     1964.60 real      3004.07 user      1388.79 sys
     1963.29 real      3002.82 user      1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch:	(9.87% improvement)
     1768.93 real      2864.34 user      1212.24 sys
     1771.11 real      2875.10 user      1203.29 sys

    * Note that comparing test29 to monster isn't useful except as a
      self-check.  test29 has a single 4xcore cpu chip that runs 1.47x
      faster than monster on a per-core basis.

    * The reduced contention on monster is very apparent in the 1x 'sys'
      numbers.  It is a bit unclear why the 4x 'sys' numbers don't reflect
      the much-improved real times.

    * Generally speaking as we continue to reduce contention, the
      4x buildworld -j 40 times should approach the 1x buildworld -j 40
      times on monster.  The 30% improvement is very significant but
      clearly we can do even better.  The vm_page allocation and freeing
      paths are probably responsible for a lot of the remaining contention.
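
    The improvement figures above can be re-derived from the 'real'
    column with a short script.  This is only a sketch: it assumes each
    percentage is the reduction in mean wall-clock time relative to the
    pre-patch mean, which is not stated in the results themselves.  The
    dictionary layout and function name are made up for illustration.

```python
from statistics import mean

# Wall-clock ('real') seconds copied from the results above.
REAL_TIMES = {
    "monster 1x -j 40": {"pre": [2975.43, 2999.20], "post": [2587.42]},
    "monster 4x -j 40": {"pre": [8302.17, 8308.01], "post": [5799.53, 5800.49]},
    "test29 1x -j 8": {"pre": [1964.60, 1963.29], "post": [1768.93, 1771.11]},
}


def improvement_pct(pre, post):
    """Percent reduction in mean time, relative to the pre-patch mean.

    Assumption: runs are averaged before comparing.  Picking a single
    run as the baseline instead will shift the result slightly.
    """
    before = mean(pre)
    after = mean(post)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, runs in REAL_TIMES.items():
        pct = improvement_pct(runs["pre"], runs["post"])
        print(f"{name}: {pct:.2f}% less wall-clock time")
```

    Under the averaging assumption the 4x and test29 cases come out at
    roughly 30.2% and 9.87%.  The 1x monster case comes out nearer 13.4%
    with averaging and about 13.7% if only the slower pre-patch run is
    used as the baseline, so small differences of that size depend on
    the baseline chosen.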


		    BUILDKERNEL NO_MODULES=YES TESTS

    This set of tests is using a buildkernel without modules, which has
    much greater compiler concurrency versus a buildworld test since
    the make can keep N gcc's running most of the time.

	137.95 real       277.44 user       155.28 sys	monster -j4 (prepatch)
	143.44 real       276.47 user       126.79 sys	monster -j4 (patch)
	 89.61 real       196.30 user        59.04 sys	test29 -j4 (patch)

	167.62 real       360.44 user      4148.45 sys	monster -j48 (prepatch)
	110.26 real       362.93 user      1281.41 sys	monster -j48 (patch)
	 96.37 real       209.52 user        63.77 sys	test29 -j48 (patch)

    * The -j 4 builds don't have much contention before or after the patch.
      Monster was actually slightly faster pre-patch (though I'm not being
      scientific here, it's within the margin of error on my part).

    * Keep in mind that test29 is a quad-core, so its parallelism is
      really only 4.  I use a -j 48 build on test29 to approximate other
      non-SMP related overheads for comparison to monster's -j 48 build
      in the last two results.

      Of course, -j 48 on monster is a different story entirely.  That will
      use all 48 cores in this test.

    * test29 is 1.5x faster than monster, hence the 4-core limited results
      make sense (89.61 vs 143.44 seconds, which is 1.60x).

    * The monster -j 48 kernel build without modules has better compiler
      concurrency vs a buildworld.  So the final two lines show how
      the contention affects the build.

      Monster was able to reduce build times from 143 to 110 seconds with
      -j 48 but as you can see the system time ballooned massively due
      to contention that is still present.

      Monster -j 48 pre-patch vs post-patch shows how well contention was
      reduced in the patch.  167 seconds vs 110 seconds, a 34.1% improvement!
      System time was reduced from 4148 seconds to 1281 seconds.
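
    The ratios quoted in the notes above can be checked the same way.
    A minimal sketch, assuming the per-core speed figure is simply the
    ratio of the two clock speeds from the machine specs (that derivation
    is an assumption, as are the helper names):

```python
# Clock speeds (MHz) from the machine specs and 'real'/'sys' seconds
# from the buildkernel table above.
MONSTER_MHZ = 1900.01
TEST29_MHZ = 2799.91


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


def pct_reduction(before, after):
    """Percent reduction going from 'before' to 'after'."""
    return 100.0 * (before - after) / before


# Assumption: per-core speed scales with clock rate on these two K10-era parts.
clock_ratio = TEST29_MHZ / MONSTER_MHZ

# monster -j4 (patch) vs test29 -j4 (patch), both limited to 4 compilers.
j4_ratio = speedup(143.44, 89.61)

# monster -j48, prepatch vs patch.
j48_real_gain = pct_reduction(167.62, 110.26)
j48_sys_gain = pct_reduction(4148.45, 1281.41)

if __name__ == "__main__":
    print(f"clock ratio test29/monster: {clock_ratio:.2f}x")
    print(f"-j4 real-time ratio:        {j4_ratio:.2f}x")
    print(f"-j48 real-time reduction:   {j48_real_gain:.1f}%")
    print(f"-j48 sys-time reduction:    {j48_sys_gain:.1f}%")
```

    With the unrounded times the -j 48 real-time reduction works out to
    about 34.2% (34.1% when using the rounded 167 and 110), and system
    time drops by roughly 69%.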

    The interesting thing to note here is the 1281 seconds of system time
    the 48-core 48-process compiler concurrency test ate.  This clearly
    shows what contention still remains.  From the ps output (not shown)
    it's still mostly associated with the vm_token (probably the vm_page
    allocation and freeing path) and vmobj_token (probably the vm_fault
    path through the vm_map and vm_object chain).  I'll focus on these
    later on once I've stabilized what I have already.

    Even a -j N kernel build with NO_MODULES=TRUE has two major bottlenecks:
    the make depend and the link line at the end, which together account for
    (off the cuff) somewhere around 45 seconds of serialized single-core
    cpu on monster.

    So even in the ideal case monster probably couldn't do this build in
    less than about 55 seconds or so.
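
    A rough Amdahl-style calculation lands in the same ballpark.  This is
    a sketch under assumptions that are not part of the measurements: it
    takes the low-contention monster -j4 (patch) run as the total cpu work
    for the build (user + sys), treats the off-the-cuff 45 seconds as
    purely serial, and assumes the rest splits perfectly across 48 cores
    with zero contention.

```python
def ideal_build_time(total_cpu, serial, ncpus):
    """Lower bound on wall-clock time for a partly serial build.

    The serial portion runs on one core; the remaining cpu work is
    assumed to divide evenly over ncpus with no contention at all.
    """
    parallel = total_cpu - serial
    return serial + parallel / ncpus


# Assumption: total cpu work ~= user + sys of the monster -j4 (patch) run.
TOTAL_CPU = 276.47 + 126.79
SERIAL = 45.0   # make depend + final link, the off-the-cuff estimate above
NCPUS = 48

if __name__ == "__main__":
    bound = ideal_build_time(TOTAL_CPU, SERIAL, NCPUS)
    print(f"idealized lower bound: {bound:.1f} seconds")
```

    The idealized bound comes out a little over 52 seconds.  Real builds
    never parallelize perfectly, so something around 55 seconds is a
    plausible practical floor.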

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>