Performance results / VM related SMP locking work - committed

Tue Oct 18 15:56:28 PDT 2011

    Ok, here's an update on the performance improvements.  The patch
    has now been committed to master and we are debating the 2.12
    status for it.

    I have repeated the last posting's results and added another row for
    tests run with the committed patch, which improved things even more.

    The 48-core box actually feels like a 48-core box now.

    Test Machine specs:

    monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
	     48-cores, 64G ram, running 64-bit DFly 2.11 master

    test29:  CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
	     4-cores, 8G ram, running 64-bit DFly 2.11 master

    Monster is a 48-core Opteron (4 cpu sockets).  Test29 is a quad-core
    Phenom II 820 (Deneb).  On a per-core basis Test29 is about 1.47x
    faster, and of course it is significantly faster dealing with contention
    since it is a single-chip cpu vs the 4-socket monster.  The many-core
    monster is a very good environment for contention testing.
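
    The 1.47x figure matches the ratio of the two reported clock rates.
    A minimal Python sketch of that arithmetic follows.  It assumes the
    estimate is purely clock-based (the post does not say how it was
    derived) and it ignores any per-clock differences between the chips.

```python
# Clock rates in MHz, copied from the machine specs above.
MONSTER_MHZ = 1900.01   # Opteron 6168
TEST29_MHZ = 2799.91    # Phenom II X4 820

def per_core_ratio(fast_mhz, slow_mhz):
    """Ratio of clock rates; assumes equal work per clock (an assumption)."""
    return fast_mhz / slow_mhz

ratio = per_core_ratio(TEST29_MHZ, MONSTER_MHZ)
print(f"test29/monster per-core ratio: {ratio:.2f}x")
```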

    --

    The tests below do not test paging to swap.  There is plenty of memory
    to cache the source trees, object trees, and the worst-case run-time
    memory footprint.  These are strictly kernel/cpu contention tests
    running in a heavily fork/exec'd environment (aka buildworld -j N).

    Because even parallel (-j) buildworlds have a lot of bottlenecks there
    just isn't a whole lot of difference between -j 40 and -j 8 on monster.
    Usually the CC's run in parallel but then it bottlenecks at the LD line
    (as well as waiting for the 'last' cc to finish, which happens a lot
    when the buildworld is working on GCC).  The slower Opteron cores
    become very obvious during these bottleneck moments.  Only libraries
    like libc or a kernel NO_MODULES build have enough source files to
    actually fan-out to all available cpus.

    That said, buildworlds exercise more edge cases in the kernel than
    a kernel NO_MODULES build, so I prefer using buildworlds for general
    testing.  I have some -j 4 tests on test29 and monster for a buildkernel
    at the end.

				    RESULTS

    To better utilize the available cores on monster, the main VM contention
    test runs FOUR buildworld -j 40's in parallel instead of one.

    The realtime numbers are what matter here (the 'real' column).
    Note that the 4x numbers are taken from just one of the builds,
    but all four finish at the same time.

monster buildworld -j 40 timings 1x prepatch: 
     2975.43 real      4409.48 user     12971.42 sys
     2999.20 real      4423.16 user     13014.44 sys
monster buildworld -j 40 timings 1x postpatch: 		(13.3% improvement)
     2587.42 real      4328.87 user      8551.91 sys
monster buildworld -j 40 timings 1x COMMIT: 		(14.1% improvement) <<<
     2577.46 real      4125.42 user     13079.62 sys
     2552.94 real      4087.60 user     13085.19 sys

monster buildworld -j 40 timings 4x prepatch:
     8302.17 real      4629.97 user     17617.84 sys
     8308.01 real      4716.70 user     22330.26 sys
monster buildworld -j 40 timings 4x postpatch: 		(30.2% improvement)
     5799.53 real      5254.76 user     23651.73 sys
     5800.49 real      5314.23 user     23499.59 sys
monster buildworld -j 40 timings 4x COMMIT: 		(49.3% improvement) <<<
     4207.85 real      4869.90 user     20673.71 sys
     4248.45 real      4899.08 user     21697.11 sys


test29 (quad cpu) buildworld -j 8 timings 1x prepatch:
     1964.60 real      3004.07 user      1388.79 sys
     1963.29 real      3002.82 user      1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch:	(9.87% improvement)
     1768.93 real      2864.34 user      1212.24 sys
     1771.11 real      2875.10 user      1203.29 sys
test29 (quad cpu) buildworld -j 8 timings 1x COMMIT:	(11.9% improvement) <<<
     1731.45 real      2749.91 user      1106.81 sys
     1729.24 real      2756.48 user      1100.90 sys
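
    The improvement percentages can be reproduced from the 'real' column.
    A minimal Python sketch follows, assuming each percentage is the
    reduction in mean wall-clock time relative to the prepatch runs.
    Averaging the runs is an assumption, since the post does not state
    exactly how the figures were computed, so the results can differ from
    the quoted numbers by a few tenths of a percent.

```python
from statistics import mean

def improvement(baseline_runs, new_runs):
    """Percent reduction in mean wall-clock time relative to the baseline."""
    base = mean(baseline_runs)
    return 100.0 * (base - mean(new_runs)) / base

# 'real' seconds copied from the tables above: (prepatch, postpatch, commit).
results = {
    "monster 1x": ([2975.43, 2999.20], [2587.42], [2577.46, 2552.94]),
    "monster 4x": ([8302.17, 8308.01], [5799.53, 5800.49], [4207.85, 4248.45]),
    "test29 1x":  ([1964.60, 1963.29], [1768.93, 1771.11], [1731.45, 1729.24]),
}

for name, (pre, post, commit) in results.items():
    print(f"{name}: postpatch {improvement(pre, post):.1f}%, "
          f"commit {improvement(pre, commit):.1f}%")
```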

		    BUILDKERNEL NO_MODULES=YES TESTS

    This set of tests uses a buildkernel without modules, which has
    much greater compiler concurrency versus a buildworld test since
    the make can keep N gcc's running most of the time.

     137.95 real       277.44 user       155.28 sys  monster -j4 (prepatch)
     143.44 real       276.47 user       126.79 sys  monster -j4 (patch)
     122.24 real       281.13 user        97.74 sys  monster -j4 (commit)   <<<

      89.61 real       196.30 user        59.04 sys  test29 -j4 (patch)
      86.55 real       195.14 user        49.52 sys  test29 -j4 (commit)    <<<


     167.62 real       360.44 user      4148.45 sys  monster -j48 (prepatch)
     110.26 real       362.93 user      1281.41 sys  monster -j48 (patch)
     101.68 real       380.67 user      1864.92 sys  monster -j48 (commit)  <<<

      96.37 real       209.52 user        63.77 sys  test29 -j48 (patch)
      85.72 real       196.93 user        52.08 sys  test29 -j48 (commit)   <<<


    For the kernel build, an 11.4% improvement at -j4 on monster (only
    utilizing 4 of the 48 cores, well, at least as far as make -j goes).

    For the kernel build, a 39.4% improvement at -j48 on monster, utilizing
    all 48 cores.
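
    The monster -j48 rows can also be read in terms of where the cpu time
    went.  The Python sketch below is one hedged way to summarize them:
    it divides user+sys by real to estimate how many cpus were busy on
    average, and takes sys as a fraction of total cpu time.  These derived
    metrics are an illustration and are not part of the original
    measurements.

```python
# (real, user, sys) seconds for the monster -j48 buildkernel rows above.
runs = {
    "prepatch": (167.62, 360.44, 4148.45),
    "patch":    (110.26, 362.93, 1281.41),
    "commit":   (101.68, 380.67, 1864.92),
}

def cpu_summary(real, user, sys_time):
    """Return (average busy cpus, fraction of cpu time spent in the kernel)."""
    cpu = user + sys_time
    return cpu / real, sys_time / cpu

for name, (real, user, sys_time) in runs.items():
    busy, sys_share = cpu_summary(real, user, sys_time)
    print(f"{name:9s} {busy:5.1f} cpus busy, {sys_share:.0%} in sys")
```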

    On the test29 quad-core the numbers weren't expected to improve
    a whole lot, and they didn't, because single-chip multi-core spin
    locks are very, very fast.  Surprisingly, though, the -j48 build
    improved by quite a bit, around 11%.

    The real improvements are on systems with more cores.  Monster, with
    48 cores, made for a very good test case.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>