Performance results / VM related SMP locking work - committed (3)

Fri Oct 28 16:33:38 PDT 2011

    Another huge performance improvement for many-core systems.  I removed
    the last bottleneck spinlock in the VM system.  This spinlock was only
    locking the PQ_INACTIVE vm_page_queue for a very short period of time,
    but with 48 cores it was enough to limit the VM fault rate.  With the
    fix, concurrent compiles go much, MUCH faster, a major improvement on
    top of the major improvement the prior commits already delivered.

    Test Machine specs:

    monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
	     48-cores, 64G ram, running 64-bit DFly 1.13 master

    test29:  CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
	     4-cores, 8G ram, running 64-bit DFly 1.13 master

    Monster is a 48-core Opteron (4 cpu sockets).  Test29 is a quad-core
    Phenom II 820 (Deneb).  On a per-core basis Test29 is about 1.47x
    faster, and of course it is significantly faster dealing with contention
    since it is a single-chip cpu vs the 4-socket monster.  The many-core
    monster is a very good environment for contention testing.
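
    As a rough sanity check on the 1.47x per-core figure: it matches the
    ratio of the two clock rates listed in the specs above.  Reading the
    figure as a pure clock ratio is an assumption here, not something
    stated in the measurements, and the variable names are illustrative.

```python
# Clock rates in MHz, copied from the "Test Machine specs" lines above.
monster_mhz = 1900.01   # AMD Opteron 6168
test29_mhz = 2799.91    # AMD Phenom II X4 820

# Assumption: "per-core basis" is taken to mean the raw clock ratio.
ratio = test29_mhz / monster_mhz
print(f"test29 / monster clock ratio: {ratio:.2f}x")
```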

    --

    The tests below do not test paging to swap.  There is plenty of memory
    to cache the source trees, object trees, and the worst-case run-time
    memory footprint.  These are strictly kernel/cpu contention tests
    running in a heavily fork/exec'd environment (aka buildworld -j N).

    Because even parallel (-j) buildworlds have a lot of bottlenecks there
    just isn't a whole lot of difference between -j 40 and -j 8 on monster.
    Usually the CC's run in parallel but then it bottlenecks at the LD line
    (as well as waiting for the 'last' cc to finish, which happens a lot
    when the buildworld is working on GCC).  The slower opteron cores
    become very obvious during these bottleneck moments.  Only libraries
    like libc, or a kernel NO_MODULES build, have enough source files to
    actually fan out to all available cpus.

    That said, buildworlds exercise more edge cases in the kernel than
    a kernel NO_MODULES build, so I prefer using buildworlds for general
    testing.  I have some -j 4 tests on test29 and monster for a buildkernel
    at the end.

				    RESULTS

    To better utilize available cores on monster the main VM contention
    test runs FOUR buildworld -j 40's in parallel instead of one.

    The realtime numbers are what matter here (the 'real' column).
    Note that the 4x numbers are taken from just one of the builds,
    but all four finish at the same time.

monster buildworld -j 40 timings 1x prepatch: 		(BASELINE)
     2975.43 real      4409.48 user     12971.42 sys
     2999.20 real      4423.16 user     13014.44 sys
monster buildworld -j 40 timings 1x postpatch: 		(+14.9% improvement)
     2587.42 real      4328.87 user      8551.91 sys
monster buildworld -j 40 timings 1x commit 1: 		(+15.4% improvement)
     2577.46 real      4125.42 user     13079.62 sys
     2552.94 real      4087.60 user     13085.19 sys
monster buildworld -j 40 timings 1x commit 3: 		(+43.8% improvement)<<<
     2068.67 real      4124.96 user      4227.38 sys
     2062.34 real      4139.10 user      4301.78 sys

monster buildworld -j 40 timings 4x prepatch:		(BASELINE)
     8302.17 real      4629.97 user     17617.84 sys
     8308.01 real      4716.70 user     22330.26 sys
monster buildworld -j 40 timings 4x postpatch: 		(+43.2% improvement)
     5799.53 real      5254.76 user     23651.73 sys
     5800.49 real      5314.23 user     23499.59 sys
monster buildworld -j 40 timings 4x commit 1: 		(+96.8% improvement)
     4207.85 real      4869.90 user     20673.71 sys
     4248.45 real      4899.08 user     21697.11 sys
monster buildworld -j 40 timings 4x commit 2:		(+107% improvement)
     3943.25 real      4630.76 user     21062.91 sys
monster buildworld -j 40 timings 4x commit 3:		(+229% improvement)<<<
     2518.78 real      4344.02 user      4674.45 sys

test29 (quad cpu) buildworld -j 8 timings 1x prepatch:	(BASELINE)
     1964.60 real      3004.07 user      1388.79 sys
     1963.29 real      3002.82 user      1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch:	(+11.07% improvement)
     1768.93 real      2864.34 user      1212.24 sys
     1771.11 real      2875.10 user      1203.29 sys
test29 (quad cpu) buildworld -j 8 timings 1x commit 1:	(+13.4% improvement)
     1731.45 real      2749.91 user      1106.81 sys
     1729.24 real      2756.48 user      1100.90 sys
test29 (quad cpu) buildworld -j 8 timings 1x commit 3:	(+7.4% improvement)
     1828.75 real      2737.53 user      1387.10 sys


    The results show a truly massive improvement in performance on our
    48-core machine.  A +229% improvement is well over 3x as fast.  The
    build time for four concurrent buildworlds (that is, all four finish
    at the same time, about 2500 seconds after all four were started) is
    only about 450 seconds slower than for one, meaning that we are
    getting very good concurrency now.
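
    For anyone wanting to re-derive the percentages, here is a minimal
    sketch.  It assumes the improvement is computed as
    (baseline real / new real - 1) using the first run of each pair; that
    formula is inferred from the numbers, and improvement_pct is a
    hypothetical helper name.  Under that assumption it reproduces the
    +43.8% 1x figure and lands within a point of the +229% 4x figure.

```python
# Wall-clock ('real') seconds copied from the monster buildworld tables
# above; the first run of each pair is used.
REAL = {
    "1x": {"prepatch": 2975.43, "commit 3": 2068.67},
    "4x": {"prepatch": 8302.17, "commit 3": 2518.78},
}

def improvement_pct(baseline, new):
    """Percent improvement, assumed to be (baseline / new - 1) * 100."""
    return (baseline / new - 1.0) * 100.0

for label, t in REAL.items():
    pct = improvement_pct(t["prepatch"], t["commit 3"])
    speedup = t["prepatch"] / t["commit 3"]
    print(f"{label}: +{pct:.1f}% ({speedup:.2f}x as fast)")

# Extra wall-clock cost of running four builds at once instead of one,
# both measured after commit 3.
extra = REAL["4x"]["commit 3"] - REAL["1x"]["commit 3"]
print(f"4x vs 1x after commit 3: {extra:.0f} seconds slower")
```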

		    BUILDKERNEL NO_MODULES=YES TESTS

    This set of tests uses a buildkernel without modules, which has
    much greater compiler concurrency versus a buildworld test since
    the make can keep N gcc's running most of the time.

     137.95 real       277.44 user       155.28 sys  monster -j4 (prepatch)
     143.44 real       276.47 user       126.79 sys  monster -j4 (patch)
     122.24 real       281.13 user        97.74 sys  monster -j4 (commit)
     127.16 real       274.20 user       108.37 sys  monster -j4 (commit 3)


      89.61 real       196.30 user        59.04 sys  test29 -j4 (patch)
      86.55 real       195.14 user        49.52 sys  test29 -j4 (commit)
      93.77 real       195.94 user        67.68 sys  test29 -j4 (commit 3)

     167.62 real       360.44 user      4148.45 sys  monster -j48 (prepatch)
     110.26 real       362.93 user      1281.41 sys  monster -j48 (patch)
     101.68 real       380.67 user      1864.92 sys  monster -j48 (commit 1)
      59.66 real       349.45 user       208.59 sys  monster -j48 (commit 3)<<<

      96.37 real       209.52 user        63.77 sys  test29 -j48 (patch)
      85.72 real       196.93 user        52.08 sys  test29 -j48 (commit 1)
      90.01 real       196.91 user        70.32 sys  test29 -j48 (commit 3)

    Kernel build results are as expected for the most part.  -j 48 build
    times on the many-core monster are GREATLY improved, from 101 seconds
    with commit 1 to 59.66 seconds with commit 3 (and down from 167 seconds
    before this work began).

    That's a +181% improvement over the prepatch time, almost 3x faster.

    The -j 4 build and the quad-core test29 build were not expected to show
    any improvement since there isn't really any spinlock contention with
    only 4 cores.  There was a slight nerf on test29 (the quad-core box) but
    that might be related to some of the lwkt_yield()s added and not so
    much the PQ_INACTIVE/PQ_ACTIVE vm_page_queues[] changes.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>