Performance results / VM related SMP locking work - committed
dillon at apollo.backplane.com
Tue Oct 18 15:56:28 PDT 2011
Ok, here's an update on the performance improvements. The patch
has now been committed to master and we are debating the 2.12
status for it.
I have repeated the last posting's results and added another row for
tests run with the committed patch, which improved things even more.
The 48-core box actually feels like a 48-core box now.
Test Machine specs:
monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
48-cores, 64G ram, running 64-bit DFly 2.11 master
test29: CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
4-cores, 8G ram, running 64-bit DFly 2.11 master
Monster is a 48-core Opteron (4 cpu sockets). Test29 is a quad-core
Phenom II 820 (Deneb). On a per-core basis Test29 is about 1.47x
faster, and of course it is significantly faster dealing with contention
since it is a single-chip cpu vs the 4-socket monster. The many-core
monster is a very good environment for contention testing.
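The 1.47x per-core figure is consistent with the ratio of the two
reported clock rates. A quick check, assuming per-core speed scales
with clock frequency alone (a simplification, since the two chips also
differ in cache and memory topology):

```python
# Clock rates in MHz, copied from the machine specs above.
MONSTER_MHZ = 1900.01   # Opteron 6168
TEST29_MHZ = 2799.91    # Phenom II X4 820


def clock_ratio(fast_mhz, slow_mhz):
    """Return how many times higher the first clock rate is than the second."""
    return fast_mhz / slow_mhz


ratio = clock_ratio(TEST29_MHZ, MONSTER_MHZ)
print(f"test29 / monster per-core clock ratio: {ratio:.2f}x")
```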
The tests below do not test paging to swap. There is plenty of memory
to cache the source trees, object trees, and the worst-case run-time
memory footprint. These are strictly kernel/cpu contention tests
running in a heavily fork/exec'd environment (aka buildworld -j N).
Because even parallel (-j) buildworlds have a lot of bottlenecks, there
just isn't a whole lot of difference between -j 40 and -j 8 on monster.
Usually the CC's run in parallel but then it bottlenecks at the LD line
(as well as waiting for the 'last' cc to finish, which happens a lot
when the buildworld is working on GCC). The slower opteron cores
become very obvious during these bottleneck moments. Only the libraries
like libc or a kernel NO_MODULES build have enough source files to
actually fan out to all available cpus.
That said, buildworlds exercise more edge cases in the kernel than
a kernel NO_MODULES build, so I prefer using buildworlds for general
testing. I have some -j 4 tests on test29 and monster for a buildkernel
at the end.
To better utilize the available cores on monster, the main VM contention
test runs FOUR buildworld -j 40's in parallel instead of one.
The realtime numbers are what matter here (the 'real' column).
Note that the 4x numbers are taken from just one of the builds,
but all four finish at the same time.
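A minimal sketch of how several copies of a build could be launched at
once and timed individually. This is not the harness used for the
numbers below; run_parallel and timed_run are hypothetical helper names,
and the demo command is a harmless placeholder. A real run would
substitute the build command and give each copy its own source and
object tree, details this post does not spell out.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(cmd):
    """Run cmd to completion; return (wall-clock seconds, exit status)."""
    start = time.monotonic()
    status = subprocess.call(cmd)
    return time.monotonic() - start, status


def run_parallel(cmd, copies):
    """Start `copies` instances of cmd at the same time, one thread each."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(timed_run, [cmd] * copies))


# Placeholder command (an assumption): a Python process that exits at once.
results = run_parallel([sys.executable, "-c", "pass"], 4)
for index, (seconds, status) in enumerate(results):
    print(f"copy {index}: {seconds:.2f} real, exit status {status}")
```

Each copy is timed on its own, so reporting just one of them, as is
done here, is reasonable when all copies finish at about the same time.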
monster buildworld -j 40 timings 1x prepatch:
2975.43 real 4409.48 user 12971.42 sys
2999.20 real 4423.16 user 13014.44 sys
monster buildworld -j 40 timings 1x postpatch: (13.3% improvement)
2587.42 real 4328.87 user 8551.91 sys
monster buildworld -j 40 timings 1x COMMIT: (14.1% improvement) <<<
2577.46 real 4125.42 user 13079.62 sys
2552.94 real 4087.60 user 13085.19 sys
monster buildworld -j 40 timings 4x prepatch:
8302.17 real 4629.97 user 17617.84 sys
8308.01 real 4716.70 user 22330.26 sys
monster buildworld -j 40 timings 4x postpatch: (30.2% improvement)
5799.53 real 5254.76 user 23651.73 sys
5800.49 real 5314.23 user 23499.59 sys
monster buildworld -j 40 timings 4x COMMIT: (49.3% improvement) <<<
4207.85 real 4869.90 user 20673.71 sys
4248.45 real 4899.08 user 21697.11 sys
test29 (quad cpu) buildworld -j 8 timings 1x prepatch:
1964.60 real 3004.07 user 1388.79 sys
1963.29 real 3002.82 user 1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch: (9.87% improvement)
1768.93 real 2864.34 user 1212.24 sys
1771.11 real 2875.10 user 1203.29 sys
test29 (quad cpu) buildworld -j 8 timings 1x COMMIT: (11.9% improvement) <<<
1731.45 real 2749.91 user 1106.81 sys
1729.24 real 2756.48 user 1100.90 sys
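The percentages in the headings above can be reproduced from the 'real'
column. The sketch below assumes each figure is the reduction in mean
wall-clock time relative to the prepatch runs. That assumption matches
the 14.1%, 30.2% and 11.9% headings; the 49.3% on the 4x COMMIT line
matches the faster of the two commit runs rather than their mean, which
gives 49.1%. This is a reading of the numbers, not something stated in
the post.

```python
from statistics import mean


def improvement(before, after):
    """Percent reduction in mean wall-clock seconds from `before` to `after`."""
    b, a = mean(before), mean(after)
    return 100.0 * (b - a) / b


# 'real' seconds copied from the timing lines above.
monster_1x_commit = improvement([2975.43, 2999.20], [2577.46, 2552.94])
monster_4x_patch = improvement([8302.17, 8308.01], [5799.53, 5800.49])
monster_4x_commit = improvement([8302.17, 8308.01], [4207.85, 4248.45])
monster_4x_best = improvement([8302.17, 8308.01], [4207.85])
test29_1x_commit = improvement([1964.60, 1963.29], [1731.45, 1729.24])

print(f"monster 1x commit:          {monster_1x_commit:.1f}%")
print(f"monster 4x postpatch:       {monster_4x_patch:.1f}%")
print(f"monster 4x commit (mean):   {monster_4x_commit:.1f}%")
print(f"monster 4x commit (best):   {monster_4x_best:.1f}%")
print(f"test29 1x commit:           {test29_1x_commit:.1f}%")
```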
BUILDKERNEL NO_MODULES=YES TESTS
This set of tests uses a buildkernel without modules, which has
much greater compiler concurrency than a buildworld test since
the make can keep N gcc's running most of the time.
137.95 real 277.44 user 155.28 sys monster -j4 (prepatch)
143.44 real 276.47 user 126.79 sys monster -j4 (patch)
122.24 real 281.13 user 97.74 sys monster -j4 (commit) <<<
89.61 real 196.30 user 59.04 sys test29 -j4 (patch)
86.55 real 195.14 user 49.52 sys test29 -j4 (commit) <<<
167.62 real 360.44 user 4148.45 sys monster -j48 (prepatch)
110.26 real 362.93 user 1281.41 sys monster -j48 (patch)
101.68 real 380.67 user 1864.92 sys monster -j48 (commit) <<<
96.37 real 209.52 user 63.77 sys test29 -j48 (patch)
85.72 real 196.93 user 52.08 sys test29 -j48 (commit) <<<
For the kernel build, an 11.4% improvement at -j4 on monster (only utilizing
4 of the 48 cores, well, at least as far as make -j goes).
For the kernel build, a 39.4% improvement -j48 on monster, utilizing
all 48 cores.
On the test29 quad-core the numbers weren't expected to improve
a whole lot, and they didn't, because single-chip multi-core spin
locks are very, very fast. Surprisingly, though, the -j48 build improved
by quite a bit, around 11%.
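The kernel-build percentages can be checked directly against the timing
rows. The sketch below parses rows in the layout used in the table
above and compares the 'real' column between two kernels; parse_row and
real_improvement are hypothetical helper names, and the baseline for
each comparison (prepatch for monster, patch for test29) is inferred
from which rows exist, not stated in the post. The results agree with
the quoted figures to within rounding.

```python
ROWS = """\
137.95 real 277.44 user 155.28 sys monster -j4 (prepatch)
122.24 real 281.13 user 97.74 sys monster -j4 (commit)
167.62 real 360.44 user 4148.45 sys monster -j48 (prepatch)
101.68 real 380.67 user 1864.92 sys monster -j48 (commit)
96.37 real 209.52 user 63.77 sys test29 -j48 (patch)
85.72 real 196.93 user 52.08 sys test29 -j48 (commit)
"""


def parse_row(line):
    """Split one 'N real N user N sys host -jN (kernel)' row into a dict."""
    f = line.split()
    return {"real": float(f[0]), "user": float(f[2]), "sys": float(f[4]),
            "host": f[6], "jobs": f[7], "kernel": f[8].strip("()")}


def real_improvement(rows, host, jobs, before, after):
    """Percent reduction in wall-clock time between two kernels on one host."""
    real = {r["kernel"]: r["real"] for r in rows
            if r["host"] == host and r["jobs"] == jobs}
    return 100.0 * (real[before] - real[after]) / real[before]


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
monster_j4 = real_improvement(rows, "monster", "-j4", "prepatch", "commit")
monster_j48 = real_improvement(rows, "monster", "-j48", "prepatch", "commit")
test29_j48 = real_improvement(rows, "test29", "-j48", "patch", "commit")

print(f"monster -j4:  {monster_j4:.1f}%")
print(f"monster -j48: {monster_j48:.1f}%")
print(f"test29 -j48:  {test29_j48:.1f}%")
```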
The real improvements are on systems with more cores. Monster, with
48 cores, made for a very good test case.
<dillon at backplane.com>