Performance results / VM related SMP locking work - prelim VM patch
dillon at apollo.backplane.com
Fri Oct 7 11:02:11 PDT 2011
Justin suggested I post this on users at . I've been working on VM system
concurrency in the kernel and have a patch set in development. This
patch is not stable enough yet for commit so it isn't even in master
yet (and certainly won't make it into the upcoming release), but it's
progressed far enough that I am getting some good performance numbers
out of it which I would like to share.
Test Machine specs:
monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
48-cores, 64G ram, running 64-bit DFly 1.11 master
test29: CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
4-cores, 8G ram, running 64-bit DFly 1.11 master
Monster is a 48-core Opteron (4 cpu sockets). Test29 is a quad-core
Phenom II 820 (Deneb). On a per-core basis Test29 is about 1.47x
faster, and of course it is significantly faster dealing with contention
since it is a single-chip cpu vs the 4-socket monster. The many-core
monster is a very good environment for contention testing.
The tests below do not test paging to swap. There is plenty of memory
to cache the source trees, object trees, and the worst-case run-time
memory footprint. These are strictly kernel/cpu contention tests
running in a heavily fork/exec'd environment (aka buildworld -j N).
Because even parallel (-j) buildworlds have a lot of bottlenecks there
just isn't a whole lot of difference between -j 40 and -j 8 on monster.
Usually the CC's run in parallel but then it bottlenecks at the LD line
(as well as waiting for the 'last' cc to finish, which happens a lot
when the buildworld is working on GCC). The slower opteron cores
become very obvious during these bottleneck moments. Only libraries
like libc, or a kernel NO_MODULES build, have enough source files to
actually fan out to all available cpus.
That said, buildworlds exercise more edge cases in the kernel than
a kernel NO_MODULES build, so I prefer using buildworlds for general
testing. I have some -j 4 tests on test29 and monster for a buildkernel
at the end.
To better-utilize available cores on monster the main VM contention
test runs FOUR buildworld -j 40's in parallel instead of one.
The realtime numbers are what matter here (the 'real' column).
Note that the 4x numbers are taken from just one of the builds,
but all four finish at the same time.
monster buildworld -j 40 timings 1x prepatch:
2975.43 real 4409.48 user 12971.42 sys
2999.20 real 4423.16 user 13014.44 sys
monster buildworld -j 40 timings 1x postpatch: (13.7% improvement)
2587.42 real 4328.87 user 8551.91 sys
monster buildworld -j 40 timings 4x prepatch:
8302.17 real 4629.97 user 17617.84 sys
8308.01 real 4716.70 user 22330.26 sys
monster buildworld -j 40 timings 4x postpatch: (30.2% improvement)
5799.53 real 5254.76 user 23651.73 sys
5800.49 real 5314.23 user 23499.59 sys
test29 (quad cpu) buildworld -j 8 timings 1x prepatch:
1964.60 real 3004.07 user 1388.79 sys
1963.29 real 3002.82 user 1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch: (9.87% improvement)
1768.93 real 2864.34 user 1212.24 sys
1771.11 real 2875.10 user 1203.29 sys
* Note that comparing test29 to monster isn't useful except as a
self-check. test29 has a single 4-core cpu chip that runs 1.51x
faster than monster on a per-core basis.
* The reduced contention on monster is very apparent in the 1x 'sys'
numbers. It is a bit unclear why the 4x 'sys' numbers don't reflect
the much-improved real times.
* Generally speaking as we continue to reduce contention, the
4x buildworld -j 40 times should approach the 1x buildworld -j 40
times on monster. The 30% improvement is very significant but
clearly we can do even better. The vm_page allocation and freeing
paths are probably responsible for a lot of the remaining contention.
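
The improvement percentages quoted above can be cross-checked from the
'real' column. The following is only a sketch: the function name is
made up for illustration, and it assumes the improvement is measured
on the mean of the listed runs, which the post does not state. Under
that assumption the 1x monster figure comes out slightly below the
quoted 13.7%, which matches more closely if only the slower prepatch
run is used.

```python
from statistics import mean


def improvement_pct(pre_runs, post_runs):
    """Percent reduction in mean wall-clock time from pre-patch to post-patch.

    Assumption: runs are averaged before comparing; the post does not say
    which runs were used for the quoted percentages.
    """
    pre = mean(pre_runs)
    post = mean(post_runs)
    return (pre - post) / pre * 100.0


# 'real' seconds copied from the buildworld timings above.
monster_1x = improvement_pct([2975.43, 2999.20], [2587.42])
monster_4x = improvement_pct([8302.17, 8308.01], [5799.53, 5800.49])
test29_1x = improvement_pct([1964.60, 1963.29], [1768.93, 1771.11])

for name, pct in [("monster 1x", monster_1x),
                  ("monster 4x", monster_4x),
                  ("test29 1x", test29_1x)]:
    print(f"{name}: {pct:.2f}% less real time")
```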
BUILDKERNEL NO_MODULES=YES TESTS
This set of tests uses a buildkernel without modules, which has
much greater compiler concurrency versus a buildworld test since
the make can keep N gcc's running most of the time.
    137.95 real   277.44 user   155.28 sys    monster -j4  (prepatch)
    143.44 real   276.47 user   126.79 sys    monster -j4  (patch)
     89.61 real   196.30 user    59.04 sys    test29  -j4  (patch)
    167.62 real   360.44 user  4148.45 sys    monster -j48 (prepatch)
    110.26 real   362.93 user  1281.41 sys    monster -j48 (patch)
     96.37 real   209.52 user    63.77 sys    test29  -j48 (patch)
* The -j 4 builds don't have much contention before or after the patch.
Monster was actually slightly faster pre-patch (though I'm not being
scientific here, it's within the realm of error on my part).
* Keep in mind that test29 is a quad-core, so its parallelism is
really only 4. I use a -j 48 build on test29 to approximate other
non-SMP related overheads for comparison to monster's -j 48 build
in the last two results.
Of course, -j 48 on monster is a different story entirely. That will
use all 48 cores in this test.
* test29 is 1.5x faster than monster, hence the 4-core limited results
make sense (89.61 vs 143.44 seconds, which is 1.60x).
* The monster -j 48 kernel build without modules has better compiler
concurrency vs a buildworld. So the final two lines show how
the contention affects the build.
Monster was able to reduce build times from 143 to 110 seconds with
-j 48 but as you can see the system time ballooned massively due
to contention that is still present.
Monster -j 48 pre-patch vs post-patch shows how well contention was
reduced in the patch. 167 seconds vs 110 seconds, a 34.1% improvement!
System time was reduced from 4148 seconds to 1281 seconds.
The interesting thing to note here is the 1281 seconds of system time
the 48-core 48-process compiler concurrency test ate. This clearly
shows what contention still remains. From the ps output (not shown)
it's still mostly associated with the vm_token (probably the vm_page
allocation and freeing path) and vmobj_token (probably the vm_fault
path through the vm_map and vm_object chain). I'll focus on these
later on once I've stabilized what I have already.
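
One rough way to read the monster -j 48 numbers is to turn them into
an average number of busy cores and a system-time share. This is a
sketch only: the helper name is hypothetical, and it assumes that
(user + sys) / real is a fair proxy for how many cores were
accumulating cpu time on average. It says nothing about where inside
the kernel the time went.

```python
def cpu_profile(real, user, sys_time):
    """Return (average busy cores, fraction of cpu time spent in the kernel).

    Assumption: user + sys divided by wall-clock time approximates the
    average number of cores that were busy during the build.
    """
    total_cpu = user + sys_time
    return total_cpu / real, sys_time / total_cpu


# monster -j48 buildkernel timings copied from the table above.
pre_cores, pre_sys_share = cpu_profile(167.62, 360.44, 4148.45)
post_cores, post_sys_share = cpu_profile(110.26, 362.93, 1281.41)

print(f"prepatch: {pre_cores:.1f} busy cores, {pre_sys_share:.0%} sys")
print(f"patch:    {post_cores:.1f} busy cores, {post_sys_share:.0%} sys")
```

With these inputs the system-time share drops from roughly 92% to
roughly 78% of all cpu time, which is still far above the user share
and is consistent with the remaining contention described above.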
Even a -j N kernel build with NO_MODULES=TRUE has two major bottlenecks:
the make depend and the link line at the end, which together account for
(off the cuff) somewhere around 45 seconds of serialized single-core
cpu on monster.
So even in the ideal case monster probably couldn't do this build in
less than about 55 seconds.
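
The post does not say how the ideal-case figure was derived. One
plausible back-of-the-envelope model, sketched here under loudly
stated assumptions, is an Amdahl-style bound: the serialized portion
runs on one core, the rest of the user cpu time spreads perfectly over
all cores, and contention (system time) is zero. The function name and
the choice to subtract the serial time from the measured user time are
illustrative assumptions, not the author's method.

```python
def ideal_build_seconds(user_cpu, serial, cores):
    """Amdahl-style lower bound on wall-clock build time.

    Assumptions (not from the post): the serial seconds are part of the
    measured user cpu time, the remaining cpu work parallelizes perfectly,
    and system time due to contention is ignored entirely.
    """
    parallel_cpu = max(user_cpu - serial, 0.0)
    return serial + parallel_cpu / cores


# 362.93 user seconds from the monster -j48 (patch) run, about 45 seconds
# of serialized make depend + link, 48 cores.
estimate = ideal_build_seconds(362.93, 45.0, 48)
print(f"idealized build time: {estimate:.1f} seconds")
```

Under these assumptions the bound lands a little above 50 seconds,
the same ballpark as the off-the-cuff estimate.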
<dillon at backplane.com>