Phoronix benchmarks DF 3.2 vs Ubuntu - question
alex at alexhornung.com
Fri Jan 4 14:24:09 PST 2013
On 28/11/12 09:36, elekktretterr at exemail.com.au wrote:
>> I didn't really want to comment on this, but well - here I go anyway.
>> Most (all?) the benchmarks mentioned in the article are
>> hardware/compiler benchmarks. Whoever tells you that Dhrystone is a
>> benchmark for anything but the compiler (and the underlying hardware)
>> doesn't know what he's talking about. At a decent optimization level it
>> won't even touch libc functions other than for reporting (it'll use the
>> compiler's optimized strcpy, etc).
> I was commenting on the Himeno benchmark which apparently is as good on
> Debian GNU/kFreeBSD as on Debian GNU/Linux but much slower on pure
> FreeBSD. Whether it's libc or compiler which makes the difference, I don't
> know, but it's clearly not the kernel.
Matt, Venkatesh and I have been investigating the himeno benchmark
results more closely now that we have AVX support and can rule out the
compiler as a difference.
The tests I ran reflected the performance gap that the phoronix
benchmark run showed: we are around 2-2.5x slower than Linux on the
exact same generated code (I compared the generated assembly).
Turns out the issue is L1 cache thrashing. On Linux, the code
allocates the following matrices (as an example):
while on DragonFly, for example, it allocates them at the following
addresses:
Since the L1 cache is (in this and most cases) virtually indexed and
physically tagged, and the lower (virtual) address bits used for
indexing are the same across DragonFly's allocations, we hit the
associativity limit of the cache, while on Linux the allocations hit
different lines/indexes in the cache.
Changing the allocation in the himeno benchmark to use sbrk with an
appropriate offset, the generated addresses look like this:
and with otherwise unchanged code, the performance on DragonFly is
pretty much the same as on Linux:
Core i7 2720QM results (note: results fluctuate a bit, but they are
pretty much always +- 60 MFLOPS):
DragonFly (dmalloc): 553 MFLOPS
DragonFly (sbrk): 1311 MFLOPS
Linux: 1275 MFLOPS
The solution to this is changing our dmalloc and/or nmalloc to either
simply offset (large) allocations, or even do proper cache colouring.