Phoronix benchmarks DF 3.2 vs Ubuntu - question
alex at alexhornung.com
Fri Jan 4 14:24:09 PST 2013
On 28/11/12 09:36, elekktretterr at exemail.com.au wrote:
>> I didn't really want to comment on this, but well - here I go anyway.
>> Most (all?) the benchmarks mentioned in the article are
>> hardware/compiler benchmarks. Whoever tells you that Dhrystone is a
>> benchmark for anything but the compiler (and the underlying hardware)
>> doesn't know what he's talking about. At a decent optimization level it
>> won't even touch libc functions other than for reporting (it'll use the
>> compiler's optimized strcpy, etc).
> I was commenting on the Himeno benchmark which apparently is as good on
> Debian GNU/kFreeBSD as on Debian GNU/Linux but much slower on pure
> FreeBSD. Whether it's libc or compiler which makes the difference, I don't
> know, but it's clearly not the kernel.
Matt, Venkatesh and I have been investigating the himeno benchmark
results more closely now that we have AVX support and can rule out the
compiler as a difference.
The tests I ran reflected the performance gap that the phoronix
benchmark run showed: we are around 2-2.5x slower than Linux on the
exact same generated code (I compared the generated assembly).
Turns out the issue is L1 cache thrashing. On Linux, the code
allocates the following matrices (as an example):
while on DragonFly, for example, it allocates them at the following
addresses:
Since the L1 cache is (in this and most cases) virtually indexed and
physically tagged, and the lower (virtual) address bits used for
indexing are the same across DragonFly's allocations, we hit the
associativity limit of the cache, while on Linux the allocations hit
different lines/indexes in the cache.
Changing the allocation in the himeno benchmark to use sbrk with an
appropriate offset, the generated addresses look like this:
and with otherwise unchanged code, the performance on DragonFly is
pretty much the same as on Linux:
Core i7 2720QM results (note: results fluctuate a bit, but they are
pretty much always +- 60 MFLOPS):
DragonFly (dmalloc): 553 MFLOPS
DragonFly (sbrk): 1311 MFLOPS
Linux: 1275 MFLOPS
The solution to this is changing our dmalloc and/or nmalloc to either
simply offset (large) allocations, or even do proper cache colouring.