pipe testing and kernel copyin/copyout/bcopy performance

Thu Apr 29 13:41:43 PDT 2004

:I've also read something about caches not being updated by using SSE 
:instructions such that if you refer to the memory you just copied that the 
:wins for having used SSE in the copy are much diminished.
:
:Dave

    These are the so-called 'non-temporal' instructions.  So, for example,
    the standard 128 bit move instruction is 'movdqa' or 'movdqu' (for
    double-quad-aligned or double-quad-unaligned).  The non-temporal 
    version is 'movntdq'.
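
    As a rough illustration of the two kinds of store, here is a minimal
    sketch in C using the SSE2 intrinsics from <emmintrin.h>.  The function
    names copy_temporal() and copy_nontemporal() are made up for this
    example, and it assumes 16-byte-aligned buffers whose length is a
    multiple of 16.  Compilers typically emit movdqa for _mm_store_si128()
    and movntdq for _mm_stream_si128(), but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>          /* SSE2 intrinsics */
#endif

/* Hypothetical helper: copy with ordinary (cached) 128-bit stores. */
static void copy_temporal(void *dst, const void *src, size_t len)
{
    /* Assumption of this sketch: aligned buffers, whole 16-byte blocks. */
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(len % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_store_si128(&d[i], _mm_load_si128(&s[i]));   /* usually movdqa */
#else
    memcpy(dst, src, len);      /* fallback when SSE2 is not available */
#endif
}

/* Hypothetical helper: same copy, but with non-temporal stores. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(len % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));  /* usually movntdq */
    _mm_sfence();               /* order the streamed stores before later stores */
#else
    memcpy(dst, src, len);      /* fallback when SSE2 is not available */
#endif
}
```

    Both functions produce the same bytes in the destination; only the
    way the stores interact with the caches differs.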

    The non-temporal instructions supposedly queue directly to memory and
    do not 'pollute' the caches.  You can max out memory bandwidth using
    non-temporal instructions (on the Athlon 64 this is about double the
    write bandwidth you can get using normal writes).  However, the problem
    with this is that even maxed out memory only has 1/4 the bandwidth of
    the L1 cache, so if you write a general bcopy() function using
    non-temporal writes it will have great performance for huge multi-megabyte
    copies but horrendously bad performance for block sizes that fit in
    the L1/L2 caches, like 16K, 32K, 64K, even 256K (which easily fits in
    an Athlon 64's L2 cache).
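
    One way to act on this tradeoff, sketched here as an assumption and
    not as anything that was implemented or measured in this thread, is to
    pick the copy path by size: ordinary stores for anything that might fit
    in cache, non-temporal stores only for very large copies.  The name
    sized_copy() and the 1 MB cutoff NT_COPY_THRESHOLD are hypothetical;
    a real cutoff would have to come from benchmarking the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>          /* SSE2 intrinsics */
#endif

/* Hypothetical cutoff, NOT a measured value: copies at least this large
 * are assumed to be too big to benefit from staying in the L1/L2 caches. */
#define NT_COPY_THRESHOLD ((size_t)1 << 20)

/* Hypothetical size-dispatching copy.
 * Returns 1 if the non-temporal path was taken, 0 otherwise. */
static int sized_copy(void *dst, const void *src, size_t len)
{
#if defined(__SSE2__)
    if (len >= NT_COPY_THRESHOLD &&
        (uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t blocks = len / 16;

        /* Large copy: stream whole 16-byte blocks past the caches. */
        for (size_t i = 0; i < blocks; i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();           /* order the streamed stores */

        /* At most 15 trailing bytes are finished with ordinary stores. */
        memcpy((char *)dst + blocks * 16,
               (const char *)src + blocks * 16, len % 16);
        return 1;
    }
#endif
    /* Small or unaligned copy: ordinary stores, which stay in cache. */
    memcpy(dst, src, len);
    return 0;
}
```

    With this shape a 16K or 256K copy never touches the non-temporal
    path, so the cache-sized cases keep their normal performance.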

    You also cannot mix normal writes with non-temporal writes.  Well, you
    *can* mix the instructions, but the result will be truly hideous (versus
    simply horrible) memory performance... I saw a 3 GByte/sec test drop to
    < 100 MBytes/sec when I replaced half the movdqa's with movntdq's.
    I tried using non-temporals with XMM, MMX, and even integer registers
    (movnti instruction), with the same hideous results.

    The effects are probably different on Intel chips.  All the testing I did
    was on Athlon 64's (Athlon 3200+).

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>