lkwt in DragonFly

Sat Feb 7 17:08:14 PST 2004

:  But if your vnode lock observations are accurate (i.e., that we hit
:  the same-cpu case a lot more than the foreign cpu case), then even
:  with lock-prefixed instructions we should not be noticing the same
:  penalty as if cache invalidations need to occur.
:
:  I have always wondered what the effects of a lock-prefixed instruction
:  are with respect to the data caches; in other words, say I atomically
:  grab a mutex and then release it only to grab it again on the same cpu
:  a little while later, then the cost of the regrabbing of the lock
:  atomically should not be the same as when I am initially atomically
:  grabbing a mutex previously owned by another CPU.  So I dug and dug
:  and as it turns out, on processors later than the Pentium Pro, my
:  assumption seems to be correct:

    Yes, the Intel documentation is correct, and both John Baldwin and I
    have run tests confirming that relocking on the same cpu is much
    faster.  However, Intel is also correct with regard to pipeline effects,
    as in the last little bit you included from the document:

:Unfortunately, the impact on the processor pipelines is not the same.
:[...]

    A later respondent indicated that the performance penalty was a factor
    of 16.  I'm not sure that is correct, but I do know that cmpxchg has
    a horrible effect on the processor pipeline.  When I get home I will run
    some tests.
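
    A minimal sketch of the kind of test meant here, not the actual test
    that was run: it relocks an uncontended lock word on one cpu, once
    with an atomic compare-and-swap and once with plain loads and stores
    as a baseline.  The names relock_atomic, relock_plain and now_ns are
    hypothetical, and C11 atomics are an assumption; on x86, compilers
    typically emit a lock-prefixed cmpxchg for the compare-and-swap.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Acquire and release an uncontended lock word n times using an atomic
 * compare-and-swap.  Everything runs on the calling cpu, so the cache
 * line stays local (the "same-cpu relock" case).  Returns the number
 * of successful acquisitions and stores the elapsed time.
 */
static long relock_atomic(long n, uint64_t *elapsed_ns)
{
    atomic_int lock = 0;
    long acquired = 0;
    uint64_t start = now_ns();

    for (long i = 0; i < n; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock, &expected, 1)) {
            acquired++;
            atomic_store_explicit(&lock, 0, memory_order_release);
        }
    }
    *elapsed_ns = now_ns() - start;
    return acquired;
}

/*
 * Baseline: the same loop with plain volatile accesses.  This is NOT a
 * valid lock across cpus; it only shows the cost without the atomic
 * instruction and its ordering requirements.
 */
static long relock_plain(long n, uint64_t *elapsed_ns)
{
    volatile int lock = 0;
    long acquired = 0;
    uint64_t start = now_ns();

    for (long i = 0; i < n; i++) {
        if (lock == 0) {
            lock = 1;
            acquired++;
            lock = 0;
        }
    }
    *elapsed_ns = now_ns() - start;
    return acquired;
}
```

    Calling both with the same n and dividing each elapsed time by n
    gives a rough per-iteration cost.  The ratio depends heavily on the
    processor and compiler flags, so no particular factor is assumed.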

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>

:   So I guess the point is merely that for reasonably warm caches, the
:   overhead of a bus-locked instruction is mitigated.  Although, as also
:   noted, the fact that ordering needs to be ensured still sucks. 
:
:-- 
:Bosko Milekic  *  bmilekic at xxxxxxxxxxxxxxxx  *  bmilekic at xxxxxxxxxxx