In-pipeline instruction timing tests

Mon Feb 9 10:18:04 PST 2004

    Here are some basic instruction timing tests.  My particular interest
    is in the compare/jz/addl-to-mem test versus a cmpxchgl or locked
    cmpxchgl.  The compare/jz/addl-to-mem test simulates the new token
    code overhead (minus the %fs load-from-memory which I cannot easily
    simulate from userland), while a locked compare-exchange simulates
    a mutex.
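
    As a rough illustration of the two code paths being compared, here
    is a minimal C sketch.  It is not the test code used for the numbers
    in this post: the function names cmp_add() and locked_cmpxchg() are
    hypothetical, the condition tested by the compare is an assumption,
    and the C11 atomic call is only a portable stand-in for a hand-written
    lock; cmpxchgl (on x86 a compiler will normally emit one for it).

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Token-style path (hypothetical): compare, conditional branch, then a
 * plain add to memory with no lock prefix.  What is actually compared
 * in the real token code is not shown in the post; a nonzero flag is
 * assumed here purely for illustration.
 */
static inline void cmp_add(volatile int *flag, volatile int *counter)
{
    if (*flag != 0)         /* cmp + conditional jump */
        *counter += 1;      /* unlocked add to memory */
}

/*
 * Mutex-style path (hypothetical): try to move the lock word from
 * 0 (free) to 1 (held) with an atomic compare-and-exchange.
 * Returns true if the lock was acquired.
 */
static inline bool locked_cmpxchg(atomic_int *lock_word)
{
    int expected = 0;
    return atomic_compare_exchange_strong(lock_word, &expected, 1);
}
```

    The first function touches memory without ever asserting a locked
    bus cycle; the second cannot avoid one, which is the cost the tests
    are meant to expose.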

    Note in particular that a cmp/jz/addl sequence seems to be far better
    pipelined on both the AMD64 and a P4 than a cmpxchgl no matter which
    way you turn it, and that *ANY* locked bus cycle instruction does really
    horrible things to the cpu's pipeline.

                                    2xP3      AMD64     1xP4
                                    1.2GHz    3200+     1.7GHz

cpu_add,     addl to mem            1.535ns   0.194ns   0ns (1)
cpu_ladd,    lock; addl to mem      37.50ns   7.869ns   69.660ns
cpu_call,    1 call/ret             3.934ns   1.921ns   3.550ns
cpu_cmpadd,  cmp/jz/addl to mem     4.027ns   0.583ns   0.765ns
cpu_cmpexg,  cmpxchgl               6.420ns   2.169ns   7.100ns
cpu_lcmpext, lock; cmpxchgl         42.84ns   7.479ns   72.35ns

	note(1): addl to mem is completely absorbed or almost
	completely absorbed by the cpu's pipeline in this test.
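
	Per-operation figures of this kind are typically obtained by
	running the operation in a tight loop and dividing the elapsed
	time by the iteration count.  The sketch below shows that idea
	only; it is not the harness that produced the table.  The names
	time_add_ns() and time_locked_add_ns() are hypothetical, clock()
	is used for portability, the loop overhead is not subtracted, and
	atomic_fetch_add() stands in for lock; addl.

```c
#include <stdatomic.h>
#include <time.h>

/* Average nanoseconds per iteration for an unlocked add to memory. */
static double time_add_ns(long iters, volatile int *target)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        *target += 1;                   /* plain read-modify-write */
    clock_t end = clock();
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}

/* Average nanoseconds per iteration for an atomic (locked) add. */
static double time_locked_add_ns(long iters, atomic_int *target)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        atomic_fetch_add(target, 1);    /* atomic read-modify-write */
    clock_t end = clock();
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

	The iteration count needs to be large (many millions) for
	sub-nanosecond operations to rise above the resolution of the
	clock, and the results include the cost of the loop itself.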

    In any case, this really solidifies my desire to avoid locked bus
    cycle instructions.

						-Matt
						Matthew Dillon
						<dillon at xxxxxxxxxxxxx>