token passing & IPIs

Thu Jul 17 12:47:42 PDT 2003

:Hi Matt,
:
:I stumbled over your DragonFly site and read the design information
:with interest. The design looks a bit AmigaOS inspired to me, except
:that you use multithreaded servers. 
:
:One thing I wanted to comment on is your locking design using token
:passing primitives.  It says you want to use IPIs to pass tokens
:around. One problem I see with that is that IPIs can be rather slow.
:Last time I measured them one IPI on a fairly fast system (4 CPU
:Opteron) was around ~1000 cycles from sending to single CPU entry IPI
:handler [this was on Linux/x86-64, but that already has a quite well
:optimized path for it]. I'm not sure where the overhead comes from,
:I suspect it adds together from all the microcode that needs 
:to be executed. On P4 based systems I would expect it to be slower,
:they are very slow at anything exceptional that disrupts their trace
:cache.

    I am CCing this response to my list for archival purposes, because I
    wound up spending a good hour formulating it.  Good questions!

    Yes, IPIs are very expensive.  An IPI can take 1uS or longer.
    The problem is that the inter-APIC bus between CPUs is (1) slow and
    (2) the APIC itself is a microprocessor from what I can tell and doesn't
    run instructions nearly as fast as the main cpu does.

    The key to using an IPI properly is to use it asynchronously whenever
    possible... that is, to not have to wait for a reply to your IPI message.
    A great deal of DragonFly's IPI operations are asynchronous in this
    manner.  The LWKT scheduler for example is asynchronous.  The token code
    is currently synchronous and thus non-optimal but there are plenty of
    optimizations that can be made to reduce non-optimal instances, such as
    proactively handing ownership of a token off to another cpu (which
    can remove 99.9% of the synchronous IPIs that occur in a typical pipe
    device, for example).

    I'm not really worried about latency in the asynchronous case because
    the source cpu does not have to wait for delivery or response... it's 
    a matter of a few instructions and then the source cpu goes on to other
    things.  Sure, the source cpu might occassionally have to spin waiting
    for the previous IPI message it sent to be delivered before it can 
    queue the next one, but that is pipelined fairly well by the APIC
    and doesn't normally happen in the critical (performance-killing) path.

:A lot of IPIs can even live-lock a CPU, I've seen that problem with remote
:TLB flushes on some extreme workloads.

    This is a design issue only.  Basically it comes down to the fact
    that interrupt disablement (cli) will also block IPIs, and that if you
    take an IPI which itself can send an IPI you have to be sure to enable
    interrupts while you are looping in order to allow incoming IPIs to
    at least be flagged and EOI'd (though not actually run as this could
    lead to a live-locked recursion).  Livelock can only happen if you spin
    on an outgoing IPI request without handling incoming IPIs.  At least that
    is my understanding.  I could be wrong.

:The IPI was non NMI, so it was possible that some CLI delays on the
:target CPU were included, but I did take quite a lot of samples and it
:was always in that range.  I suspect especially on the older Intel
:boxes (P3) it is even slower, because they have a rather underpowered
:APIC bus instead of the fast HyperTransport interconnect of the
:Opteron. 

    Pipelining IPI requests is critical (i.e. starting an IPI command and
    not waiting for it to be accepted before returning, then spinning
    on the previous IPI command pending bit when the next one is sent).

:Passing a cache line in a spinlock around is much less costly. I
:understand you want to use much less locks than a fine grained locked
:kernel (a worthy goal; e.g. Linux can already take >20 locks for a
:single write call and it hurts). But some locks cannot be avoided,
:like the protection for lookups in buffer or directory caches, which
:cannot be handled CPU local because they need to manage system global
:information. With a locking primitive that has so much more overhead
:you will be at a great disadvantage in these important fast paths.
:
:Have you put any thoughts into these issues yet?

    Well, I disagree somewhat on locks not being avoidable.  Consider the
    case where you have two entirely independant machines operating in a
    cluster.  These machines obviously do not have to get low level locks
    to protect each from the other, mainly because each machine has its
    own entirely independant set of caches.

    This implies that it is possible to implement caching algorithms which
    are far more aware of an MP environment and make better use of per-cpu
    caches (which do not require locks), possibly even to the point of
    duplicating meta-data in each cpu's cache.

    Take a route table for example.  Traditionally you lock a route table up
    by obtaining a global mutex.  But what if you duplicated the entire
    route table on each cpu in the system?  If the route table is a bottleneck
    in an MP system and only eats, say, 1MB of ram, then eating 8MB of ram
    on an 8cpu system with 8G of ram is not a big deal, and you suddenly have
    an entirely new situation where now route table lookups can be performed
    without any locking whatsoever at the cost of making route table updates
    (which have to be sent to all cpus) more expensive.  But if you are
    dealing with 10,000 route table lookups per second and maybe one route
    table update per second it's obvious that the lookups are the critical
    path.

    VM Objects and other infrastructure can be worked the same way, or worked
    in a hybrid fashion (e.g. only duplicate VM objects for threaded processes,
    for example).

:Another thing I was missing was a description on how the system call message
:passing algorithm was supposed to work. I assume it just does a int 0x80
:currently? Is there a fast path for sending messages to user space servers?
:
:Thanks,
:
:-Andi

    Well it isn't written yet, but yes, it will just do something like an
    INT 0x80.  You can't really fast-path user<->kernel transitions, you
    can only minimized the overhead involved after you've gotten past pushing
    and reloading the necessary registers and segment registers (and vise
    versa when leaving the kernel back into userland).  The unavoidable
    operations generally take at least 200ns to accomplish on a reasonably
    fast cpu.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>