token passing & IPIs
Matthew Dillon
dillon at apollo.backplane.com
Thu Jul 17 12:47:42 PDT 2003
:Hi Matt,
:
:I stumbled over your DragonFly site and read the design information
:with interest. The design looks a bit AmigaOS inspired to me, except
:that you use multithreaded servers.
:
:One thing I wanted to comment on is your locking design using token
:passing primitives. It says you want to use IPIs to pass tokens
:around. One problem I see with that is that IPIs can be rather slow.
:Last time I measured them one IPI on a fairly fast system (4 CPU
:Opteron) was around ~1000 cycles from sending to single CPU entry IPI
:handler [this was on Linux/x86-64, but that already has a quite well
:optimized path for it]. I'm not sure where the overhead comes from,
:I suspect it adds together from all the microcode that needs
:to be executed. On P4 based systems I would expect it to be slower,
:they are very slow at anything exceptional that disrupts their trace
:cache.
I am CCing this response to my list for archival purposes, because I
wound up spending a good hour formulating it. Good questions!
Yes, IPIs are very expensive. An IPI can take 1uS or longer.
The problem is that the inter-APIC bus between CPUs is (1) slow and
(2) the APIC itself is a microprocessor from what I can tell and doesn't
run instructions nearly as fast as the main cpu does.
The key to using an IPI properly is to use it asynchronously whenever
possible... that is, to not have to wait for a reply to your IPI message.
A great deal of DragonFly's IPI operations are asynchronous in this
manner. The LWKT scheduler for example is asynchronous. The token code
is currently synchronous and thus non-optimal but there are plenty of
optimizations that can be made to reduce non-optimal instances, such as
proactively handing ownership of a token off to another cpu (which
can remove 99.9% of the synchronous IPIs that occur in a typical pipe
device, for example).
I'm not really worried about latency in the asynchronous case because
the source cpu does not have to wait for delivery or response... it's
a matter of a few instructions and then the source cpu goes on to other
things. Sure, the source cpu might occassionally have to spin waiting
for the previous IPI message it sent to be delivered before it can
queue the next one, but that is pipelined fairly well by the APIC
and doesn't normally happen in the critical (performance-killing) path.
:A lot of IPIs can even live-lock a CPU, I've seen that problem with remote
:TLB flushes on some extreme workloads.
This is a design issue only. Basically it comes down to the fact
that interrupt disablement (cli) will also block IPIs, and that if you
take an IPI which itself can send an IPI you have to be sure to enable
interrupts while you are looping in order to allow incoming IPIs to
at least be flagged and EOI'd (though not actually run as this could
lead to a live-locked recursion). Livelock can only happen if you spin
on an outgoing IPI request without handling incoming IPIs. At least that
is my understanding. I could be wrong.
:The IPI was non NMI, so it was possible that some CLI delays on the
:target CPU were included, but I did take quite a lot of samples and it
:was always in that range. I suspect especially on the older Intel
:boxes (P3) it is even slower, because they have a rather underpowered
:APIC bus instead of the fast HyperTransport interconnect of the
:Opteron.
Pipelining IPI requests is critical (i.e. starting an IPI command and
not waiting for it to be accepted before returning, then spinning
on the previous IPI command pending bit when the next one is sent).
:Passing a cache line in a spinlock around is much less costly. I
:understand you want to use much less locks than a fine grained locked
:kernel (a worthy goal; e.g. Linux can already take >20 locks for a
:single write call and it hurts). But some locks cannot be avoided,
:like the protection for lookups in buffer or directory caches, which
:cannot be handled CPU local because they need to manage system global
:information. With a locking primitive that has so much more overhead
:you will be at a great disadvantage in these important fast paths.
:
:Have you put any thoughts into these issues yet?
Well, I disagree somewhat on locks not being avoidable. Consider the
case where you have two entirely independant machines operating in a
cluster. These machines obviously do not have to get low level locks
to protect each from the other, mainly because each machine has its
own entirely independant set of caches.
This implies that it is possible to implement caching algorithms which
are far more aware of an MP environment and make better use of per-cpu
caches (which do not require locks), possibly even to the point of
duplicating meta-data in each cpu's cache.
Take a route table for example. Traditionally you lock a route table up
by obtaining a global mutex. But what if you duplicated the entire
route table on each cpu in the system? If the route table is a bottleneck
in an MP system and only eats, say, 1MB of ram, then eating 8MB of ram
on an 8cpu system with 8G of ram is not a big deal, and you suddenly have
an entirely new situation where now route table lookups can be performed
without any locking whatsoever at the cost of making route table updates
(which have to be sent to all cpus) more expensive. But if you are
dealing with 10,000 route table lookups per second and maybe one route
table update per second it's obvious that the lookups are the critical
path.
VM Objects and other infrastructure can be worked the same way, or worked
in a hybrid fashion (e.g. only duplicate VM objects for threaded processes,
for example).
:Another thing I was missing was a description on how the system call message
:passing algorithm was supposed to work. I assume it just does a int 0x80
:currently? Is there a fast path for sending messages to user space servers?
:
:Thanks,
:
:-Andi
Well it isn't written yet, but yes, it will just do something like an
INT 0x80. You can't really fast-path user<->kernel transitions, you
can only minimized the overhead involved after you've gotten past pushing
and reloading the necessary registers and segment registers (and vise
versa when leaving the kernel back into userland). The unavoidable
operations generally take at least 200ns to accomplish on a reasonably
fast cpu.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>
More information about the Kernel
mailing list