token passing & IPIs
Matthew Dillon
dillon at apollo.backplane.com
Thu Jul 17 18:57:37 PDT 2003
:> The key to using an IPI properly is to use it asynchronously whenever
:> possible... that is, to not have to wait for a reply to your IPI message.
:
:You will eat a lot of overhead from the interrupt processing
:when receiving one on the target CPU. You can probably make it a fast path
:(handler in pure assembler etc.), but it still will be somewhat costly.
:Just take a look at the pseudo code of "INT" in the Intel manual.
:It's several pages. Processing that just must be slow. And it even
:includes several locked cycles on the bus (IDT/GDT accesses are normally
:locked)
First of all, the Intel documentation is showing you how interrupts
are entered for 6 totally different situations. Only one of those six
is what actually gets executed... typically "interrupt-to-inner-privilege"
or "interrupt-to-same-privilege". The only difference between the two
has to do with the additional stack information pushed for a ring
transition. Nearly all the checks you see written out in their gory
detail are done in parallel in hardware. The only thing that actually
takes time is pushing some data onto the stack, roughly equivalent to
a procedure call, and the vector fetch, which I am fairly sure does not
use locked bus cycles (it's not modifying anything, there's no reason to).
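And remember, the whole point is to use the IPI asynchronously. The
messaging core boils down to something like the sketch below (names and
sizes are invented, it is not the real code): the sender appends to a
per-target FIFO that only it writes and only the target drains, so
neither side needs a locked bus cycle and the sender never waits for a
reply.

    /*
     * Sketch of an asynchronous IPI message queue (made-up names).
     * Each (source,target) cpu pair gets its own ring: one writer,
     * one reader, so no atomic ops or locked cycles are required.
     */
    struct ipiq_msg {
        void    (*func)(void *arg);     /* runs on the target cpu */
        void    *arg;
    };

    struct ipiq {
        struct ipiq_msg ring[64];
        volatile int    windex;         /* advanced only by the sender */
        volatile int    rindex;         /* advanced only by the target */
    };

    /* source cpu: queue the message, poke the target, return */
    static void
    ipi_send_async(struct ipiq *q, int target_cpu,
                   void (*func)(void *), void *arg)
    {
        int w = q->windex;

        q->ring[w & 63].func = func;
        q->ring[w & 63].arg = arg;
        q->windex = w + 1;              /* data written before the index */
        hw_send_ipi(target_cpu);        /* hypothetical: fire the IPI */
    }

    /* target cpu: drain the ring from the IPI interrupt vector */
    static void
    ipi_intr(struct ipiq *q)
    {
        while (q->rindex != q->windex) {
            struct ipiq_msg *m = &q->ring[q->rindex & 63];
            m->func(m->arg);
            ++q->rindex;
        }
    }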
Secondly, you are seriously underestimating the cost of acquiring a
mutex. An uncontested mutex acquisition and release on a 1GHz P3
costs around 40ns if you assembly optimize it, and well over 300ns if
a cache mastership change needs to occur. That is not cheap,
especially considering that (IPI) interrupt overhead is typically only
around 1uS. The threading model saves a number of mutex operations
right off the bat, and a properly implemented token model can potentially
save a lot more. For example, take a pipe. Using a token instead of a
mutex to control a pipe can result in completely uncontested access on
IA32 if the current token holder releases the token to the cpu on the other side
of the pipe. That is unbeatable.
With a mutex you are at a minimum doing FOUR mutex operations. cpu 1
acquires, then releases, cpu 2 acquires, then releases, at a cost of
around (300+40)*2 = 680ns on my dual cpu test box. That same box can
run a full duplex pipe test at 16.8uS per transaction (with the MP lock
no less), so the mutex overhead *JUST* for access to the pipe structure
is going to eat 4% of the performance. If you add other mutex overheads,
such as on the file descriptor array and so forth, it can cost a lot more.
The premise of the token methodology is that one can completely avoid
all locked bus cycles in the best case, avoid most of the 'expensive'
cases by design, and that the remaining expensive cases wind up costing
less than all the mutex equivalents you would otherwise have had to do
between expensive cases. And if it turns out not to work out that way
we can always treat certain highly contended tokens like mutexes where
necessary, since a mutex is actually a degenerate case of a token (in
other words, a token is a far more powerful vehicle than a mutex, but if
you really need to you can operate a token like a mutex).
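To illustrate that, here is a rough sketch (field and helper names are
invented, and the real implementation has to deal with the owner
changing out from under you, which I'm ignoring here). Ownership is
per-cpu rather than per-thread, so re-acquiring a token your cpu
already holds is nothing more than a compare:

    /*
     * Sketch of a cpu-owned token (hypothetical names).  The common
     * case, where the current cpu already owns the token, costs a
     * single compare and no locked bus cycles at all.
     */
    struct token_sketch {
        volatile int    t_cpu;      /* cpu that currently owns the token */
    };

    void
    token_acquire(struct token_sketch *tok)
    {
        int mycpu = curcpu();       /* hypothetical: current cpu id */

        if (tok->t_cpu == mycpu)
            return;                 /* fast path: no atomics, no IPI */

        /*
         * Expensive case: ask the owning cpu, via an asynchronous
         * IPI, to set t_cpu to us, then wait for the handoff.  A
         * mutex is just the degenerate case where you always force
         * this transfer and hold the token across blocking.
         */
        ipi_request_token(tok->t_cpu, tok, mycpu);  /* hypothetical */
        while (tok->t_cpu != mycpu)
            cpu_pause();            /* hypothetical: spin-wait hint */
    }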
:But they do not present a single image either.
:
:If they have the same /, every time you have a path name that starts
:with / you have to read-lock the root inode and the cache where it is
:stored somehow. The same applies to other often accessed file objects.
:
:Unix unfortunately has some shared data by design that cannot be easily split.
:
:You could avoid the problem of shared root by using plan9 like name spaces,
:but that would be quite a radical design change and may not be very nice
:for the user.
No, this is simply not true. This has nothing whatsoever to do with
'UNIX' per se, it is simply the caching algorithms you choose to use.
There is nothing preventing us from duplicating the namei cache for the
top two levels of the filesystem, which are virtually never modified.
In fact, there is nothing preventing us from duplicating portions of
the namei cache ON-DEMAND (ad-hoc) on a per-cpu basis. What we are
doing is implementing the equivalent of a MESI cache model locally
between cpus for the purposes of managing certain memory based caches,
like the path cache. The cost is that invalidating an entry requires an
expensive interaction with all the cpus on the system that might have
copied the entry, but the advantage of the local cache is virtually zero
overhead and no locking operations at all. This would be one of the
easiest heuristics to write, too.
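As a rough sketch of what such a per-cpu front-end could look like
(made-up structures and helpers, with the shared namei cache staying
around as the backing store):

    /*
     * Sketch of a per-cpu name cache front-end (hypothetical names).
     * Lookups touch only this cpu's private hash, so they need no
     * locking and cause no cache mastership changes.  Invalidation is
     * the rare, expensive case: broadcast an asynchronous IPI telling
     * every cpu to drop its local copy, MESI-style but in software.
     */
    struct pcpu_ncentry {
        struct pcpu_ncentry *nce_next;
        char                 nce_name[64];
        void                *nce_vp;        /* cached vnode, etc. */
    };

    struct pcpu_ncache {
        struct pcpu_ncentry *hash[256];     /* private to one cpu */
    };

    struct pcpu_ncentry *
    pcpu_ncache_lookup(struct pcpu_ncache *nc, const char *name, int hv)
    {
        struct pcpu_ncentry *nce;

        for (nce = nc->hash[hv & 255]; nce; nce = nce->nce_next) {
            if (strcmp(nce->nce_name, name) == 0)
                return (nce);               /* pure local memory, no locks */
        }
        return (NULL);                      /* miss: consult the shared cache */
    }

    void
    pcpu_ncache_invalidate(char *name)
    {
        /*
         * Expensive case, but rare for the top levels of the
         * filesystem: every cpu that might hold a copy is told,
         * asynchronously, to zap it from its local hash.
         */
        ipi_broadcast_async(pcpu_ncache_zap, name); /* hypothetical */
    }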
:As long as you have a single name space you need some amount of
:synchronization because there is single shared data.
There is not necessarily a single copy of the shared data. You
can't avoid synchronization when modifying a potentially duplicated
piece of shareable data, but you can certainly avoid synchronization
when accessing it.
:> This implies that it is possible to implement caching algorithms which
:> are far more aware of an MP environment and make better use of per-cpu
:> caches (which do not require locks), possibly even to the point of
:> duplicating meta-data in each cpu's cache.
:
:You are probably referring to McVoy's cc/cluster concept here. But even
:that requires some amount of synchronization, e.g. for the single
:file name space.
I have no idea what McVoy's cc/cluster concept is. The only concept
I am talking about is a standard MESI cache algorithm applied to software
cached data rather than hardware. Maybe that is the model he is using
too.
:> on an 8cpu system with 8G of ram is not a big deal, and you suddenly have
:> an entirely new situation where now route table lookups can be performed
:> without any locking whatsoever at the cost of making route table updates
:> (which have to be sent to all cpus) more expensive. But if you are
:
:Most modern TCPs allocate new routes on the fly to save path MTU discovery
:data. You don't want to do that duplication for each new connection.
:In theory you could save that information per CPU, but it may not
:be a very good trade off because sending packets with the wrong MTU
:can be very costly in performance.
I don't see the problem. The only thing one manages here is memory.
If one requires MP performance and one has enough memory, dedicating
a few megabytes per cpu is not a big deal and a few megabytes can
cache a huge amount of data... you could probably avoid 95% of the
locking operations you would otherwise have to do on a global shared
cache.
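For example, a per-cpu path-MTU / route cache could look something like
this (made-up structures and helpers, reusing the asynchronous
ipi_send_async() idea sketched earlier). The lookup on the packet path
is plain local memory; only an update has to visit every copy, and
that is rare:

    /*
     * Sketch of a replicated per-cpu route / path-MTU cache
     * (hypothetical names).  A few megabytes per cpu buys lock-free
     * lookups on the hot path; updates are pushed to every copy via
     * asynchronous IPIs, the rare and tolerable expensive case.
     */
    struct rt_lite {
        unsigned int    rt_dst;         /* destination address */
        unsigned int    rt_gw;          /* next hop */
        unsigned short  rt_mtu;         /* discovered path MTU */
    };

    struct pcpu_rtcache {
        struct rt_lite  tab[1024];      /* private to one cpu */
    };

    /* hot path: runs on the sending cpu, no locks, no atomics */
    static struct rt_lite *
    rt_lookup_local(struct pcpu_rtcache *rc, unsigned int dst)
    {
        struct rt_lite *rt = &rc->tab[rt_hash(dst) & 1023];

        return (rt->rt_dst == dst ? rt : NULL); /* NULL: use shared table */
    }

    /* cold path: replicate an MTU update to every cpu's copy */
    static void
    rt_update_mtu(unsigned int dst, unsigned short mtu)
    {
        int cpu;

        for (cpu = 0; cpu < ncpus; cpu++)
            ipi_send_async(&ipiq_to[cpu], cpu, rt_apply_update,
                           rt_make_update(dst, mtu));
    }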
:Another example would be the local ports hash table of your UDP and TCP.
:You could try to split it per CPU, but it would limit the user
:unduly (the 16-bit TCP/UDP port space is a very limited resource; limiting
:it to less than 16 bits would be not very nice).
Again, I don't see a problem here. In a TCP connection one port is
typically fixed and one is typically mobile. This means that you can
simply partition the connection set by (srcPort ^ dstPort) % N where N
is the number of cpus you are running your TCP stack on. This simple
partitioning scheme would handle 90% of the problem space statistically,
without any effort and without any locking.
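The partitioning function itself is trivial... a sketch (names
invented):

    /*
     * Sketch: choose the cpu that owns a TCP/UDP connection purely
     * from the port pair.  The XOR is symmetric, so inbound and
     * outbound packets of the same connection hash to the same cpu,
     * and all of that connection's state is touched by exactly one
     * cpu with no locking at all.
     */
    static int
    tcp_conn_cpu(int src_port, int dst_port, int ncpus)
    {
        return ((src_port ^ dst_port) % ncpus);
    }

An incoming segment gets hashed the same way on the input path and
handed, via the asynchronous IPI queue, to whatever cpu
tcp_conn_cpu() names, and it is processed there without ever touching
another cpu's tables.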
But, more importantly, the algorithm focuses down into the port selection
code which is one subroutine... and thus the best candidate for
optimization. If a particular partitioning scheme does not work you try
another. You can make the algorithm simple (fixed partitioning), or
complex (partitioning with controlled migration). The flexibility of
that sort of focus is amazing.
The key interest here on my part is to identify the relatively few
subsystems generating the greatest contention between cpus and then
implement a per-cpu cache model for those few subsystems. I am not
talking about implementing such a model for dozens of subsystems,
which would take years. It would only take fixing two or three key
subsystems to reap 90% of the reward and in that light it is a very
achievable goal.
:> VM Objects and other infrastructure can be worked the same way, or worked
:> in a hybrid fashion (e.g. only duplicate VM objects for threaded processes,
:> for example).
:
:Just hope that you don't have a multithreaded application that uses
:all CPUs with a single address space then.
You are seriously overestimating the cost of the duplication,
especially if it's localized to the partitioning scheme and on-demand.
If performance is an absolute requirement for a particular set of
processes then throwing a little memory at that set to get rid of
the majority of the locking you would otherwise have to do is a perfectly
acceptable tradeoff. I wager that even with the shared processes
constantly mmap()ing and munmap()ing portions of the shared address
space, I could get rid of 90% of the contention.
The cost of cache mastership changes from the point of view of whole
process datasets is horrendous. Take fork() for example. If you
initially run the forked child on the same cpu as the parent,
cache locality and mutex issues will give you four to five times
the performance for just that first tick compared to putting the
forked process on a different cpu. It actually winds up being cheaper to
timeshare the new process on the original cpu for one tick before
moving it, and by then the new process may have exec()'d (as is
common after a fork). On a 1GHz dual P3 this sort of operation takes
around 150uS with the child going to the same cpu, and around 600uS
if the child starts on a different cpu.
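In scheduler terms the heuristic is almost embarrassingly simple... a
sketch with invented field and helper names:

    /*
     * Sketch of the fork placement heuristic (hypothetical fields).
     * The child starts on the parent's cpu so it can exec() against
     * warm caches; migration is only considered after the first tick.
     */
    void
    sched_fork_place(struct proc *parent, struct proc *child)
    {
        child->p_cpu = parent->p_cpu;       /* share the warm L1/L2 */
        child->p_sticky = 1;                /* don't migrate for one tick */
    }

    void
    sched_tick(struct proc *p)
    {
        if (p->p_sticky > 0)
            p->p_sticky--;                  /* timeshare on the original cpu */
        else if (cpu_is_overloaded(p->p_cpu))
            sched_migrate(p, find_idle_cpu()); /* it has likely exec()'d by now */
    }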
Per-cpu caches operate in the same fashion. Accessing a per-cpu
namecache can wind up being far faster than accessing a shared
namecache, even when you discount mutex overhead, because the data
in each of those per-cpu caches will be tailored to the work being
done on those cpus and you are guaranteed to have no cache
mastership changes, and no unnecessary L1/L2 cache pollution from shared
data being accessed by other cpus. The rewards that can be reaped
are enormous.
:Another related issue would be swapping. Sometimes it's a good idea
:to have an asynchronous swapper on another CPU clean up/age your
:address space while you're working in it (it already costs IPIs
:for the remote TLB flushes, but that's a different issue). For that
:you also need a reasonably cheap way to access the data structures
:of the object that may be also used on the other CPU.
:
:> Well it isn't written yet, but yes, it will just do something like an
:> INT 0x80. You can't really fast-path user<->kernel transitions, you
:
:SYSENTER would be much faster on modern x86.
:
:-Andi
You mean on a P4, which has absolutely abysmal ring change overheads...
easily three or four times the overhead of a P3. SYSENTER would cut
that by roughly half on a P4, from 1300ns to maybe 700ns. SYSENTER
will gain you virtually nothing on a P3, whose ring change overheads are
roughly in the 500ns range already (you might get it down to 400ns).
I'm including other overheads in there but that is roughly what you get
out of it.
I have yet to ever find a case where shaving down syscall overheads
makes any difference in a real application. Besides, in DragonFly
it is going to be irrelevant because with syscall messaging one will
be able to queue up several system calls at once and make just a
single transition into the kernel to run them. So for the one or
two applications that would actually benefit, a mechanism will be
available that returns economies of scale far greater than SYSENTER
would give you.
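None of this is written yet, but the shape of it will be something
along these lines (structure and names are illustrative only):

    /*
     * Sketch of batched syscall messaging.  Userland fills in an
     * array of messages and makes ONE transition into the kernel to
     * run them all, so the ring change cost is amortized across the
     * whole batch instead of being paid per call.
     */
    struct syscall_msg {
        int     sm_num;             /* syscall number */
        long    sm_args[6];         /* arguments */
        long    sm_result;          /* filled in by the kernel */
        int     sm_error;           /* errno, 0 on success */
    };

    int
    syscall_send_batch(struct syscall_msg *msgs, int count)
    {
        /*
         * One INT 0x80 (or SYSENTER, it hardly matters at that point)
         * dispatches 'count' messages and returns with the results
         * and errors filled in.
         */
        return (trap_syscall_batch(msgs, count));   /* hypothetical trap */
    }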
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>