token passing & IPIs

Matthew Dillon dillon at apollo.backplane.com
Thu Jul 17 18:57:37 PDT 2003


:>     The key to using an IPI properly is to use it asynchronously whenever
:>     possible... that is, to not have to wait for a reply to your IPI message.
:
:You will eat a lot of overhead from the interrupt processing
:when receiving one on the target CPU. You can probably make it a fast path
:(handler in pure assembler etc.), but it still will be somewhat costly.
:Just take a look at the pseudo code of "INT" in the Intel manual.
:It's several pages. Processing that just must be slow. And it even
:includes several locked cycles on the bus (IDT/GDT accesses are normally
:locked)

    First of all, the Intel documentation is showing you how interrupts
    are entered for 6 totally different situations.  Only one of those six
    is what actually gets executed... typically "interrupt-to-inner-privilege"
    or "interrupt-to-same-privilege".  The only difference between the two
    has to do with the additional stack information pushed for a ring
    transition.  Nearly all the checks you see written out in their gory
    detail are done in parallel in hardware.  The only thing that actually
    takes time is pushing some data onto the stack, roughly equivalent to
    a procedure call, and the vector fetch, which I am fairly sure does not
    use locked bus cycles (it's not modifying anything, there's no reason to).

    Secondly, you are seriously underestimating the cost of acquiring a 
    mutex.  An uncontested mutex acquisition and release on a 1GHz P3
    costs around 40ns if you assembly optimize it, and well over 300ns if
    a cache mastership change needs to occur.  That is not cheap,
    especially considering that (IPI) interrupt overhead is typically only
    around 1uS.  The threading model saves a number of mutex operations
    right off the bat, and a properly implemented token model can potentially
    save a lot more.  For example, take a pipe.  Using a token instead of a
    mutex to control a pipe can result in completely uncontested access on
    IA32 if the current token holder releases the token to the cpu on the other side
    of the pipe.  That is unbeatable.
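
    To make that comparison concrete, here is a minimal sketch of the idea
    (hypothetical names, not the actual LWKT interfaces): a token is owned
    by a cpu rather than by a thread, so re-acquiring it on the owning cpu
    costs only a couple of ordinary, unlocked memory operations.

        /*
         * Sketch only.  If the current cpu already owns the token the
         * acquisition is a plain compare: no locked bus cycles and no
         * cache mastership change.
         */
        struct token {
                int     t_cpu;          /* cpu currently owning the token */
        };

        extern int      mycpuid(void);                /* current cpu number */
        extern void     token_steal(struct token *);  /* slow path: async IPI
                                                       * to the owning cpu */

        static void
        token_acquire(struct token *tok)
        {
                if (tok->t_cpu == mycpuid())
                        return;         /* fast path: purely local */
                token_steal(tok);       /* expensive path, rare by design */
        }

    In the pipe case above, if the token ends up on whichever cpu is about
    to touch the pipe, every access takes the fast path.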

    With a mutex you are at a minimum doing FOUR mutex operations.  cpu 1
    acquires, then releases, cpu 2 acquires, then releases, at a cost of
    around (300+40)*2 = 680ns on my dual cpu test box.  That same box can
    run a full duplex pipe test at 16.8uS per transaction (with the MP lock
    no less), so the mutex overhead *JUST* for access to the pipe structure
    is going to eat 4% of the performance.  If you add other mutex overheads,
    such as on the file descriptor array and so forth, it can cost a lot more.
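
    The uncontested figure is easy to sanity-check from userland; a minimal
    sketch (GCC atomic builtins and RDTSC, so it reports cycles rather than
    nanoseconds, and it is a rough measurement, not a benchmark):

        /* Sketch: time an uncontested spinlock acquire/release pair. */
        #include <stdio.h>
        #include <stdint.h>

        static volatile int lock;

        static inline uint64_t
        rdtsc(void)
        {
                uint32_t lo, hi;

                __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
                return (((uint64_t)hi << 32) | lo);
        }

        int
        main(void)
        {
                uint64_t t0, t1;
                int i;

                t0 = rdtsc();
                for (i = 0; i < 1000000; ++i) {
                        while (__sync_lock_test_and_set(&lock, 1))
                                ;                       /* locked xchg */
                        __sync_lock_release(&lock);     /* release store */
                }
                t1 = rdtsc();
                printf("%llu cycles per uncontested acquire/release\n",
                    (unsigned long long)((t1 - t0) / 1000000));
                return (0);
        }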

    The premise of the token methodology is that one can completely avoid
    all locked bus cycles in the best case, and avoid most of the 'expensive'
    cases by design, and that the remaining expensive cases wind up costing
    less than all the mutex equivalents you would otherwise have had to do
    between expensive cases.  And if it turns out not to work out that way
    we can always treat certain highly contended tokens like mutexes where
    necessary, since a mutex is actually a degenerate case of a token (in
    other words, a token is a far more powerful vehicle than a mutex, but if
    you really need to you can operate a token like a mutex).
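
    In terms of the token sketch earlier, the degenerate (mutex-like) use is
    simply a matter of handing the token back after every access
    (token_release() is as hypothetical as the rest):

        token_acquire(tok);
        /* ... touch the contended structure ... */
        token_release(tok);     /* hypothetical counterpart: giving the token
                                 * back immediately trades away the free
                                 * re-acquisition on this cpu for bounded
                                 * waiting on the other cpus, i.e. ordinary
                                 * mutex semantics */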

:But they do not present a single image either.
:
:If they have the same / every time you have a path name that starts
:with / you have to read lock the root inode and the cache where it is
:stored somehow. The same applies to other often accessed file objects.
:
:Unix unfortunately has some shared data by design that cannot be easily split.
:
:You could avoid the problem of shared root by using plan9 like name spaces,
:but that would be quite a radical design change and may not be very nice
:for the user.


    No, this is simply not true.  This has nothing whatsoever to do with
    'UNIX' per se, it is simply the caching algorithms you choose to use.
    There is nothing preventing us from duplicating the namei cache for the
    top two levels of the filesystem, which are virtually never modified.
    In fact, there is nothing preventing us from duplicating portions of
    the namei cache ON-DEMAND (ad-hoc) on a per-cpu basis.  What we are
    doing is implementing the equivalent of a MESI cache model locally 
    between cpus for the purposes of managing certain memory based caches,
    like the path cache.  The cost is that invalidating an entry requires an
    expensive interaction with all the cpus on the system that might have
    copied the entry, but the advantage of the local cache is virtually zero
    overhead and no locking operations at all.  This would be one of the
    easiest heuristics to write, too.
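
    A rough sketch of what the lookup side could look like (hypothetical
    structures and helpers, not the real namei cache code):

        /*
         * Sketch only.  Each cpu keeps a private name->vnode hash.
         * Lookups touch only local memory: no locks, no locked bus
         * cycles, no shared cache lines.  The rare invalidation
         * broadcasts an IPI so every cpu drops its copy -- the software
         * analogue of a MESI invalidate.
         */
        #include <string.h>

        #define NC_HASH         256
        #define MAXCPU          8       /* assumption for the sketch */

        struct vnode;                   /* opaque here */

        struct pcpu_ncentry {
                char            nc_name[64];
                struct vnode    *nc_vp;         /* NULL == empty/invalid */
        };

        static struct pcpu_ncentry pcpu_nccache[MAXCPU][NC_HASH];

        extern int      mycpuid(void);                  /* current cpu */
        extern unsigned nc_hash(const char *);          /* name hash */
        extern void     ipi_broadcast(void (*)(void *), void *);

        static struct vnode *
        pcpu_nclookup(const char *name)
        {
                struct pcpu_ncentry *e;

                e = &pcpu_nccache[mycpuid()][nc_hash(name) % NC_HASH];
                if (e->nc_vp != NULL && strcmp(e->nc_name, name) == 0)
                        return (e->nc_vp);      /* local hit, zero locking */
                return (NULL);                  /* miss: use the shared cache */
        }

        static void
        pcpu_ncinval_one(void *arg)
        {
                char *name = arg;

                pcpu_nccache[mycpuid()][nc_hash(name) % NC_HASH].nc_vp = NULL;
        }

        /* The expensive, rare case: a rename/unlink invalidates everywhere. */
        static void
        pcpu_ncinval(char *name)
        {
                ipi_broadcast(pcpu_ncinval_one, name);
        }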

:As long as you have a single name space you need some amount of
:synchronization because there is single shared data.

    There is not necessarily a single copy of the shared data.  You
    can't avoid synchronization when modifying a potentially duplicated
    piece of shareable data, but you can certainly avoid synchronization
    when accessing it.

:>     This implies that it is possible to implement caching algorithms which
:>     are far more aware of an MP environment and make better use of per-cpu
:>     caches (which do not require locks), possibly even to the point of
:>     duplicating meta-data in each cpu's cache.
:
:You are probably referring to McVoy's cc/cluster concept here. But even
:that requires some amount of synchronization, e.g. for the single
:file name space.
    
    I have no idea what McVoy's cc/cluster concept is.  The only concept
    I am talking about is a standard MESI cache algorithm applied to software
    cached data rather than hardware.  Maybe that is the model he is using
    too.

:>     on an 8cpu system with 8G of ram is not a big deal, and you suddenly have
:>     an entirely new situation where now route table lookups can be performed
:>     without any locking whatsoever at the cost of making route table updates
:>     (which have to be sent to all cpus) more expensive.  But if you are
:
:Most modern TCPs allocate new routes on the fly to save path MTU discovery
:data. You don't want to do that duplication for each new connection.
:In theory you could save that information per CPU, but it may not
:be a very good trade off because sending packets with the wrong MTU
:can be very costly in performance.

    I don't see the problem.  The only thing one manages here is memory.
    If one requires MP performance and one has enough memory, dedicating
    a few megabytes per cpu is not a big deal and a few megabytes can
    cache a huge amount of data... you could probably avoid 95% of the
    locking operations you would otherwise have to do on a global shared
    cache.
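
    The same pattern applied to path MTU data, again as a sketch with made-up
    names: each cpu keeps its own small destination->MTU hash, the hot lookup
    is lock-free, and only genuine route changes have to be pushed out to
    every cpu.

        /* Sketch: per-cpu path-MTU cache keyed on destination address.
         * A miss (returned as 0) falls back to the global, locked table. */
        #include <stdint.h>

        #define PMTU_HASH       1024
        #define MAXCPU          8       /* assumption for the sketch */

        struct pmtu_ent {
                uint32_t        dst;    /* IPv4 destination */
                uint16_t        mtu;    /* 0 == no cached entry */
        };

        static struct pmtu_ent pmtu_cache[MAXCPU][PMTU_HASH];

        extern int      mycpuid(void);

        static uint16_t
        pmtu_lookup(uint32_t dst)
        {
                struct pmtu_ent *e = &pmtu_cache[mycpuid()][dst % PMTU_HASH];

                return ((e->dst == dst) ? e->mtu : 0);
        }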

:Another  example would be the local ports hash table of your UDP and TCP.
:You could try to split it per CPU, but it would limit the user
:unduly (The 16bit TCP/UDP port space is a very limited resource, limiting
:it more than 16bits would be not very nice).

    Again, I don't see a problem here.  In a TCP connection one port is
    typically fixed and one is typically mobile.  This means that you can
    simply partition the connection set by (srcPort ^ dstPort) % N where N
    is the number of cpus on which you are running your TCP stack.   This simple
    partitioning scheme would handle 90% of the problem space statistically,
    without any effort and without any locking.
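
    As a concrete illustration (sketch only):

        #include <stdint.h>

        /* Sketch: statically partition connections across the cpus running
         * TCP stacks.  XOR is symmetric, so inbound and outbound packets of
         * the same connection always hash to the same cpu. */
        static inline int
        tcp_cpu_for_conn(uint16_t lport, uint16_t fport, int ncpus)
        {
                return ((lport ^ fport) % ncpus);
        }

    Every packet of a given connection is then handled by one cpu's stack
    and never touches another cpu's tables.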

    But, more importantly, the algorithm focuses down into the port selection
    code which is one subroutine... and thus the best candidate for 
    optimization.  If a particular partitioning scheme does not work you try
    another.  You can make the algorithm simple (fixed partitioning), or
    complex (partitioning with controlled migration).  The flexibility of
    that sort of focus is amazing.

    The key interest here on my part is to identify the relatively few 
    subsystems generating the greatest contention between cpus and then
    implement a per-cpu cache model for those few subsystems.  I am not
    talking about implementing such a model for dozens of subsystems,
    which would take years.  It would only take fixing two or three key
    subsystems to reap 90% of the reward and in that light it is a very
    achievable goal.

:>     VM Objects and other infrastructure can be worked the same way, or worked
:>     in a hybrid fashion (e.g. only duplicate VM objects for threaded processes,
:>     for example).
:
:Just hope that you don't have an multithreaded application that uses
:all CPUs with a single address space then.
 
     You are seriously overestimating the cost of the duplication,
     especially if it's localized to the partitioning scheme and done on-demand.
     If performance is an absolute requirement for a particular set of 
     processes then throwing a little memory at that set to get rid of
     the majority of the locking you would otherwise have to do is a perfectly
     acceptable tradeoff.  I wager that even with the shared processes
     constantly mmap()ing and munmap()ing portions of the shared address
     space, I could get rid of 90% of the contention.

     The cost of cache mastership changes from the point of view of whole
     process datasets is horrendous.   Take fork() for example.  If you
     initially run the forked child on the same cpu as the parent,
     cache locality and mutex issues will give you four to five times
     the performance for just that first tick compared to putting the forked
     process on a different cpu.  It actually winds up being cheaper to
     timeshare the new process on the original cpu for one tick before
     moving it, and by then the new process may have exec()'d (as is 
     common after a fork).  On a 1GHz dual P3 this sort of operation takes
     around 150uS with the child going to the same cpu, and around 600uS 
     if the child starts on a different cpu.
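
     The heuristic that falls out of this is tiny; a sketch with hypothetical
     proc fields (not the actual scheduler code):

        /*
         * Sketch: keep a forked child on the parent's cpu for its first
         * scheduler tick.  If it exec()s (the common case) it never pays
         * the cache mastership penalty of an immediate migration.
         */
        struct proc {                   /* only what the sketch needs */
                int     p_cpuid;        /* hypothetical: cpu the proc runs on */
                int     p_pin_ticks;    /* hypothetical: ticks pinned there */
        };

        extern int      mycpuid(void);

        static void
        fork_place_child(struct proc *child)
        {
                child->p_cpuid = mycpuid();     /* same cpu as the parent */
                child->p_pin_ticks = 1;         /* let the balancer migrate it
                                                 * only after one tick */
        }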

     Per-cpu caches operate in the same fashion.  Accessing a per-cpu
     namecache can wind up being far faster than accessing a shared
     namecache, even when you discount mutex overhead, because the data
     in each of those per-cpu caches will be tailored to the work being
     done on those cpus and you are guaranteed to have no cache
     mastership changes, and no unnecessary L1/L2 cache pollution from shared
     data being accessed by other cpus.  The rewards that can be reaped
     are enormous.

:Another related issue would be swapping. Sometimes it's a good idea
:to have an asynchronous swapper on another CPU clean up/age your
:address space while you're working in it (it already costs IPIs
:for the remote TLB flushes, but that's a different issue). For that
:you also need a reasonably cheap way to access the data structures
:of the object that may be also used on the other CPU.  
:
:>     Well it isn't written yet, but yes, it will just do something like an
:>     INT 0x80.  You can't really fast-path user<->kernel transitions, you
:
:SYSENTER would be much faster on modern x86.
:
:-Andi

    You mean on a P4, which has absolutely abysmal ring change overheads...
    easily three or four times the overhead of a P3.  SYSENTER would cut
    that by roughly half on a P4, from 1300ns to maybe 700ns.  SYSENTER
    will do virtually nothing on a P3, whose ring change overheads are
    roughly in the 500ns range already (you might get it down to 400ns).
    I'm including other overheads in there but that is roughly what you get
    out of it.  

    I have yet to ever find a case where shaving down syscall overheads
    makes any difference in a real application.  Besides, in DragonFly
    it is going to be irrelevant because with syscall messaging one will
    be able to queue up several system calls at once and make just a 
    single transition into the kernel to run them.  So for the one or 
    two applications that would actually benefit, a mechanism will be
    available that returns economies of scale far greater than SYSENTER
    would give you.
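
    For flavour, an entirely hypothetical sketch of what that could look like
    from userland (made-up names and layout; the real interface was not
    finalized at the time):

        /* Sketch: queue several system calls and enter the kernel once. */
        struct syscall_msg {
                int     sm_syscall;     /* syscall number */
                long    sm_args[6];     /* arguments */
                long    sm_result;      /* filled in by the kernel */
                int     sm_error;
        };

        /* Hypothetical entry point: submit 'count' messages with a single
         * user->kernel transition instead of 'count' separate ones. */
        int     syscall_sendmsgs(struct syscall_msg *msgs, int count);

    An application that would otherwise make three back-to-back system calls
    fills in three messages and pays the ring transition once.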

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>




