HEADS UP - major structure size changes in HEAD

Wed Jun 9 11:02:10 PDT 2010

:I fully agree that tokens are a more convenient API than, say, lockmgr 
:locks, but AFAIK nobody has investigated their behavior under heavy 
:contention (I suspect you have Matt, so please share any information you 
:might have collected).

    Tokens and the big-giant-lock behave the same way when under contention.
    Both basically call lwkt_switch().  The LWKT scheduler then essentially
    spins looking for schedulable threads until it finds one for which it
    can restore the state of both the MP lock and the token(s).

:Right now, the lwkt_switch() routine is responsible for getting all 
:tokens that a thread needs to run. If it can't get them (because some 
:token is owned by a thread running on a different cpu), it moves on to 
:the next runnable thread.
:
:However, lwkt_getalltokens() tries to get the tokens in the same order 
:that they were acquired by the thread, which means that if you lose a 
:race to another thread that needs your Nth token, you just wasted time 
:on taking (N - 1) tokens, possibly obstructing other threads that could 
:use those tokens to move forward. I admit that in practice most threads 
:will take the same tokens in the same order, so I don't expect that to 
:be a big issue. Some concrete data would be nice though :)

    Well, theoretically getting the tokens in reverse order would more
    efficiently locate the contending token, but only if the token is
    being held for a long period of time by the other cpu.  If it is only
    being held for a short period of time (which is the normal case) then
    actually having the other cpu get its tokens in forward order can be
    more efficient because it adds a slight delay which allows the current
    cpu to finish its short-run code and release the token before the
    contending cpus try to acquire it again.

    Another problem with getting the tokens in reverse order is the
    potential for temporary livelocks when several cpus are trying to
    get the same set of tokens and some of them are doing it from mainline
    code in forward order while others have already contended and are doing
    it (in reverse order) from the scheduler.  This can cause them to meet
    in the middle and cause ALL the cpus to contend (nobody gets the full set
    of tokens).  This is temporary because eventually all the cpus hitting
    this situation wind up in the scheduler and try to get the tokens in
    the same order.  Plus other timing considerations prevent livelocks
    (though to be fair maybe I should add a contention counter to lwkt_switch
    and start adding some random delays if it takes too long to find a
    runnable thread).

    If you think about it the meet-in-the-middle collision can create a
    horrible cascading effect on systems with a lot of cpus, where everyone
    contends until all cpus in question wind up in the scheduler.

:Another thing that makes me sceptical about token scalability is that, 
:AFAICT, the scheduler needs to try to acquire them *every time*. So if 
:you have M threads on a cpu all waiting for, say, the VM token, then 
:while it is held on another cpu you'll try (and fail) to get that token 
:M times every time you lwkt_switch(). This is time you could have spent 
:running a task that potentially *could* get some work done (nevermind 
:the cmp. That is in contrast to sleep locks where you only become 
:runnable when a lock you wanted is released.
:
:I'm sure you've thought about all this; I suppose you're seeing tokens 
:as a good trade off between simplicity and scalability?
:
:Thanks,
:Aggelos

    Well, yes and no.  Yes it is spinning, but no it doesn't really reduce
    efficiency all that much because if no other kernel threads are runnable
    that cpu has nothing else to do anyway.

    If another kernel thread is runnable it winds up getting scheduled
    within around ~100ns.  If only the one contending kernel thread
    is runnable it spins in the scheduler and eventually resumes the
    original thread without actually having to do a switch.  Probably
    ~50ns or less.

    If many kernel threads are contending on the same token the requesting
    cpu has nothing to do until the holding cpu releases the token,
    no matter how many threads are stuck on the requesting cpu.  When the
    token is released by the holding cpu then the requesting cpu will
    instantly be able to schedule whatever thread (of the many contending
    threads) it is currently polling.

    So the scheduling overhead winds up being essentially O(1).

    There is one other thing I should note here about the LWKT scheduler.
    Neither the MP lock or the tokens do any sort of IPI signaling when
    a release occurs.   They act essentially like spin locks except for
    fact that where they spin is in the scheduler and not in the code
    trying to acquire the token.  IPI signaling would add a huge amount of
    extra overhead for locks which are typically held for only a few hundred
    nanoseconds (IPI signaling latency is on the order of 1uS so the
    notification winds up being wasted most of the time).

    This creates an issue with the scheduling of user threads when kernel
    threads are also present but contending on the MP lock or a token.
    If we schedule the user thread under these conditions the kernel threads
    wind up with severe latencies which destroys interactive responsiveness.

    Thus the LWKT scheduler currently will refuse to schedule a user thread
    if a runnable kernel thread is present which is contending on the
    MP lock or a token.

    The spin (spinlocks) or the hybrid spin-in-scheduler (MP lock, tokens)
    approach gives us a solution which has lower overheads for locks which
    are held for very short periods of time.

    The lockmgr/mutex design gives us a solution which has lower overheads
    for locks which are held for longer periods of time and/or must be held
    across blocking conditions.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>