HEADS UP - major structure size changes in HEAD

Aggelos Economopoulos aoiko at cc.ece.ntua.gr
Wed Jun 9 16:22:38 PDT 2010


On 09/06/2010 08:57 PM, Matthew Dillon wrote:
:I fully agree that tokens are a more convenient API than, say, lockmgr
:locks, but AFAIK nobody has investigated their behavior under heavy
:contention (I suspect you have Matt, so please share any information you
:might have collected).
     Tokens and the big-giant-lock behave the same way when under contention.
     Both basically call lwkt_switch().  The LWKT scheduler then essentially
     spins looking for schedulable threads until it finds one for which it
     can restore the state of both the MP lock and the token(s).
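To make the spin-in-scheduler behavior concrete, here is a toy model of it (all names here — thread_t, owner[], try_acquire_tokens(), pick_runnable() — are invented for illustration and are not the real DragonFly kernel API):

```c
#include <stdbool.h>
#include <stddef.h>

#define NTOKENS 8
#define TOKEN_FREE (-1)

/* Hypothetical model of LWKT token polling; not real kernel code. */
typedef struct thread {
    struct thread *next;    /* next thread on this cpu's run queue */
    int ntokens;            /* how many tokens this thread needs */
    const int *tokens;      /* ids of those tokens */
} thread_t;

/* owner[t] is the cpu currently holding token t, or TOKEN_FREE */
static int owner[NTOKENS] = { TOKEN_FREE, TOKEN_FREE, TOKEN_FREE, TOKEN_FREE,
                              TOKEN_FREE, TOKEN_FREE, TOKEN_FREE, TOKEN_FREE };

/* All-or-nothing: take every token the thread needs, or none of them. */
static bool try_acquire_tokens(const thread_t *td, int cpu)
{
    for (int i = 0; i < td->ntokens; i++) {
        int t = td->tokens[i];
        if (owner[t] != TOKEN_FREE && owner[t] != cpu)
            return false;               /* contended on another cpu */
    }
    for (int i = 0; i < td->ntokens; i++)
        owner[td->tokens[i]] = cpu;     /* restore the whole token set */
    return true;
}

/* lwkt_switch()-style scan: walk the run queue looking for a thread
 * whose full token set can be restored; the caller spins on NULL. */
static thread_t *pick_runnable(thread_t *runq, int cpu)
{
    for (thread_t *td = runq; td != NULL; td = td->next)
        if (try_acquire_tokens(td, cpu))
            return td;
    return NULL;
}
```

The key point the model captures is that a thread blocked on a token stays on the run queue and is simply skipped each pass, rather than sleeping until a wakeup.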
:Right now, the lwkt_switch() routine is responsible for getting all
:tokens that a thread needs to run. If it can't get them (because some
:token is owned by a thread running on a different cpu), it moves on to
:the next runnable thread.
:
:However, lwkt_getalltokens() tries to get the tokens in the same order
:that they were acquired by the thread, which means that if you lose a
:race to another thread that needs your Nth token, you just wasted time
:on taking (N - 1) tokens, possibly obstructing other threads that could
:use those tokens to move forward. I admit that in practice most threads
:will take the same tokens in the same order, so I don't expect that to
:be a big issue. Some concrete data would be nice though :)
[description of issues with taking the tokens in reverse order]
>
     If you think about it the meet-in-the-middle collision can create a
     horrible cascading effect on systems with a lot of cpus, where everyone
     contends until all cpus in question wind up in the scheduler.
Indeed, which is why I specifically wrote that, since I expect most 
paths that contend for a set of tokens to take them in the same order, 
acquiring the tokens in forward order is the sane approach. There will 
only be an issue if that assumption is false, i.e. if contending code 
paths do not take a set of tokens in the same order (in which case the 
meet-in-the-middle collision may be common). I hope the assumption will 
mostly hold in practice, but if it doesn't we'll definitely notice :)
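One conventional way to make the same-order assumption hold by construction is to sort multi-token acquisitions by a canonical key such as address. A sketch (token_t, token_get() and get_token_pair() are hypothetical stand-ins, not the real lwkt API):

```c
#include <stddef.h>

/* Hypothetical token type and acquire function, standing in for the
 * real lwkt token API purely to illustrate canonical ordering. */
typedef struct token { int id; } token_t;

static int acquire_log[4];   /* records the order tokens were taken in */
static int acquire_count;

static void token_get(token_t *tok)
{
    acquire_log[acquire_count++] = tok->id;   /* pretend to acquire */
}

/* Take two tokens in a canonical order (lowest address first), so two
 * paths handed the same pair in opposite orders still agree on the
 * acquisition order and cannot meet in the middle. */
static void get_token_pair(token_t *a, token_t *b)
{
    if (a > b) {
        token_t *tmp = a;
        a = b;
        b = tmp;
    }
    token_get(a);
    token_get(b);
}
```

With this discipline the forward-order retry in lwkt_getalltokens() matches the order every contending path actually uses.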

:Another thing that makes me sceptical about token scalability is that,
:AFAICT, the scheduler needs to try to acquire them *every time*. So if
:you have M threads on a cpu all waiting for, say, the VM token, then
:while it is held on another cpu you'll try (and fail) to get that token
:M times every time you lwkt_switch(). This is time you could have spent
:running a task that potentially *could* get some work done (never mind
:the cmp itself). That is in contrast to sleep locks, where you only
:become runnable when a lock you wanted is released.
:
:I'm sure you've thought about all this; I suppose you're seeing tokens
:as a good trade off between simplicity and scalability?
[...]
     If many kernel threads are contending on the same token the requesting
     cpu has nothing to do until the holding cpu releases the token,
     no matter how many threads are stuck on the requesting cpu.  When the
     token is released by the holding cpu then the requesting cpu will
     instantly be able to schedule whatever thread (of the many contending
     threads) it is currently polling.
Err, I'm not seeing this. AFAICT, if there are M threads that need token 
X to run, lwkt_switch() will have to loop through all M threads. So the 
cpu has some work to do.

As a more realistic example, if another cpu owns the VM token and you 
have M + 1 runnable threads, M of which need the VM token and one which 
doesn't, then on average (assuming random order on the run queue) you'll 
try to get the VM token M/2 times before finding the one thread that 
doesn't need it. The issue is that you don't get a notification when 
that token is released (and, as you said, IPIs have large latencies, so 
that's not really doable). And of course if M threads are waiting for 
the token, you really do want to try hard to get it; if you just kept 
track of busy tokens as you went through the runnable threads and 
avoided trying to acquire them a second time in the 
for_each_runnable_thread loop, you'd incur much higher latencies. You 
could perhaps get ultra-clever about which token to try to take next 
(though I can't think of an obviously reasonable approach ATM), but that 
means more work before each context switch.
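The M/2 figure can be sanity-checked with a toy calculation (hypothetical helper, assuming the thread that doesn't need the token is equally likely to sit at any of the M + 1 queue positions):

```c
/* Expected number of failed token acquisitions before the scan reaches
 * the single thread that doesn't need the contended token, averaging
 * over its M + 1 equally likely positions in the run queue. */
static double expected_failed_tries(int m)
{
    long total = 0;
    for (int pos = 0; pos <= m; pos++)
        total += pos;               /* pos token-needing threads precede it */
    return (double)total / (m + 1); /* sum of 0..M over M+1 cases = M/2 */
}
```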

[...]
     The spin (spinlocks) or the hybrid spin-in-scheduler (MP lock, tokens)
     approach gives us a solution which has lower overheads for locks which
     are held for very short periods of time.
The subsystem tokens will be coarse-grained at first, so I see no reason 
why e.g. the VM tokens couldn't be held for a considerable (if bounded) 
amount of time. That is not really a problem in itself; our SMP 
scalability will improve considerably. My concern is the next step and 
whether lwkt_switch() can (or will be able to) efficiently juggle a 
large number of tokens. I'm not saying it *will* be a problem; it's just 
that, to me at least, it is not entirely obvious how the current 
algorithm would perform in that scenario.
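For what it's worth, the busy-token bookkeeping I dismissed above could be sketched like this (a toy bitmap model with invented names — waiter_t, held_elsewhere, pick_skip_busy() — not real kernel code); the latency cost is visible in the structure: a token released mid-scan won't be retried until the next pass.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: each thread's needed tokens are a 64-bit mask. */
typedef struct waiter {
    struct waiter *next;
    uint64_t need;          /* bitmask of tokens this thread needs */
} waiter_t;

static uint64_t held_elsewhere;     /* tokens owned by other cpus */

static bool try_acquire(uint64_t need, uint64_t *busy)
{
    uint64_t contended = need & held_elsewhere;
    if (contended) {
        *busy |= contended;         /* remember what failed this pass */
        return false;
    }
    return true;
}

/* One scheduler pass that skips threads needing any token already
 * seen contended, instead of re-trying the acquisition each time. */
static waiter_t *pick_skip_busy(waiter_t *runq)
{
    uint64_t busy = 0;
    for (waiter_t *td = runq; td != NULL; td = td->next) {
        if (td->need & busy)
            continue;               /* known-contended: don't retry */
        if (try_acquire(td->need, &busy))
            return td;
    }
    return NULL;
}
```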

I'm all for using tokens to lock down subsystems however.

Aggelos

     The lockmgr/mutex design gives us a solution which has lower overheads
     for locks which are held for longer periods of time and/or must be held
     across blocking conditions.
