HEADS UP - major structure size changes in HEAD
Matthew Dillon
dillon at apollo.backplane.com
Wed Jun 9 19:20:10 PDT 2010
:As a more realistic example, if another CPU owns the VM token and you
:have M + 1 runnable threads, M of which need the VM token and 1 which
:doesn't, on average (assuming random order on the run queue) you'll try
:to get the VM token M/2 times before finding the one thread that doesn't
:need it. The issue is that you don't get a notification when that token
:is released (and as you said IPIs have big latencies, so that's not
:really doable). And of course if M threads are waiting for the token,
:you really do want to try hard to get it; if you just kept track of busy
:tokens as you went through the runnable threads and avoided trying to
:acquire them a second time in the for_each_runnable_thread loop you'd
:incur much higher latencies. You could perhaps get ultra-clever in which
:token you try to take next (though I can't think of an obviously
:reasonable approach ATM), but this means more effort before each context
:switch.
Well, my point here is that if the VM token is held for only a short
period of time on the other cpu, then while the first cpu is running
through the M threads the VM token is likely to be released during that
period, and the first cpu will be able to schedule something nearly
instantly when that happens, without necessarily going through the
entire list of runnable threads.
Now let's say the VM token is held for a longer period of time. Then
the first cpu runs through M threads it is unable to schedule before it
finds the thread that it CAN schedule. That thread runs for a certain
period of time before returning to the scheduler, by which time the VM
token has likely been released.
Also the threads it is skipping are threads it wanted to run that
would have run if only some other cpu had not been holding the
required token. If this full scan only occurs a limited number of
times, the overhead will be significantly lower than with a
sleep/wakeup model.
If the tsleep/wakeup takes 2uS per thread going to sleep, then we can
literally waste 2uS per thread in the scheduler doing the polling before
we lose to the sleep/wakeup model. 2uS per thread is a very long
time. Let's say the overhead per thread for the scan is 100nS (which is
being generous). The full scan by the first cpu could then occur 20
times (i.e. 20 * M threads) before it matches the overhead of the
sleep/wakeup model.
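Spelling out that break-even arithmetic with the numbers above:

    2uS per tsleep/wakeup  /  100nS per thread per scan  =  20 full scans

i.e. the polling scheduler can rescan the same set of token-blocked
threads roughly 20 times before its total overhead reaches what one
sleep/wakeup round trip per thread would have cost.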
:[...]
:> The spin (spinlocks) or the hybrid spin-in-scheduler (MP lock, tokens)
:> approach gives us a solution which has lower overheads for locks which
:> are held for very short periods of time.
:
:The subsystem tokens will be coarse-grained at first so I'm not sure why
:e.g. the VM tokens can't be held for a considerable (if bounded) amount
:of time. This is not really a problem; our SMP scalability will improve
:considerably. My concern is the next step and whether lwkt_switch() can
:(or will be able to) efficiently juggle a large number of tokens. I'm
:not saying it *will* be a problem. It's just that, to me at least, it is
:not entirely obvious how the current algorithm would perform in that
:scenario.
:
:I'm all for using tokens to lock down subsystems however.
:
:Aggelos
For the most part these tokens are not going to be held for more than
1uS or so, and often quite a bit less than that. This is because
most kernel operations only take a long time when they actually
block, and of course the token will be released if the thread
blocks.
Take a vm_fault for example. The fault occurs, does perhaps 200ns of
work with the token held, and then blocks on I/O. Later on the I/O
completes, the thread wakes up and reacquires the token, then spends
another 200ns with the token held before returning from the fault.
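Roughly, that timeline looks like the sketch below. The token functions
are toy stand-ins named after the real DragonFly interfaces (signatures
simplified), and start_io_and_block() is invented for the example; this
is an illustration of the behavior, not the real vm_fault code.

    /* Simplified sketch of the vm_fault timeline described above. */
    struct lwkt_token { int dummy; };
    static struct lwkt_token vm_token;

    static void lwkt_gettoken(struct lwkt_token *tok) { (void)tok; }
    static void lwkt_reltoken(struct lwkt_token *tok) { (void)tok; }
    static void start_io_and_block(void) { }

    static void
    vm_fault_sketch(void)
    {
        lwkt_gettoken(&vm_token);
        /* ... roughly 200ns of fault handling with the token held ... */

        /*
         * Block waiting for I/O.  While the thread is asleep it does not
         * hold the token; the token is reacquired when the thread wakes
         * up, so the long I/O wait does not add to the token hold time.
         */
        start_io_and_block();

        /* ... another ~200ns of work with the token held ... */
        lwkt_reltoken(&vm_token);
    }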
-Matt
Matthew Dillon
<dillon at backplane.com>