HEADS UP - major structure size changes in HEAD

Wed Jun 9 19:20:10 PDT 2010

:As a more realistic example, if another CPU owns the VM token and you 
:have M + 1 runnable threads, M of which need the VM token and 1 which 
:doesn't, on average (assuming random order on the run queue) you'll try 
:to get the VM token M/2 times before finding the one thread that doesn't 
:need it. The issue is that you don't get a notification when that token 
:is released (and as you said IPIs have big latencies, so that's not 
:really doable). And of course if M threads are waiting for the token, 
:you really do want to try hard to get it; if you just kept track of busy 
:tokens as you went through the runnable threads and avoided trying to 
:acquire them a second time in the for_each_runnable_thread loop you'd 
:incur much higher latencies. You could perhaps get ultra-clever in which 
:token you try to take next (though I can't think of an obviously 
:reasonable approach ATM), but this means more effort before each context 
:switch.
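
    The M/2 figure quoted above can be checked with a small worked
    calculation.  This is only a sketch: expected_failed_attempts() is a
    made-up helper, and it assumes the one thread that does not need the
    token is equally likely to sit at any of the M + 1 run queue positions.

```c
#include <assert.h>

/*
 * Expected number of failed token attempts before reaching the one
 * thread that does not need the token.  At run queue position k that
 * thread is preceded by k token-needing threads, so we average k over
 * the M + 1 equally likely positions.  Hypothetical helper, for
 * illustration only.
 */
static double
expected_failed_attempts(int m)
{
	long sum = 0;
	int k;

	assert(m >= 0);
	for (k = 0; k <= m; k++)
		sum += k;
	return (double)sum / (m + 1);
}
```

    With M = 4 the positions contribute 0 + 1 + 2 + 3 + 4 = 10 failed
    attempts over 5 cases, i.e. 2, which is M/2.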

    Well, my point here is that if the VM token is held for only a short
    period of time on the other cpu, then while the first cpu is running
    through the M threads the VM token is likely to be released during that
    period and the first cpu will be able to schedule something nearly
    instantly when that occurs, without necessarily having to have gone
    through the entire list of runnable threads.
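
    A rough userspace model of that scan, as a sketch only.  The sim_*
    names and structures are invented for illustration and are not the
    real lwkt code; "time" is modeled crudely by letting the other cpu
    release the token after a fixed number of failed polls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real lwkt structures. */
struct sim_token {
	int busy_polls;		/* failed polls left before the other cpu lets go */
};

struct sim_thread {
	struct sim_token *tok;	/* token the thread needs, or NULL */
};

/* Non-blocking attempt; every failed poll lets simulated time pass. */
static int
sim_trytoken(struct sim_token *tok)
{
	if (tok->busy_polls > 0) {
		tok->busy_polls--;
		return 0;
	}
	return 1;		/* simplification: ownership is not recorded */
}

/*
 * Walk the run queue in order and return the index of the first thread
 * that can run, or -1 if none can.  *failed counts the polls that lost.
 */
static int
sim_pick_thread(struct sim_thread *rq, size_t n, int *failed)
{
	size_t i;

	assert(rq != NULL && failed != NULL);
	*failed = 0;
	for (i = 0; i < n; i++) {
		if (rq[i].tok == NULL || sim_trytoken(rq[i].tok))
			return (int)i;
		(*failed)++;
	}
	return -1;
}
```

    If the token comes free after two polls, the scan stops at the third
    thread rather than walking all M of them; if it stays busy, the scan
    falls through to the thread that needs no token.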

    Let's say the VM token is held for a longer period of time, then the
    first cpu runs through and is unable to schedule M threads before it
    finds the thread that it CAN schedule.  That thread runs for a certain
    period of time before returning to the scheduler by which time the VM
    token has likely been released.

    Also the threads it is skipping are threads it wanted to run that
    would have run if only some other cpu had not been holding the
    required token.  If this full scan only occurs a limited number of
    times the overhead will be significantly less versus a sleep/wakeup
    model.

    If the tsleep/wakeup takes 2uS per thread going to sleep then we can
    literally waste 2uS per thread in the scheduler doing the polling before
    we lose to the sleep/wakeup model.  2uS per thread is a very long
    time.  Let's say the overhead per thread for the scan is 100ns (which is
    being generous).  The full scan by the first cpu could then occur 20
    times (i.e. 20 * M threads) before it matches the overhead of the
    sleep/wakeup model.
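
    The break-even count is just a ratio of the two per-thread costs.  A
    minimal sketch, taking the tsleep/wakeup cost as 2uS (2000ns) and the
    per-thread scan cost as 100ns, which is the figure that yields the 20
    scans mentioned above; breakeven_scans() is a made-up name.

```c
#include <assert.h>

/*
 * Number of full run queue scans whose per-thread polling cost adds up
 * to one tsleep/wakeup per thread.  Both costs are in nanoseconds.
 * Hypothetical helper, for illustration only.
 */
static unsigned long
breakeven_scans(unsigned long sleep_wakeup_ns, unsigned long scan_ns)
{
	assert(scan_ns > 0);
	return sleep_wakeup_ns / scan_ns;
}
```

    breakeven_scans(2000, 100) gives 20.  A cheaper scan raises the number
    of scans that can be afforded; a cheaper sleep/wakeup lowers it.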

:[...]
:>      The spin (spinlocks) or the hybrid spin-in-scheduler (MP lock, tokens)
:>      approach gives us a solution which has lower overheads for locks which
:>      are held for very short periods of time.
:
:The subsystem tokens will be coarse-grained at first so I'm not sure why 
:e.g. the VM tokens can't be held for a considerable (if bounded) amount 
:of time. This is not really a problem; our SMP scalability will improve 
:considerably. My concern is the next step and whether lwkt_switch() can 
:(or will be able to) efficiently juggle a large number of tokens. I'm 
:not saying it *will* be a problem. It's just that, to me at least, it is 
:not entirely obvious how the current algorithm would perform in that 
:scenario.
:
:I'm all for using tokens to lock down subsystems however.
:
:Aggelos

    For the most part these tokens are not going to be held for more than
    1uS or so, and often quite a bit less than that.  This is because
    most kernel operations only take long periods of time because they
    actually block, and of course the token will be released if the thread
    blocks.

    Take a vm_fault for example.  The fault occurs, runs with the token
    held for maybe 200ns, and then blocks on I/O, releasing the token.
    Then later on the I/O completes and the thread
    wakes up and reacquires the token, then spends another 200ns with the
    token held before returning from the fault.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>