SPL vs. Critical Section vs. Mutexing for device synchronisation

Fri Jun 3 10:45:57 PDT 2005

:
:Hi Matt, hi all,
:as I said before, we should not start to fall into an adhoc-change mode,
:but think carefully about what we want and need first. I want to
:describe the advantages and problems of the mechanisms used here first
:and what I want to have afterwards.

    There are certain things we HAVE to do, regardless of where we 
    want to end up.

    Both SPLs and critical sections work only on the local cpu, which 
    means that as a general interrupt protection mechanism they only
    really work when the BGL (Big Giant Lock) is being held.  This is
    the case with the current system.

    We obviously want to get rid of the BGL.  At a minimum this means that
    SPLs will no longer function.  Therefore we have to get rid of SPLs.  

    Without the BGL critical sections still serve a purpose... they
    interlock against interrupts on the local cpu.  More specifically,
    they interlock against *IPI* interrupts on the local cpu and thus
    still represent an excellent mechanism for interlocking in subsystems
    which use cpu-local threading (such as our networking subsystem),
    or need to protect cpu-local variables (such as our LWKT threading
    subsystem).
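
    As a rough illustration of why a critical section needs no locked
    instruction, here is a minimal single-cpu simulation.  It is only a
    sketch: struct cpu_state, interrupt_arrives() and run_pending() are
    hypothetical names made up for this example and are not the kernel's
    actual implementation.  All state is private to one cpu, so plain
    loads and stores are sufficient.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cpu state; the names are illustrative only. */
struct cpu_state {
    int      crit_count;  /* nesting depth of critical sections */
    uint32_t pending;     /* interrupts deferred while in a section */
    uint32_t delivered;   /* interrupts whose handler has run */
};

/* Entering only bumps a cpu-local counter: no bus lock is needed. */
void crit_enter(struct cpu_state *cpu)
{
    cpu->crit_count++;
}

/* Run every interrupt that was deferred on this cpu. */
void run_pending(struct cpu_state *cpu)
{
    cpu->delivered |= cpu->pending;
    cpu->pending = 0;
}

/* Leaving the outermost section replays whatever was deferred. */
void crit_exit(struct cpu_state *cpu)
{
    assert(cpu->crit_count > 0);
    if (--cpu->crit_count == 0 && cpu->pending != 0)
        run_pending(cpu);
}

/* Simulated arrival of interrupt 'irq' (0..31) on this cpu. */
void interrupt_arrives(struct cpu_state *cpu, int irq)
{
    if (cpu->crit_count > 0)
        cpu->pending |= UINT32_C(1) << irq;    /* defer it */
    else
        cpu->delivered |= UINT32_C(1) << irq;  /* run it now */
}
```

    Nothing here stops a second cpu from touching the same data, which
    is why the mechanism only covers cpu-local state once the BGL is gone.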

    No matter what we do we have to get rid of SPLs.   We have a 
    chicken-and-egg problem here.  SPLs have to go in order to progress
    towards our MP goals, but we have not yet rewritten the hundreds of
    drivers that depend on SPLs.  Yet they have to go... chicken and
    egg.  We can't do it all in one go so we have to take a baby step and
    the first baby step is to replace SPLs with critical sections.

    This isn't an adhoc change... SPLs have to go, period.  But we can't
    achieve a lockless goal all in one step.  It simply isn't possible.
    The system is too complex.  This change allows us to consolidate the
    code and remove the SPL support from the interrupt paths entirely. 
    Since SPLs won't work in an SMP environment without the Big Giant Lock,
    this represents considerable forward progress.  Once SPLs are gone I
    will be able to remove easily a hundred lines of code or more of 
    hybrid C and assembly. The only assembly left will be the assembly that
    handles critical sections, and since critical sections are necessary 
    even once the BGL is removed the removal of the SPL and CPL checks
    will represent a major cleanup of our low level interrupt and preemption
    code and serious forward progress towards our goals.

:(A) Deferred interrupt processing via critical sections
:The primary advantage of critical sections is the simplicity. They
:are very easy to use and as long as there is no actual interrupt
:they are very cheap too. As a result they work best when used over
:short periods of time. Interrupt processing can be implemented either
:via a FIFO model or via interrupt masks.

    This only works because we hold the BGL.  Without the BGL critical
    sections cannot be used to protect against interrupts.  Therefore,
    while the interrupt subsystem is able to depend on critical sections
    now, IT WON'T BE ABLE TO IN THE FUTURE.  A new MP-SAFE API is needed.

    At the moment I have created a mutex-like (locked bus cycle) 
    serialization API that is MP safe.  The abstraction is general enough
    that we should be able to replace the internals with something 
    better (aka lockless) in the future.  But right now it's the only
    thing we have which is inter-cpu safe.
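
    To make "mutex-like (locked bus cycle)" concrete, here is a minimal
    sketch of such a serializer written with C11 atomics.  The
    serializer_* names are hypothetical and this is not the API that was
    committed; it only shows the general shape.  The atomic exchange is
    the operation that typically compiles to a locked bus cycle on x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical serializer: one flag, taken with an atomic exchange. */
struct serializer {
    atomic_bool held;
};

void serializer_init(struct serializer *s)
{
    atomic_init(&s->held, false);
}

/* Try to take the serializer; returns true on success. */
bool serializer_try(struct serializer *s)
{
    /* The exchange returns the previous value: false means we won. */
    return !atomic_exchange_explicit(&s->held, true,
                                     memory_order_acquire);
}

/* Take the serializer, spinning until it becomes available. */
void serializer_enter(struct serializer *s)
{
    while (!serializer_try(s))
        ;  /* a real kernel would block or yield instead of spinning */
}

/* Release the serializer so another cpu can enter. */
void serializer_exit(struct serializer *s)
{
    atomic_store_explicit(&s->held, false, memory_order_release);
}
```

    Because callers only see enter/exit, the internals could later be
    swapped for something cheaper without touching the drivers.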

:The down-side of critical sections is the coarse granularity, making
:them unsuitable for anything not taking a short period of time. Similar
:issues to the mutex concept below apply. It also means we have to be
:on the same CPU as the interrupt.

    But this isn't necessarily true.  What really matters here is:

    * What percentage of the time is a cpu holding a critical section, and

    * What percentage of interrupts are being delayed due to code being in
      a critical section.

    Critical sections are not often held for long periods of time.  Sure
    there are a few exceptions, but even 'long' procedural paths such as
    malloc() or lwkt_switch() typically only hold a critical section for
    a microsecond or two.  The paths we really care about are the ones
    that either hold a critical section for a very long period of time,
    such as through a DELAY() call, or processing loops (such as in CAM
    or in device drivers) which can potentially process hundreds or thousands
    of events and thus take an unbounded amount of time with a critical
    section held.  Processing loops, at least, can be dealt with by adding
    an splz() call in the loop, but each interrupt that arrives in the
    meantime still costs the overhead of being taken and then deferred.
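
    The effect of such a call in a processing loop can be sketched with a
    small simulation.  Everything here is an assumption made for
    illustration: drain_deferred() merely stands in for splz(), the
    interval of 64 is arbitrary, and one interrupt is pretended to arrive
    per event so that the backlog is easy to see.

```c
#include <assert.h>

#define DRAIN_INTERVAL 64  /* assumed: events between drain points */

/* Hypothetical per-cpu bookkeeping for the simulation. */
struct cpu_state {
    int pending;    /* interrupts deferred by the critical section */
    int delivered;  /* interrupts that have been run */
};

/* Stand-in for splz(): run whatever has been deferred so far. */
void drain_deferred(struct cpu_state *cpu)
{
    cpu->delivered += cpu->pending;
    cpu->pending = 0;
}

/*
 * Process 'nevents' events inside a critical section.  One interrupt
 * is simulated per event.  Returns the largest backlog of deferred
 * interrupts that built up at any point.
 */
int process_events(struct cpu_state *cpu, int nevents)
{
    int max_backlog = 0;

    for (int i = 1; i <= nevents; i++) {
        cpu->pending++;  /* interrupt arrives and is deferred */
        if (cpu->pending > max_backlog)
            max_backlog = cpu->pending;
        if (i % DRAIN_INTERVAL == 0)
            drain_deferred(cpu);  /* bounded window for interrupts */
    }
    drain_deferred(cpu);  /* leaving the critical section */
    return max_backlog;
}
```

    With the drain point the backlog is bounded by the interval rather
    than by the length of the loop, at the price of still taking and
    deferring each interrupt once.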

:(B) Per-device mutex
:This is the model chosen by FreeBSD and Linux. Ignoring dead-locks,
:this is actually very simple to use too. Whenever the device-specific
:code is entered, the mutex is acquired and released when it is left.
:
:The down-side of this is two-fold. First of all it does require *two*
:bus-locked instructions, which is quite expensive especially under SMP.
:This holds true independent of whether the mutex is contested or not.
:The second big problem is that it can dramatically increase the interrupt
:latency (just like a long-term critical section). The results have been
:measured for the Linux and FreeBSD implementations and are the one reason
:for the preemption mess they have.

    Yes, locked bus cycle instructions can potentially be very expensive.
    But there are only a limited number of ways to get around it.  In
    the DragonFly model the only way we can get around a locked bus cycle
    is to use cpu localization to turn the lock into a critical section.

    This means that all operations have to execute on the same cpu.  

    When a device driver is interacting between its upper and lower layers
    you have to remember that the upper layers can be called from any 
    process and thus any cpu.  If that layer must interact with a lower
    layer the only way to do it is to either use a locked bus cycle or
    to use an IPI message to forward the operation to the same cpu that
    the interrupt is bound to.
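
    The IPI-forwarding alternative can be sketched as a small
    single-sender message ring.  This is only an illustration under
    stated assumptions: ipi_ring, ipi_send() and ipi_process() are
    hypothetical names, there is exactly one sending cpu per ring, and
    the actual delivery of the IPI is left out.  The driver state is only
    ever touched by the cpu that drains the ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8  /* assumed capacity, must be a power of two */

typedef void (*ipi_func_t)(void *arg);

struct ipi_msg {
    ipi_func_t func;
    void      *arg;
};

/* Hypothetical ring: one sending cpu, one target (owning) cpu. */
struct ipi_ring {
    struct ipi_msg slots[RING_SIZE];
    atomic_uint    head;  /* advanced only by the sending cpu */
    atomic_uint    tail;  /* advanced only by the target cpu */
};

void ipi_ring_init(struct ipi_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Sender side: queue func(arg) for the target cpu; false if full. */
bool ipi_send(struct ipi_ring *r, ipi_func_t func, void *arg)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;
    r->slots[head % RING_SIZE] = (struct ipi_msg){ func, arg };
    /* Publish the slot; a real system would now raise the IPI. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Target side: run all queued messages; returns how many ran. */
int ipi_process(struct ipi_ring *r)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    int n = 0;

    while (tail != head) {
        struct ipi_msg m = r->slots[tail % RING_SIZE];
        m.func(m.arg);  /* runs on the cpu that owns the state */
        tail++;
        n++;
    }
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return n;
}

/* Example of driver state touched only by the owning cpu. */
struct device_state {
    int queued_requests;
};

void dev_enqueue_request(void *arg)
{
    ((struct device_state *)arg)->queued_requests++;
}
```

    The trade-off is latency: the upper layer's request does not take
    effect until the owning cpu gets around to processing the message.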

:(C) Deferred interrupt processing via SPL masks
:This is the mutual exclusion mechanism traditionally used by the BSDs.
:It allows certain device classes to be serialised at once, e.g. to
:protect the network stack from interfering with the network drivers.
:Currently in use are splvm (anything but timer), splbio (for block devices)
:and splnet (for network drivers). The nice part of this approach is that
:it has similar performance to critical sections on UP, but is finer
:grained.
:
:The down-side is the big complexity for managing the masks. It is also
:more coarse-grained than it often has to be.

    The down side is that it doesn't work in an SMP environment unless
    you are holding the Big Giant Lock.

:Conclusion: I'd like to have two basic mechanisms in the tree:
:(a) critical sections for *short* sections
:(b) *per-device* interrupt deferral
:
:...
:Joerg

    I like the idea of per-device interrupt deferral but you have to 
    realize that in an SMP environment this almost certainly requires
    the use of locked bus cycle instructions.

    In the case where the entity wishing to defer a particular device
    interrupt is on a different cpu from the interrupt handler we have
    two race conditions we have to deal with:

    (1) cache coherency race condition with entity A attempting to mark
	the interrupt for deferral simultaneously with the interrupt
	handler beginning execution and marking the interrupt as being
	in-progress.

    (2) threading race where the interrupt handler is ALREADY RUNNING on
	another cpu and the entity wishes to defer execution of the
	handler.  In this case the entity must spin or block until the
	handler has finished running.
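
    Both races can be illustrated with a single shared state word, as in
    the sketch below.  The intr_gate type and the gate_* functions are
    hypothetical names invented for this example, not an existing API,
    and re-dispatching an interrupt that was turned away while deferred
    is deliberately left out.  Note that every transition is an atomic
    read-modify-write, i.e. a locked bus cycle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GATE_RUNNING 0x80000000u  /* handler is executing on some cpu */
/* The low 31 bits count outstanding deferral requests. */

/* Hypothetical per-device gate shared by all cpus. */
struct intr_gate {
    atomic_uint state;
};

void gate_init(struct intr_gate *g)
{
    atomic_init(&g->state, 0);
}

/*
 * Interrupt side.  Returns true if the handler may run now.  The
 * compare-and-swap settles race (1): either it sees no deferral and
 * marks the handler in-progress, or it fails and the handler stays out.
 */
bool gate_handler_begin(struct intr_gate *g)
{
    unsigned expected = 0;

    return atomic_compare_exchange_strong_explicit(
        &g->state, &expected, GATE_RUNNING,
        memory_order_acquire, memory_order_relaxed);
}

/* Interrupt side: the handler has finished. */
void gate_handler_end(struct intr_gate *g)
{
    atomic_fetch_and_explicit(&g->state, ~GATE_RUNNING,
                              memory_order_release);
}

/*
 * Entity side.  Registers a deferral, then handles race (2) by
 * spinning until a handler that was already running has finished.
 */
void gate_defer(struct intr_gate *g)
{
    atomic_fetch_add_explicit(&g->state, 1, memory_order_acquire);
    while (atomic_load_explicit(&g->state, memory_order_acquire)
           & GATE_RUNNING)
        ;  /* a real kernel might block here instead of spinning */
}

/* Entity side: drop the deferral registered by gate_defer(). */
void gate_resume(struct intr_gate *g)
{
    unsigned prev = atomic_fetch_sub_explicit(&g->state, 1,
                                              memory_order_release);

    assert((prev & ~GATE_RUNNING) != 0);  /* must have been deferred */
    (void)prev;
}
```

    Whichever of the atomic add and the compare-and-swap lands first
    wins, so exactly one side proceeds and the other backs off or waits.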

    Frankly I don't see how we can possibly avoid the use of a 
    locked bus cycle instruction for either case.  The only way to truly
    avoid locked bus cycles is through cpu localization (aka using IPIs
    to execute all related operations on the same cpu).

    At the moment, per-device interrupt deferral is achieved through the
    serialization API that I committed a few days ago, which the IF_EM 
    driver is now using.

    The use of cpu localization is one of DragonFly's major goals.  I
    *WANT* to use a cpu localization mechanism whenever possible, and indeed
    we are using such a mechanism in our networking code with incredible
    results.  But that doesn't mean that cpu localization works in every
    case.  FreeBSD is using mutexes for just about everything.  We may 
    still need to use mutexes but when we do it won't be for everything, it
    will only be for those things that cannot be efficiently implemented
    with cpu localization.  Interrupt interlocks could very well be one of 
    those things for which cpu localization is not the best solution.

    In any case, these are all very complex issues.  Even if we come up with
    a clean solution, we can't go from step A to step Z in a single step. 
    We still have to take baby steps to achieve our goals.  And that is 
    what we are doing right now by removing SPLs.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>