On SMP

Sat Jan 22 14:27:40 PST 2005

:Assuming all of the DragonFly Developer Team has read the interview
:http://www.onlamp.com/pub/a/bsd/2005/01/20/smpng.html to
:Scott Long on SMPng.
:
:What do you think about what he says on the question: What do you think
:about DragonFlyBSD's SMP technology? How does it compare to SMPng?
:
:And what will you answer to the same question? 
:
:Thanks!
:
:		-RaD|Tz

    Well, it's a reasonable response to a loaded question.  Short and
    diplomatic.  Not really correct, but that's understandable.  People
    like to put things into existing categories and it's easy to try to put
    DFly into the Mach box, even though it is nothing like Mach.  We use
    a messaging approach, sure, but just because Mach also uses some sort
    of messaging doesn't equate us with Mach.

    The DragonFly approach has a lot of advantages but it would not
    really be fair to try to compare the implementation of our ideas
    against FreeBSD's implementation of their ideas, at least not
    until we actually turn off the Big Giant Lock in DFly.

    From a theoretical point of view, this is what I like about the
    DragonFly approach:

    * Simple coding rules make for less buggy, more readable, and
      more easily debugged code.  I discuss this a bit later on in
      this posting.

    * The threaded subsystem approach requires one to carefully consider
      the algorithmic implementation rather than simply trying to
      lock-up code that was not originally designed for MP.  I think this
      will yield more efficient and more maintainable code.

    * The threaded subsystem approach has far less knowledge pollution 
      between subsystems, making coding easier and, again, making
      code maintenance easier.

    * The threaded subsystem approach has the advantage of being able to
      batch operations to take maximum advantage of the cpu's L1 and L2
      caches.  The partitioning effect coupled with proper algorithmic
      design should theoretically allow us to scale to multiple cpus far
      more easily than a mutex approach.  The batching effect also greatly
      reduces messaging overhead... that is, as system load increases,
      the messaging overhead decreases and the effectiveness of the L1/L2
      cache increases.  This is desirable, and it is also the exact
      reverse of what you get in a heavily mutexed approach.
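
    To make the batching effect concrete, here is a minimal,
    single-threaded sketch in C.  It is not DragonFly code: struct subsys,
    subsys_send() and subsys_run() are hypothetical names made up for the
    illustration, and a "wakeup" simply stands in for the fixed cost of
    switching to the subsystem thread.

```c
#include <assert.h>
#include <stddef.h>

#define QSIZE 64   /* hypothetical queue depth for the sketch */

/* A threaded subsystem with a private message queue and some counters. */
struct subsys {
    int           queue[QSIZE];
    size_t        count;      /* messages waiting to be handled */
    unsigned long wakeups;    /* times the subsystem thread had to run */
    unsigned long handled;    /* total messages processed */
    long          sum;        /* stand-in for the real work done per message */
};

/* Queue one message for the subsystem.  Returns 0, or -1 if the queue is full. */
int subsys_send(struct subsys *s, int msg)
{
    if (s->count == QSIZE)
        return -1;
    s->queue[s->count++] = msg;
    return 0;
}

/*
 * One wakeup of the subsystem thread: handle everything that is queued.
 * The fixed wakeup cost is paid once no matter how many messages are
 * waiting, so the per-message overhead shrinks as the queue gets longer.
 */
size_t subsys_run(struct subsys *s)
{
    size_t i, n = s->count;

    assert(n <= QSIZE);
    if (n == 0)
        return 0;
    s->wakeups++;
    for (i = 0; i < n; i++)
        s->sum += s->queue[i];
    s->handled += n;
    s->count = 0;
    return n;
}
```

    Under light load each message costs one wakeup.  Under heavy load
    many messages share a single wakeup, and the handler's code and data
    stay hot in the cache while the batch is processed.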

    The mutex approach isn't bad, it's just that you have to be careful
    *WHAT* you place under a mutex and what you do not.  In my opinion,
    FreeBSD is placing too much emphasis on mutexing every little niggling
    thing.  For example, the process structure is very heavily mutexed
    in FreeBSD whereas in DragonFly the intention is never to mutex it, or
    much of it, in a fully MP system.

    This methodology has a direct relationship to the complexity of the
    coding rules.  In DragonFly, the rule is: someone or something owns 
    the structure and can access it without any further messing around.
    The process owns its own process structure and can access it directly.
    If a process wants to access some other process's process structure,
    it generally has to do it in the context of the cpu owning the target
    process (which in DragonFly terms means using the very simple IPI
    remote function execution API), or it has to send a message to the
    target process, depending on the subsystem.
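
    The ownership rule can be sketched in a few lines of C.  This is a
    single-threaded simulation under stated assumptions, not the actual
    DragonFly IPI API: remote_call(), cpu_drain(), proc_renice() and the
    struct layouts are hypothetical, and the memory-ordering details a
    real cross-cpu queue needs are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU   2    /* hypothetical cpu count */
#define QDEPTH 8    /* hypothetical per-cpu queue depth */

typedef void (*remote_fn)(void *arg);

/* Per-cpu queue of functions that other cpus have asked this cpu to run. */
struct ipi_queue {
    remote_fn fn[QDEPTH];
    void     *arg[QDEPTH];
    size_t    count;
};

struct proc {
    int owner_cpu;   /* only code running on this cpu touches the fields below */
    int nice;
};

struct ipi_queue ipiq[NCPU];

/* Ask `cpu` to run fn(arg) later.  Returns 0, or -1 if its queue is full. */
int remote_call(int cpu, remote_fn fn, void *arg)
{
    struct ipi_queue *q;

    assert(cpu >= 0 && cpu < NCPU);
    q = &ipiq[cpu];
    if (q->count == QDEPTH)
        return -1;
    q->fn[q->count] = fn;
    q->arg[q->count] = arg;
    q->count++;
    return 0;
}

/* Runs "on" `cpu`: execute every function queued for it, in order. */
void cpu_drain(int cpu)
{
    struct ipi_queue *q;
    size_t i;

    assert(cpu >= 0 && cpu < NCPU);
    q = &ipiq[cpu];
    for (i = 0; i < q->count; i++)
        q->fn[i](q->arg[i]);
    q->count = 0;
}

/* Always executes in the owner's context, so no lock is needed. */
void bump_nice(void *arg)
{
    struct proc *p = arg;

    p->nice++;
}

/* Called from code running on `curcpu`. */
int proc_renice(int curcpu, struct proc *p)
{
    if (curcpu == p->owner_cpu) {
        bump_nice(p);            /* owner: direct access */
        return 0;
    }
    return remote_call(p->owner_cpu, bump_nice, p);   /* foreign: defer to owner */
}
```

    The common case (the owner touching its own structure) is a plain
    function call.  Only the foreign case pays for the queueing.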

    In FreeBSD the rule is basically: you MUST use a mutex if you access
    field blah blah, period end of story.  Seems simple, but you have to
    know exactly which field requires which mutex.  And if 95% of the
    code is accessing the field from a fixed context, the DragonFly
    approach is far easier to code (no mutexes or messaging or anything at
    all required to access the field).

    Another example of this rule can be found in the network protocol stacks.
    The protocol stacks own their own structures (the PCBs in this case),
    and are able to manipulate them free and clear.  All other code in the
    system wanting to access elements of a PCB basically has to send a
    message to the controlling network protocol thread to get at the
    information.  Since the protocol code is doing 99% of the accesses to
    those structures, the overall coding is very straightforward and almost
    trivial.  Not only that, but we get free serialization of single-socket
    operations... and it turns out that serialization is necessary in
    nearly every case where a foreign process wants to do something to a
    socket/PCB anyway.

    Implementing this level of parallelism with a mutex-based system is
    far more difficult because the only way to guarantee serialization is
    to actually hold the mutex over the entire operation... which means
    either you can't block, or you have to do something nasty (knowledge
    pollution, or using a lockmgr lock) if you might block.  But in DragonFly
    the network protocol stack gets the serialization for free AND it can
    block anywhere while processing a PCB without breaking the serialization.
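
    A rough sketch of how the serialization falls out of the message
    queue, again in C and again single-threaded.  All of the names here
    (struct pcb, struct msg, port_send(), proto_thread_run(), the MSG_*
    operations) are hypothetical and are not the real DragonFly message
    port API.  Blocking is not modelled; the sketch only shows that
    requests against a PCB are handled one at a time, in order, with no
    lock on the PCB itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAXMSG 16   /* hypothetical port depth */

/* Protocol control block: touched only by the protocol thread. */
struct pcb {
    int           state;        /* 0 = closed, 1 = connected */
    unsigned long bytes_sent;
};

enum msg_op { MSG_CONNECT, MSG_SEND, MSG_GET_BYTES };

struct msg {
    enum msg_op   op;
    struct pcb   *pcb;
    unsigned long arg;      /* input, e.g. byte count for MSG_SEND */
    unsigned long reply;    /* filled in by the protocol thread */
    int           error;    /* 0 on success */
    int           done;     /* set once the message has been handled */
};

struct msgport {
    struct msg *q[MAXMSG];
    size_t      head, tail;   /* FIFO indices; head == tail means empty */
};

/* Foreign code: hand a request to the protocol thread.  -1 if the port is full. */
int port_send(struct msgport *port, struct msg *m)
{
    size_t next = (port->tail + 1) % MAXMSG;

    assert(m != NULL && m->pcb != NULL);
    if (next == port->head)
        return -1;
    m->done = 0;
    port->q[port->tail] = m;
    port->tail = next;
    return 0;
}

/* Protocol thread: the only code that ever reads or writes a PCB. */
void proto_thread_run(struct msgport *port)
{
    while (port->head != port->tail) {
        struct msg *m = port->q[port->head];

        port->head = (port->head + 1) % MAXMSG;
        m->error = 0;
        switch (m->op) {
        case MSG_CONNECT:
            m->pcb->state = 1;
            break;
        case MSG_SEND:
            if (m->pcb->state != 1)
                m->error = -1;          /* not connected */
            else
                m->pcb->bytes_sent += m->arg;
            break;
        case MSG_GET_BYTES:
            m->reply = m->pcb->bytes_sent;
            break;
        }
        m->done = 1;
    }
}
```

    Because each message runs to completion before the next one is
    dequeued, a sender that queues MSG_CONNECT, MSG_SEND and then
    MSG_GET_BYTES sees them applied in exactly that order.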

    It is also important to note that one does not have to make every last
    line of OS code maximally efficient.  For example, who cares whether
    the sysctl variable interface takes 1uS or 10uS to run?  Nobody, that's
    who.  I believe that coding non-critical subsystems with our approach
    allows the programmer to choose whether to do it in a straightforward,
    possibly less efficient manner, or whether to go full MP.  A heavily
    mutexed approach does not have that flexibility.

    An example of this would be our stats counter collection code.  In 
    DragonFly we implement a separate set of statistics counters on each
    cpu.  This allows the counters to be updated using non-locked and
    even non-atomic instructions, greatly simplifying the coding of those
    counters.  The sysctl code then aggregates the stats counters on all
    the cpus together and reports a single number to userland.  It does this
    by actually moving the user process to each cpu in the system in order
    to access that cpu's stats counter, in a circle until it gets back to the
    original cpu the user process was on.  This is obviously not efficient,
    but sysctl reporting to userland doesn't have to be and it is utterly
    trivial to code.
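
    The per-cpu counter scheme looks roughly like this.  It is a
    userland simulation in C, not the kernel code: NCPU, struct
    cpu_stats, migrate_to() and the counter names are hypothetical, and
    migrate_to() merely records which cpu the caller is pretending to
    run on.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 4   /* hypothetical cpu count for the sketch */

/* One private set of counters per cpu, padded to reduce false sharing. */
struct cpu_stats {
    unsigned long packets_in;
    unsigned long packets_out;
    char          pad[64 - 2 * sizeof(unsigned long)];
};

struct cpu_stats stats[NCPU];
int curcpu;   /* stand-in for "the cpu this thread is running on" */

/* Stand-in for migrating the calling thread to another cpu. */
void migrate_to(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    curcpu = cpu;
}

/* Fast path: only the owning cpu touches its slot, so a plain increment is enough. */
void count_packet_in(void)
{
    stats[curcpu].packets_in++;
}

/*
 * Slow path (the sysctl side): visit every cpu in a circle, read that
 * cpu's counter while "running" there, and finish on the original cpu.
 */
unsigned long total_packets_in(void)
{
    int origin = curcpu;
    int cpu = origin;
    unsigned long total = 0;

    do {
        total += stats[curcpu].packets_in;
        cpu = (cpu + 1) % NCPU;
        migrate_to(cpu);
    } while (cpu != origin);
    return total;
}
```

    All of the cost lands on the reader, which is the side nobody cares
    about.  The update path stays a single non-atomic increment.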

    -

    In any case, this is why I feel we have a superior programming model
    (again, bias... I wouldn't have started DragonFly if I thought our
    model wasn't!).  Flexibility and scalability without the complexity.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>