On SMP
Matthew Dillon
dillon at apollo.backplane.com
Sat Jan 22 14:27:40 PST 2005
:Assuming all of the DragonFly Developer Team has read the interview
:http://www.onlamp.com/pub/a/bsd/2005/01/20/smpng.html to
:Scott Long on SMPng.
:
:What do you think about what he says on the question: What do you think
:about DragonFlyBSD's SMP technology? How does it compare to SMPng?
:
:And what will you answer to the same question?
:
:Thanks!
:
: -RaD|Tz
Well, it's a reasonable response to a loaded question. Short and
diplomatic. Not really correct, but that's understandable. People
like to put things into existing categories and it's easy to try to put
DFly into the Mach box, even though it is nothing like Mach. We use
a messaging approach, sure, but just because Mach also uses some sort
of messaging doesn't equate us with Mach.
The DragonFly approach has a lot of advantages but it would not
really be fair to try to compare the implementation of our ideas
against FreeBSD's implementation of their ideas, at least not
until we actually turn off the Big Giant Lock in DFly.
From a theoretical point of view, this is what I like about the
DragonFly approach:
* Simple coding rules make for less buggy, more readable, and
more easily debugged code. I discuss this a bit later on in
this posting.
* The threaded subsystem approach requires one to carefully consider
the algorithmic implementation rather than simply trying to
lock-up code that was not originally designed for MP. I think this
will yield more efficient and more maintainable code.
* The threaded subsystem approach has far less knowledge pollution
between subsystems, making coding easier and, again, making
code maintenance easier.
* The threaded subsystem approach has the advantage of being able to
batch operations to take maximum advantage of the cpu's L1 and L2
caches. The partitioning effect coupled with proper algorithmic
design should theoretically allow us to scale to multiple cpus far
more easily than a mutex approach. The batching effect also greatly
reduces messaging overhead... that is, as system load increases,
the messaging overhead decreases and the effectiveness of the L1/L2
cache increases. This is desirable, and it is also the exact
reverse of what you get in a heavily mutexed approach.
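The batching point can be illustrated with a toy model in C. This is only a counting sketch, not DragonFly code: the `subsystem` structure and the `enqueue` and `wake_and_drain` names are hypothetical, and a "wakeup" stands in for the fixed cost of scheduling the subsystem thread. The fixed cost is paid once per drain, so the more messages pile up between drains, the lower the cost per message.

```c
#include <assert.h>

/* Hypothetical state for one threaded subsystem (illustration only). */
struct subsystem {
    int queued;     /* messages waiting to be processed */
    int wakeups;    /* times the subsystem thread had to be scheduled */
    int handled;    /* messages processed in total */
};

/* Producer side: queue n messages before the consumer next runs. */
static void enqueue(struct subsystem *s, int n)
{
    s->queued += n;
}

/* Consumer side: a single wakeup drains everything that accumulated. */
static void wake_and_drain(struct subsystem *s)
{
    if (s->queued == 0)
        return;                 /* nothing to do, no wakeup charged */
    s->wakeups++;               /* fixed cost, paid once per batch */
    s->handled += s->queued;    /* the whole batch runs back to back */
    s->queued = 0;
}
```

Under light load every message pays for its own wakeup. Under heavy load several messages share one, and they also run back to back on the same cpu, which is where the cache benefit would come from.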
The mutex approach isn't bad, it's just that you have to be careful
*WHAT* you place under a mutex and what you do not. In my opinion,
FreeBSD is placing too much emphasis on mutexing every little niggling
thing. For example, the process structure is very heavily mutexed
in FreeBSD whereas in DragonFly the intention is never to mutex it, or
much of it, in a fully MP system.
This methodology has a direct relationship to the complexity of the
coding rules. In DragonFly, the rule is: someone or something owns
the structure and can access it without any further messing around.
The process owns its own process structure and can access it directly.
If a process wants to access some other process's process structure,
it generally has to do it in the context of the cpu owning the target
process (which in DragonFly terms means using the very simple IPI
remote function execution API), or it has to send a message to the
target process, depending on the subsystem.
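As a rough illustration of the ownership rule, here is a minimal single-threaded simulation in C. It does not reproduce DragonFly's actual IPI API: `send_remote`, `process_remote`, `renice_on_owner`, the queue layout, and the `NCPU` and `IPIQ_SIZE` values are all assumptions made for this sketch. A real kernel would also need the queue to be safe against concurrent senders, which this simulation sidesteps by running everything in one thread.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU      2   /* hypothetical cpu count */
#define IPIQ_SIZE 8   /* hypothetical queue depth (power of two) */

typedef void (*remote_fn)(void *arg);

struct ipi_msg { remote_fn fn; void *arg; };

/* Per-cpu queue of functions that other cpus asked this cpu to run. */
struct cpu {
    struct ipi_msg q[IPIQ_SIZE];
    size_t head;   /* index of the next message to run */
    size_t tail;   /* index of the next free slot */
};

static struct cpu cpus[NCPU];

/* A process structure is owned by exactly one cpu at a time. */
struct proc {
    int owner_cpu;
    int nice;
};

/* Ask cpu `target` to run fn(arg) in its own context.
 * Returns 0 on success, -1 if the target's queue is full. */
static int send_remote(int target, remote_fn fn, void *arg)
{
    struct cpu *c = &cpus[target];
    if (c->tail - c->head == IPIQ_SIZE)
        return -1;
    c->q[c->tail % IPIQ_SIZE] = (struct ipi_msg){ fn, arg };
    c->tail++;
    return 0;
}

/* Runs on cpu `self` only: execute everything queued for this cpu.
 * Returns the number of functions executed. */
static int process_remote(int self)
{
    struct cpu *c = &cpus[self];
    int ran = 0;
    while (c->head != c->tail) {
        struct ipi_msg m = c->q[c->head % IPIQ_SIZE];
        c->head++;
        m.fn(m.arg);
        ran++;
    }
    return ran;
}

/* Executes in the owner's context, so it touches the proc with no lock. */
static void renice_on_owner(void *arg)
{
    struct proc *p = arg;
    p->nice += 1;
}
```

Code running on some other cpu would call `send_remote(p->owner_cpu, renice_on_owner, p)` and never write to `p->nice` itself. The field changes only when the owning cpu gets around to `process_remote`.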
In FreeBSD the rule is basically: you MUST use a mutex if you access
field blah blah, period end of story. Seems simple, but not only do
you have to know exactly which field requires which mutex, but if 95% of
the code is accessing the field from a fixed context, the DragonFly
approach is far easier to code (no mutexes or messaging or anything at
all required to access the field).
Another example of this rule can be found in the network protocol stacks.
The protocol stacks own their own structures (the PCBs in this case),
and are able to manipulate them free and clear. All other code in the
system wanting to access elements of a PCB basically has to send a
message to the controlling network protocol thread to get at the
information. Since the protocol code is doing 99% of the accesses to
those structures, the overall coding is very straightforward and almost
trivial. Not only that, but we get free serialization of single-socket
operations... and it turns out that serialization is necessary in
nearly every case where a foreign process wants to do something to a
socket/PCB anyway.
Implementing this level of parallelism with a mutex based system is
far more difficult because the only way to guarantee serialization is
to actually hold the mutex over the entire operation... which means
either you can't block, or you have to do something nasty (knowledge
pollution, or using a lockmgr lock) if you might block. But in DragonFly
the network protocol stack gets the serialization for free AND it can
block anywhere while processing a PCB without breaking the serialization.
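The "free serialization" argument can be sketched as a message port drained by a single owner. The C below is a single-threaded model under stated assumptions, not the DragonFly network stack: the `pcb`, `pcb_msg`, and `msgport` structures and the `port_send` and `protocol_run` functions are hypothetical names, and a real port would need its queue protected against concurrent senders.

```c
#include <assert.h>
#include <stddef.h>

enum pcb_cmd { PCB_SET_STATE, PCB_GET_STATE };

/* Hypothetical PCB: only the protocol thread touches it directly. */
struct pcb { int state; };

struct pcb_msg {
    enum pcb_cmd cmd;
    struct pcb *pcb;
    int value;              /* input for SET, reply for GET */
    int done;               /* set once the protocol thread has replied */
    struct pcb_msg *next;
};

/* FIFO message port owned by the protocol thread. */
struct msgport {
    struct pcb_msg *head;
    struct pcb_msg **tailp;
};

static void port_init(struct msgport *mp)
{
    mp->head = NULL;
    mp->tailp = &mp->head;
}

/* Called by foreign code: queue a request instead of touching the PCB. */
static void port_send(struct msgport *mp, struct pcb_msg *m)
{
    m->done = 0;
    m->next = NULL;
    *mp->tailp = m;
    mp->tailp = &m->next;
}

/* Body of the protocol thread's loop: one message at a time, in
 * arrival order. Returns the number of messages handled. */
static int protocol_run(struct msgport *mp)
{
    struct pcb_msg *m;
    int handled = 0;

    while ((m = mp->head) != NULL) {
        mp->head = m->next;
        if (mp->head == NULL)
            mp->tailp = &mp->head;
        switch (m->cmd) {
        case PCB_SET_STATE:
            m->pcb->state = m->value;   /* no lock: we own the PCB */
            break;
        case PCB_GET_STATE:
            m->value = m->pcb->state;
            break;
        }
        m->done = 1;
        handled++;
    }
    return handled;
}
```

Because only `protocol_run` ever reads or writes a PCB, two requests against the same socket cannot interleave, and the handler could block partway through a request without any other code seeing the PCB in an intermediate state.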
It is also important to note that one does not have to make every last
line of OS code maximally efficient. For example, who cares whether
the sysctl variable interface takes 1uS to run or 10uS? Nobody, that's
who. I believe that for non-critical subsystems our approach allows
the programmer to choose whether to do it in a straightforward,
possibly less efficient manner, or whether to go full MP. A heavily
mutexed approach does not have that flexibility.
An example of this would be our stats counter collection code. In
DragonFly we implement a separate set of statistics counters on each
cpu. This allows the counters to be updated using non-locked and
even non-atomic instructions, greatly simplifying the coding of those
counters. The sysctl code then aggregates the stats counters on all
the cpus together and reports a single number to userland. It does this
by actually moving the user process to each cpu in the system in order
to access that cpu's stats counter, in a circle until it gets back to the
original cpu the user process was on. This is obviously not efficient,
but sysctl reporting to userland doesn't have to be and it is utterly
trivial to code.
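A minimal sketch of the per-cpu counter idea in C follows. The names (`cpu_stats`, `stats_packet_in`, `stats_total_packets_in`), the `NCPU` value, and the 64-byte cache line size are assumptions for illustration, not the DragonFly source. In this userland model the reader simply indexes each cpu's slot in turn, where the kernel would migrate the calling thread to that cpu.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4   /* hypothetical cpu count for this sketch */

/* One counter block per cpu, padded to an assumed 64-byte cache line
 * so that two cpus never write to the same line. */
struct cpu_stats {
    uint64_t packets_in;
    char pad[64 - sizeof(uint64_t)];
};

static struct cpu_stats stats[NCPU];

/* Fast path: called only by code running on `cpu`, so a plain
 * non-atomic increment is enough. */
static void stats_packet_in(int cpu)
{
    stats[cpu].packets_in++;
}

/* Slow path: visit every cpu in a circle, starting and ending at
 * `start_cpu`, and sum the counters. */
static uint64_t stats_total_packets_in(int start_cpu)
{
    uint64_t total = 0;
    int cpu = start_cpu;

    do {
        /* In a kernel, the calling thread would move to `cpu` here. */
        total += stats[cpu].packets_in;
        cpu = (cpu + 1) % NCPU;
    } while (cpu != start_cpu);
    return total;
}
```

All of the cost lands on the rarely used reporting path, while the update path stays a single unlocked increment.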
In any case, this is why I feel we have a superior programming model
(again, bias... I wouldn't have started DragonFly if I thought our
model wasn't!). Flexibility and scalability without the complexity.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>