[GSOC] Add SMT/HT awareness to DragonFlyBSD scheduler
Tobias Weingartner
weingart at tepid.org
Fri Mar 30 12:27:05 PDT 2012
This is one of the better emails on this subject I've seen in a long time.
Very nice to see the amount of work you have put into this so far. If I
may, I'd love to make a couple of comments. These are not meant to
discourage you in any way; they're just comments from the peanut gallery.
On Fri, Mar 30, 2012 at 11:24 AM, Mihai Carabas <mihai.carabas at gmail.com> wrote:
>
> About this project, after browsing the article [3] you recommended and the
> Linux sources, I elaborated some important steps to follow.
>
> In order to add support for the SMT to the scheduler, I have to establish
> answers for the following basic questions:
> 1. which CPUs are siblings (share the same core)
> 2. what the status of each core is
> 3. where to schedule
I'd personally love to see a simpler and more adaptable approach myself. In
general there are not all *that* many execution units (cores, threads, etc)
on any system. Even in a system with 256 execution units, there are only
256^2 = 64K possible transitions between two different units. And since these
transitions tend to cost the same no matter which direction you are going,
ie: cost(a -> b) == cost(b -> a), you only need to keep track of half that
amount.
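To make that concrete, the cost table could be as simple as a triangular
array indexed by the unordered pair of cpus. This is just a sketch; MAXCPU
and the cost units are made-up placeholders, not anything in the tree:

#define MAXCPU 256

/* Lower triangle only, since cost(a -> b) == cost(b -> a). */
static unsigned int mig_cost[MAXCPU * (MAXCPU + 1) / 2];

/* Map an unordered cpu pair onto the triangular array. */
static inline unsigned int
mig_cost_index(unsigned int a, unsigned int b)
{
	if (a < b) {
		unsigned int t = a;
		a = b;
		b = t;
	}
	return (a * (a + 1) / 2 + b);
}

static inline unsigned int
migration_cost(unsigned int a, unsigned int b)
{
	return (mig_cost[mig_cost_index(a, b)]);
}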
So, instead of groveling through various NUMA tables, cpu affinity information,
and other such things, I think it would be easier/better to discover these
things during the reconfiguration process, whenever an execution unit is
added to or removed from the set of schedulable entities. In this way,
removing an execution unit could be implemented by, for example, marking the
cost of migrating to it as being very, very high.
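Continuing the sketch above, taking a unit offline could then just poison its
entries in the cost table; MIG_COST_OFFLINE is an invented sentinel value:

#define MIG_COST_OFFLINE 0xffffffffu	/* hypothetical "never go here" */

static void
sched_unit_offline(unsigned int cpu)
{
	unsigned int i;

	/* Every pair involving 'cpu' becomes prohibitively expensive. */
	for (i = 0; i < MAXCPU; i++)
		mig_cost[mig_cost_index(cpu, i)] = MIG_COST_OFFLINE;
}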
Of course, this really is just the tip of the problem. The much bigger
problem is being able to manage both latency (for interrupts) and throughput
(for other things) in a meaningful manner. There are two things I can see
having a pretty big impact. The first is managing the cost of memory accesses
so they stay "local" to the cpu on a NUMA (think amd, etc) machine: moving a
process's memory across to the newly scheduled cpu's memory bus will tend to
be significantly more expensive than just moving the executing process from
one cpu to another. The second is keeping interrupt code "close" to the cpu
and memory/bus where the interrupt is happening.
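To illustrate that trade-off (with completely invented numbers), the
migrate-the-memory-or-not decision might look something like this:

static int
numa_migration_worthwhile(unsigned long resident_pages,
    unsigned long expected_remote_accesses)
{
	const unsigned long page_copy_cost = 2000;	/* cycles, hypothetical */
	const unsigned long remote_penalty = 100;	/* cycles, hypothetical */

	/*
	 * Copy the memory over only if leaving it remote would cost
	 * more in access penalties than copying it over once.
	 */
	return (expected_remote_accesses * remote_penalty >
	    resident_pages * page_copy_cost);
}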
> After a brief documentation I can start working on the following goals:
>
> A. Detect how many threads each core has
> - adding a generic way of detecting how many threads each core has (create
> a map of siblings, for example)
Is this similar to what I was yakking about above? :)
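For what it's worth, the sibling map could be as simple as a per-cpu mask of
the logical cpus sharing its physical core. A sketch, where core_id_of() is
a hypothetical helper (on x86 it would be derived from the APIC ID at boot):

#include <stdint.h>

typedef uint64_t cpumask_t;	/* enough for 64 cpus in this sketch */

int core_id_of(int cpu);	/* hypothetical: from APIC topology */

static cpumask_t sibling_mask[64];

static void
build_sibling_map(int ncpus)
{
	int i, j;

	for (i = 0; i < ncpus; i++) {
		for (j = 0; j < ncpus; j++) {
			if (core_id_of(i) == core_id_of(j))
				sibling_mask[i] |= (cpumask_t)1 << j;
		}
	}
}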
> - guard all the newly added information regarding HyperThreading. For this,
> I will add an option (Enable HT) in the config file [5].
I'm not 100% sure how the HT stuff affects things today. It used to be that
HT was definitely about shared resources, and running an "MP" style workload
on an HT cpu was really not going to get you much. I think that most CPUs
today have much better HT implementations, and they can largely be treated
as a full "MP" core. Matt may have some better insights into this though.
> B. Effective SMT-aware task scheduling
> In order to achieve the final goal, I have to design how threads will get
> scheduled. I will have to treat different cases:
> - "passive" scheduling - select a free physical CPU (if available) for the
> next thread to schedule
> - "active" scheduling - if a physical CPU becomes idle while another
> physical CPU is running two threads, migrate one of them to the idle CPU
> - which task to take from a list - take the one that was scheduled on that
> domain (physical CPU), no matter which logical CPU it ran on before
> - tasks should stick to the same domain (physical CPU)
In general, I believe that "moving" threads to a new CPU should be "harder"
than just an "Oh, look, a free core, and we have more than 1 thread here,
let's move" type of heuristic. There should be some sort of hysteresis, or
resistance to this type of movement, such that we don't end up flip-flopping
all the time.
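Even something as dumb as a per-cpu idle tick counter would give you that
resistance; the threshold here is pulled out of thin air:

#define MIGRATE_HYSTERESIS_TICKS 4	/* hypothetical threshold */

static int idle_ticks[64];	/* consecutive idle ticks, per cpu */

/* Called once per scheduler tick on an idle cpu. */
static int
should_steal(int idle_cpu)
{
	if (++idle_ticks[idle_cpu] < MIGRATE_HYSTERESIS_TICKS)
		return (0);	/* not idle long enough yet, don't migrate */
	idle_ticks[idle_cpu] = 0;
	return (1);
}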
Also, there are
different reasons or goals to scheduling. The effective/best use of
all cpu cores
is one of them. IE: using all the cores as much as possible. Another
is to identify
when cores/resources (such as memory, bus bandwidth, etc) are being too highly
utilized, and then splitting load. Conversely consolidating when
things "shrink" in
the utilization department. Note, that the first, naiive method of
scheduling will
end up eating battery life on a laptop. You're keeping all the cores
as busy as you
possibly can, while other methods may not be as optimal in terms of cycles
available to compute things, they may significantly change the power profile of
the machine.
I certainly don't expect anyone to solve these things all at once, but it
would be really cool if these things are kept in mind, and if the method used
to solve one item were general enough that the same basic infrastructure
could be used to solve another goal. In other words, if I wish to create a
"laptop" scheduler, I should be able to use the same data structures and
methods as your "high performance" scheduler, but simply change the "logic"
part to suit my goals.
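In case it helps, this is roughly the shape I have in mind. Every name here
is invented for illustration, though DragonFly's pluggable usched interface
has a similar flavor:

struct thread;				/* opaque in this sketch */

struct sched_policy {
	const char *name;
	/* pick the cpu a newly runnable thread should go to */
	int (*select_cpu)(struct thread *td);
	/* decide whether an idle cpu should pull work from a busy one */
	int (*should_steal)(int idle_cpu);
};

int perf_select_cpu(struct thread *td);		/* hypothetical */
int perf_should_steal(int idle_cpu);		/* hypothetical */
int laptop_select_cpu(struct thread *td);	/* hypothetical */
int laptop_should_steal(int idle_cpu);		/* hypothetical */

/* Spread load across cores for throughput. */
static const struct sched_policy perf_policy = {
	"high-performance", perf_select_cpu, perf_should_steal
};

/* Consolidate load so idle cores can sleep. */
static const struct sched_policy laptop_policy = {
	"laptop", laptop_select_cpu, laptop_should_steal
};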
Does that make sense?
> I would be glad to know your opinion on the above.
>
> Best regards,
> Mihai
Again, most cool. Best of luck. :)
-Toby.