GSoC: Add SMT/HT awareness to DragonFlyBSD scheduler
Mihai Carabas
mihai.carabas at gmail.com
Tue Sep 11 08:37:49 PDT 2012
Hi,
As I promised in one of my previous e-mails, I am posting on the list some
of my discussions with Matthew regarding the scheduling subsystems; maybe
it will be useful to someone else:
:Let's say we have a user CPU-bound process (a batchy one). The
:bsd4_schedulerclock will notice this and will mark a need for user
:rescheduling (need_user_resched();). This flag is only checked in
:bsd4_acquire_curproc (which is called when a process returns from
:kernel-space... the code from there is clear to me) and in lwkt_switch().
:My question is: where in the code is lwkt_switch() called to switch to
:another thread if you have that CPU-bound process running? Suppose that
:CPU-bound process never blocks and never enters the kernel... which
:statement in the code pushes it off the CPU by calling lwkt_switch()?
There are two basic mechanisms at work here, and both are rather
sensitive and easy to break (I've broken and then fixed the mechanism
multiple times over the years).
The first is that when a LWKT thread is scheduled to a cpu, a flag is set
for that cpu indicating that a LWKT reschedule may be needed.
The scheduling of a LWKT thread on a cpu always occurs on that cpu
(so if scheduled from a different cpu an IPI interrupt/message is sent
to the target cpu and the actual scheduling is done on the target cpu).
This IPI represents an interrupt to whatever is running on that cpu,
thus interrupting any user code (for example), which triggers a sequence
of events which allow LWKT to schedule the higher priority thread.
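As a rough illustration of this first mechanism (a userspace sketch with
simplified stand-in names, not the real kernel symbols), the idea looks
something like:

    /* Illustrative sketch only: LWKT scheduling is always performed on the
     * thread's owning cpu; a request from another cpu becomes an "IPI"
     * message executed on the owner. Simplified names throughout.
     */
    #include <stdbool.h>

    struct cpu_state;

    struct thread {
        struct cpu_state *owner;        /* cpu this thread is pinned to */
        struct thread    *next;
    };

    struct cpu_state {
        struct thread *runq_head;       /* this cpu's LWKT run queue */
        bool           lwkt_resched;    /* "an LWKT reschedule may be needed" */
    };

    extern struct cpu_state *mycpu(void);   /* the cpu we are running on */
    extern void send_ipi(struct cpu_state *target,
                         void (*func)(struct thread *), struct thread *td);

    static void
    schedule_local(struct thread *td)
    {
        struct cpu_state *gd = td->owner;

        /* enqueue on the owning cpu's queue and flag a possible reschedule;
         * the flag is acted on when the interrupted code is resumed */
        td->next = gd->runq_head;
        gd->runq_head = td;
        gd->lwkt_resched = true;
    }

    void
    lwkt_schedule_sketch(struct thread *td)
    {
        if (td->owner == mycpu()) {
            schedule_local(td);         /* already on the right cpu */
        } else {
            /* never touch a remote cpu's queue directly; interrupt it and
             * let it do the scheduling itself */
            send_ipi(td->owner, schedule_local, td);
        }
    }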
The second mechanism is when a userland thread (LWP) is scheduled. It
works very similarly to the first mechanism but is a bit more complex.
The LWP is not directly scheduled on the target cpu. Instead the LWP
is placed in the global userland scheduler queue(s) and a target cpu
is selected and an IPI is sent to that target cpu. (see the
'need_user_resched_remote' function in usched_bsd4.c, which is executed
by the IPI message). The IPI is only sent if the originating cpu
determines that the LWP has a higher priority than the LWP currently
running on the target cpu.
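That decision can be sketched roughly like this (illustrative names only,
not the actual usched_bsd4 code):

    /* Illustrative sketch: the LWP goes onto a global userland queue, and
     * the originating cpu only interrupts the target cpu if the new LWP
     * outranks what that cpu is currently running in userland.
     * Lower numeric value = better priority in this sketch.
     */
    struct lwp { int prio; };

    struct cpu_slot {
        struct lwp *uschedcp;           /* designated userland LWP, or NULL */
    };

    extern void global_queue_insert(struct lwp *lp);
    extern struct cpu_slot *choose_target_cpu(void);
    extern void send_user_resched_ipi(struct cpu_slot *target);

    void
    setrunqueue_sketch(struct lwp *lp)
    {
        struct cpu_slot *target;

        global_queue_insert(lp);        /* the LWP is not LWKT scheduled yet */
        target = choose_target_cpu();

        if (target->uschedcp == NULL ||
            lp->prio < target->uschedcp->prio) {
            /* roughly what need_user_resched_remote() achieves via IPI */
            send_user_resched_ipi(target);
        }
    }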
There is a third mechanism for userland threads related to the helper
thread (see 'sched_thread' in usched_bsd4.c). The helper thread is
intended to only handle scheduling a userland thread on its cpu when
nothing is running on that cpu at all.
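The helper's job, roughly sketched (illustrative placeholders, not the real
sched_thread code):

    /* Illustrative sketch of the per-cpu helper thread's loop: it only acts
     * when its cpu has nothing runnable in userland.
     */
    struct lwp;

    extern void        wait_for_wakeup(void);        /* block until poked */
    extern int         cpu_has_runnable_user(void);  /* userland LWP active? */
    extern struct lwp *global_queue_take_best(void); /* best LWP, or NULL */
    extern void        lwkt_schedule_lwp(struct lwp *lp);

    void
    sched_helper_loop_sketch(void)
    {
        for (;;) {
            wait_for_wakeup();
            if (!cpu_has_runnable_user()) {
                struct lwp *lp = global_queue_take_best();
                if (lp != NULL)
                    lwkt_schedule_lwp(lp); /* becomes this cpu's user thread */
            }
        }
    }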
There is a fourth mechanism for userland threads (well, also for LWKT
threads but mainly for userland threads), and that is the dynamic
priority and scheduler timer interrupt mechanic. This timer interrupt
occurs 100 times a second and adjusts the dynamic priority of the
currently running userland thread and also checks for round-robining of
same-priority userland threads. When it detects that a reschedule is
required it flags a user reschedule via need_user_resched().
Under this fourth mechanism cpu-bound user processes will tend to
round-robin on 1/25 second intervals, approximately.
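Roughly, the per-tick work looks like this (a sketch with made-up field and
helper names, not the real bsd4_schedulerclock() internals):

    /* Illustrative sketch: at 100 Hz, a round-robin quantum of 4 ticks gives
     * roughly the 1/25 second slices mentioned above.
     */
    #define ROUND_ROBIN_TICKS 4          /* 4 ticks at 100 Hz ~= 1/25 sec */

    struct lwp {
        int estcpu;                      /* grows while the LWP burns cpu */
        int rrcount;                     /* ticks used in current slice */
    };

    extern void flag_user_resched(void);       /* roughly need_user_resched() */
    extern int  better_lwp_is_waiting(struct lwp *lp);

    void
    schedulerclock_sketch(struct lwp *lp)
    {
        if (lp == NULL)
            return;              /* idle, or a pure kernel thread is running */

        lp->estcpu++;            /* dynamic priority is recomputed from this */

        if (++lp->rrcount >= ROUND_ROBIN_TICKS) {
            lp->rrcount = 0;
            flag_user_resched();      /* round-robin same-priority LWPs */
        } else if (better_lwp_is_waiting(lp)) {
            flag_user_resched();      /* a better LWP has shown up */
        }
    }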
There is a fifth mechanism that may not be apparent, and that is the
handling of an interactive user process. Such processes are nearly
always sleeping but have a high priority, and because they are sleeping
in kernel-land (not userland) they will get instantly scheduled via
the LWKT scheduler when they are woken up (e.g. by a keystroke), causing
a LWKT reschedule that switches to them over whatever user thread is
currently running. Thus the interactive userland thread will
immediately continue running in its kernel context and then when
attempting to return to userland it will determine if its dynamic
user priority is higher than the current designated user thread's
dynamic priority. If it isn't, it goes back onto the usched_bsd4's
global queue (effectively not returning to userland immediately); if it
is, it 'takes over' as the designated 'user' thread for that spot
and returns to userland immediately.
The key thing to note w/ the fifth mechanism is that it will instantly
interrupt and switch away from the current running thread on the given
cpu if that thread is running in userland, but then leaves it up to
the target thread (now LWKT scheduled and running) to determine, while
still in the kernel, whether matters should remain that way or not.
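The decision made on the way back to userland can be sketched like this
(simplified names; compare the real bsd4_acquire_curproc()):

    /* Illustrative sketch: the freshly woken thread either takes over as
     * this cpu's designated userland thread or parks itself on the global
     * queue. Lower numeric value = better priority in this sketch.
     */
    struct lwp { int dynamic_prio; };

    struct cpu_user_slot {
        struct lwp *uschedcp;            /* current designated user thread */
    };

    extern void global_queue_insert(struct lwp *lp);
    extern void block_until_selected(struct lwp *lp); /* descheduled from LWKT */

    void
    acquire_curproc_sketch(struct cpu_user_slot *slot, struct lwp *lp)
    {
        if (slot->uschedcp == NULL ||
            lp->dynamic_prio < slot->uschedcp->dynamic_prio) {
            slot->uschedcp = lp;         /* take over; return to userland */
        } else {
            global_queue_insert(lp);     /* wait our turn */
            block_until_selected(lp);
        }
    }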
:Another question: it is stated that only one user process at a time is on
:the lwkt scheduler queue. The more precise statement would be: only one
:user process that is running in user-space; there may also be other user
:processes that are running in kernel-space. Is that right?
Yes, this is correct. A user thread running in kernel space is removed
from the user scheduler if it blocks while in kernel space and becomes
a pure LWKT thread. Of course, all user threads are also LWKT threads,
so what I really mean to say here is that a user thread running in kernel
space is no longer subject to serialization of user threads on that
particular cpu.
When the user thread tries to return to userland then it becomes subject
to serialization again.
It is, in fact, possible to run multiple userland threads in userland
via the LWKT scheduler instead of just one, but we purposefully avoid
doing it because the LWKT scheduler is not really a dynamic scheduler.
People would notice severe lag and other issues when cpu-bound and
IO-bound processes are mixed if we were to do that.
--
All threads not subject to the userland scheduler (except for the
userland scheduler helper thread) run at a higher LWKT priority than
the (one) thread that might be running in userland.
There are two separate current-cpu notifications (also run indirectly
for remote cpus via an IPI to the remote cpu). One is called
need_lwkt_resched() and applies when a LWKT reschedule might be needed,
and the other is need_user_resched() and applies when a user LWP
reschedule might be needed.
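Both notifications boil down to per-cpu request bits examined at the
userland boundary. A sketch (the bit names are illustrative, not the exact
gd_reqflags definitions):

    #include <stdint.h>

    #define REQ_LWKT_RESCHED 0x0001   /* an LWKT reschedule may be needed */
    #define REQ_USER_RESCHED 0x0002   /* a userland reschedule may be needed */

    struct percpu {
        uint32_t reqflags;            /* compare gd_reqflags */
    };

    extern struct percpu *current_percpu(void);
    extern void send_flag_ipi(struct percpu *gd, uint32_t flag);

    void
    need_lwkt_resched_sketch(struct percpu *gd)
    {
        if (gd == current_percpu())
            gd->reqflags |= REQ_LWKT_RESCHED;
        else
            send_flag_ipi(gd, REQ_LWKT_RESCHED); /* set it on the remote cpu */
    }

    void
    need_user_resched_sketch(struct percpu *gd)
    {
        if (gd == current_percpu())
            gd->reqflags |= REQ_USER_RESCHED;
        else
            send_flag_ipi(gd, REQ_USER_RESCHED);
    }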
Lastly, another reminder: Runnable but not-currently-running userland
threads are placed in the usched_bsd4's global queue and are not LWKT
scheduled until the usched_bsd4 userland scheduler tells them to run.
If you have ten thousand cpu-bound userland threads and four cpus,
only four of those threads will be LWKT scheduled at a time (one on
each cpu), and the remaining 9996 threads will be left on the
usched_bsd4's global queue.
:When it detects that a reschedule is required it flags a user reschedule.
:But that CPU-bound process will continue running. Who actually interrupts it?
The IPI flags the user reschedule and then returns. This returns through
several subroutine levels until it gets to the actual interrupt dispatch
code. The interrupt dispatch code then returns from the interrupt by
calling 'doreti'.
See /usr/src/sys/platform/pc64/x86_64/exception.S
The doreti code is in:
/usr/src/sys/platform/pc64/x86_64/ipl.s
The doreti code is what handles popping the final stuff off the
supervisor stack and returning to userland.
However, this code checks gd_reqflags before returning to userland. If
it detects a flag has been set it does not return to userland but instead
pushes a trap context and calls the trap() function with T_ASTFLT
(see line 298 of ipl.s).
The trap() function is in platform/pc64/x86_64/trap.c
This function's user entry and exit code handles the scheduling issues
related to the trap. LWKT reschedule requests are handled simply by
calling lwkt_switch(). USER reschedule requests are handled by
releasing the user scheduler's current process and then re-acquiring it
(which can wind up placing the current process on the user global scheduler
queue and blocking if the current process is no longer the highest
priority runnable process).
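In other words, the boundary code does something along these lines (an
illustrative sketch of the control flow, not the actual doreti/trap() code):

    /* Sketch of what happens at the kernel->user boundary once doreti has
     * diverted into trap(T_ASTFLT). Simplified names throughout.
     */
    #define REQ_LWKT_RESCHED 0x0001
    #define REQ_USER_RESCHED 0x0002

    extern unsigned int get_reqflags(void);
    extern void clear_reqflag(unsigned int flag);
    extern void lwkt_switch_sketch(void);       /* pick another LWKT thread */
    extern void release_curproc_sketch(void);   /* give up the user slot */
    extern void reacquire_curproc_sketch(void); /* may block on global queue */

    void
    ast_userexit_sketch(void)
    {
        unsigned int flags = get_reqflags();

        if (flags & REQ_LWKT_RESCHED) {
            clear_reqflag(REQ_LWKT_RESCHED);
            lwkt_switch_sketch();               /* simply switch threads */
        }
        if (flags & REQ_USER_RESCHED) {
            clear_reqflag(REQ_USER_RESCHED);
            /* may end up back on the bsd4 global queue and block here if
             * we are no longer the highest priority runnable user process */
            release_curproc_sketch();
            reacquire_curproc_sketch();
        }
        /* only now does the thread actually return to userland */
    }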
This can be a long and confusing chain of events but that's basically how it
works. We only want to handle user-scheduling-related events on the
boundary between userland and kernel-land, and not in the middle of some
random kernel function executing on behalf of userland.
:I wasn't able to figure out what happens exactly in this case:
:- when a thread's time quantum expires, it will be flagged for a
:reschedule (a flag only, it doesn't receive any IPI). Will that thread
:remain on the lwkt queue of the CPU it was running on? Or will it end up
:on the userland scheduler queue again, subject to a new scheduling decision?
The time quantum is driven by the timer interrupt, which ultimately
calls bsd4_schedulerclock() on every cpu. At the time this function
is called the system is, of course, running in the kernel. That is,
the timer interrupt interrupted the user program.
So if this function flags a reschedule the flag will be picked up when
the timer interrupt tries to return to userland via doreti ... it will
detect the flag and instead of returning to userland it will generate
an AST trap to the trap() function. The standard userenter/userexit
code run by the trap() function handles the rest.
:This question doesn't cover the case when that thread is blocking
:(waiting for I/O or something else) - in this case the thread will block in
:the kernel and, when it wants to return to userland, will need to reacquire
:the cpu it was running on in userland.
:
:I put this question because in FreeBSD's ULE they are always checking
:for a balanced distribution of processes across the topology. If they detect
:an imbalance they migrate threads accordingly. In our case, this would be
:achieved by the fact that processes will end up, at some moment in time, on
:the userland queue and would be subject to rescheduling (here the
:heuristics will take care not to unbalance the topology).
In our case if the currently running user process loses its current
cpu due to the time quantum running out, coupled with the fact that
other user processes want to run at a better or same priority, then
the currently running user process will wind up back on the bsd4
global queue. If other cpus are idle or running lower priority processes
then the process losing the current cpu will wind up being immediately
rescheduled on one of the other cpus.
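A sketch of that 'immediately rescheduled elsewhere' step (the idle-cpu mask
handling here is a simplification, not the real bsd4 cpumask code):

    #include <stdint.h>

    struct lwp;

    extern uint64_t idle_cpu_mask(void);          /* cpus with no user LWP */
    extern void     wakeup_helper_on(int cpuid);  /* poke that cpu's helper */
    extern void     global_queue_insert(struct lwp *lp);

    void
    requeue_and_kick_sketch(struct lwp *lp)
    {
        uint64_t mask;

        global_queue_insert(lp);          /* back onto the bsd4 global queue */

        mask = idle_cpu_mask();
        if (mask != 0) {
            int cpuid = __builtin_ctzll(mask);   /* lowest-numbered idle cpu */
            wakeup_helper_on(cpuid);      /* its helper thread will pull the
                                           * LWP off the global queue */
        }
    }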
:Another issue regards the lwkt scheduler. It is somewhat of a static
:scheduler; a thread from one cpu can't migrate to another. We intend to
:leave it that way and implement all our SMT heuristics in the userland
:scheduler - or do you have some ideas where we would gain some benefits
:in the SMT cases?
A LWKT thread cannot migrate preemptively (this would be similar to
'pinning' on FreeBSD, except in DragonFly kernel threads are always
pinned). A thread can be explicitly migrated. Threads subject to
the userland scheduler can migrate between cpus by virtue of being
placed back on the bsd4 global queue (and descheduled from LWKT).
Such threads can be pulled off the bsd4 global queue by the bsd4
userland scheduler from any cpu.
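An explicit (non-preemptive) migration roughly looks like this - the thread
asks to move itself and the destination cpu finishes the job (illustrative
names, not the real LWKT migration primitives):

    struct thread { int owner_cpu; };

    extern void deschedule_self(struct thread *td);  /* leave this cpu's queue */
    extern void send_setcpu_ipi(int dest_cpu, struct thread *td);
    extern void switch_away(void);        /* we resume later on the new cpu */

    void
    migrate_self_sketch(struct thread *td, int dest_cpu)
    {
        if (td->owner_cpu == dest_cpu)
            return;                       /* nothing to do */

        deschedule_self(td);              /* stop being runnable here */
        td->owner_cpu = dest_cpu;
        send_setcpu_ipi(dest_cpu, td);    /* reschedule over there */
        switch_away();                    /* execution continues on dest_cpu */
    }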
In DragonFly kernel threads are typically dedicated to particular
cpus but are also typically replicated across multiple cpus. Work
is pre-partitioned and sent to particular threads which removes most
of the locking requirements.
In FreeBSD kernel threads run on whatever cpus are available, but
try to localize heuristically to maintain cache locality. Kernel
threads typically pick work off of global queues and tend to need
heavier mutex use.
So, e.g. in a 'ps ax' on DragonFly you will see several threads which are
per-cpu, like the crypto threads, the syncer threads, the softclock
threads, the network protocol threads (netisr threads), and so forth.
--
Now there are issues with both ways of doing things. In DragonFly we
have problems when a kernel thread needs a lot of cpu... for example,
if a crypto thread needs a ton of cpu it can starve out other kernel
threads running on that particular cpu. Threads with a user context
will eventually migrate off the cpu with the cpu-hungry kernel thread
but the mechanism doesn't work as well as it could.
So, in DragonFly, we might have to revamp the LWKT scheduler somewhat
to handle these cases and allow kernel threads to un-pin when doing
certain cpu-intensive operations. Most kernel threads don't have this
issue. Only certain threads (like the crypto threads) are heavy cpu
users.
I am *very* leery of using any sort of global scheduler queue for LWKT.
Very, very leery. It just doesn't scale well to many-cpu systems.
For example, on monster.dragonflybsd.org one GigE interface can vector
8 interrupts to 8 different cpus with a rate limiter on each one...
but at, say, 10,000hz x 8 that's already 80,000 interrupts/sec globally.
It's a drop in the bucket when the schedulers are per-cpu but starts to
hit bottlenecks when the schedulers have a global queue.
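The scaling difference can be boiled down to this (a userspace sketch, not
the kernel data structures):

    /* A per-cpu queue is only ever touched by its owning cpu (remote requests
     * arrive as IPIs and are executed locally), so it needs no lock at all.
     * A single global queue serializes every cpu on one lock.
     */
    #include <pthread.h>

    struct thread;

    struct percpu_runq {
        struct thread *head;          /* only the owning cpu links/unlinks
                                       * here, so no lock is required */
    };

    struct global_runq {
        pthread_mutex_t lock;         /* e.g. 8 cpus * 10,000 events/sec
                                       * = 80,000 acquisitions/sec on ONE lock */
        struct thread  *head;
    };

    extern void link_thread(struct thread **head, struct thread *td);

    void
    enqueue_percpu(struct percpu_runq *rq, struct thread *td)
    {
        /* no synchronization: we are the only cpu that touches rq */
        link_thread(&rq->head, td);
    }

    void
    enqueue_global(struct global_runq *rq, struct thread *td)
    {
        pthread_mutex_lock(&rq->lock);    /* every cpu contends here */
        link_thread(&rq->head, td);
        pthread_mutex_unlock(&rq->lock);
    }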
In FreeBSD they have numerous issues with preemptive cpu switching
of threads running in the kernel due to the mutex model, even with
the fancy priority inheritance features they've added. They also
have to explicitly pin a thread in order to access per-cpu globaldata,
or depend on an atomic access. And FreeBSD depends on mutexes in
critical paths while DragonFly only needs critical sections in similar
paths due to its better pre-partitioning of work. DragonFly has better
cpu localization but FreeBSD has better load management.
However, both OSs tend to run into problems with interactivity from
edge cases under high cpu loads.
Enjoy reading :),
Mihai Carabas