GSoC: Add SMT/HT awareness to DragonFlyBSD scheduler

Tue Sep 11 08:37:49 PDT 2012

Hi,

As I promised in one of my previous e-mails, I will post on the list some of
my discussions with Matthew regarding the scheduling subsystems, may
be it will be useful to someone else:

:Let say we have a user CPU bound process (batchy one). The
:bsd4_schedulerclock will notice this and will mark a need for user
:rescheduling (need_user_resched();). This flag is only checked in
:the bsd4_acquire_curproc (which is called when a process returns from
:kernel-space....the code from here is clear to me) and from lwkt_switch().
:My question is, where in the code is called the lwkt_switch, to switch to
:another thread if you have that CPU bound process running? Suppose that CPU
:bound process is never blocking and never enters the kernel....which
:statement from the code is pushing it out of the CPU by calling
:lwkt_switch() ?

    There are two basic mechanisms at work here, and both are rather
    sensitive and easy to break (I've broken and then fixed the mechanism
    multiple times over the years).

    The first is that when a LWKT thread is scheduled to a cpu that sets
    a flag for that cpu indicating that a LWKT reschedule may be needed.
    The scheduling of a LWKT thread on a cpu always occurs on that cpu
    (so if scheduled from a different cpu an IPI interrupt/message is sent
    to the target cpu and the actual scheduling is done on the target cpu).

    This IPI represents an interrupt to whatever is running on that cpu,
    thus interrupting any user code (for example), which triggers a sequence
    of events which allow LWKT to schedule the higher priority thread.

    The second mechanism is when a userland thread (LWP) is scheduled.  It
    works very similarly to the first mechanism but is a bit more complex.
    The LWP is not directly scheduled on the target cpu.  Instead the LWP
    is placed in the global userland scheduler queue(s) and a target cpu
    is selected and an IPI is sent to that target cpu.  (see the
    'need_user_resched_remote' function in usched_bsd4.c, which is executed
    by the IPI message).  The IPI is only sent if the originating cpu
    determines that the LWP has a higher priority than the LWP currently
    running on the target cpu.

    There is a third mechanism for userland threads related to the helper
    thread (see 'sched_thread' in usched_bsd4.c).  The helper thread is
    intended to only handle scheduling a userland thread on its cpu when
    nothing is running on that cpu at all.

    There is a fourth mechanism for userland theads (well, also for LWKT
    threads but mainly for userland threads), and that is the dynamic
    priority and scheduler timer interrupt mechanic.  This timer interrupt
    occurs 100 times a second and adjusts the dynamic priority of the
    currently running userland thead and also checks for round-robining
    same-priority userland threads.  When it detects that a reschedule is
    required it flags a user reschedule via need_user_resched().

    Under this forth mechanism cpu-bound user processes will tend to
    round-robin on 1/25 second intervals, approximately.

    There is a fifth mechanism that may not be apparent, and that is the
    handling of an interactive user process.  Such processes are nearly
    always sleeping but have a high priority, but because they are sleeping
    in kernel-land (not userland), they will get instantly scheduled via
    the LWKT scheduler when they are woken up (e.g. by a keystroke), causing
    a LWKT reschedule that switches to them over whatever user thread is
    currently running.  Thus the interactive userland thread will
    immediately continue running in its kernel context and then when
    attempting to return to userland it will determine if its dynamic
    user priority is higher than the current designated user thread's
    dynamic priority.  If it isn't it goes back onto the usched_bsd4's
    global queue (effectively doesn't return to userland immediately), if it
    is then it 'takes over' as the designated 'user' thread for that spot
    and returns to userland immediately.

    The key thing to note w/ the fifth mechanism is that it will instantly
    interrupt and switch away from the current running thread on the given
    cpu if that thread is running in userland, but then leaves it up to
    the target thread (now LWKT scheduled and running) to determine, while
    still in the kernel, whether matters should remain that way or not.

:Another question: it is stated that on the lwkt scheduler queue is only a
:user process at a time. The correct statement is: only a user process that
:is running in user-space, also may be other user-processes that are running
:in kernel-space. Is that right?

    Yes, this is correct.  A user thread running in kernel space is removed
    from the user scheduler if it blocks while in kernel space and becomes
    a pure LWKT thread.  Of course, all user threads are also LWKT threads,
    so what I really mean t say here is that a user thread running in kernel
    space is no longer subject to serialization of user threads on that
    particular cpu.

    When the user thread tries to return to userland then it becomes subject
    to serialization again.

    It is, in fact, possible to run multiple userland threads in userland
    via the LWKT scheduler instead of just one, but we purposefully avoid
    doing it because the LWKT scheduler is not really a dynamic scheduler.
    People would notice severe lag and other issues when cpu-bound and
    IO-bound processes are mixed if we were to do that.

    --

    All threads not subject to the userland scheduler (except for the
    userland scheduler helper thread) run at a higher LWKT priority than
    the (one) thread that might be running in userland.

    There are two separate current-cpu notifications (also run indirectly
    for remote cpus via an IPI to the remote cpu).  One called
    need_lwkt_resched() and applies when a LWKT reschedule might be needed,
    and the other is need_user_resched() and applies when a user LWP
    reschedule might be needed.

    Lastly, another reminder:  Runnable but not-currently-running userland
    threads are placed in the usched_bsd4's global queue and are not LWKT
    scheduled until the usched_bsd4 userland scheduler tells them to run.
    If you have ten thousand cpu-bound userland threads and four cpus,
    only four of those threads will be LWKT scheduled at a time (one on
    each cpu), and the remaining 9996 threads will be left on the
    usched_bsd4's global queue.

:When it detects that a reschedule is
:required it flags a user reschedule. But that CPU bound process will
:continue
:running. Who actually interrupts it?

   The IPI flags the user reschedule and then returns.  This returns through
    several subroutine levels until it gets to the actual interrupt dispatch
    code.  The interrupt dispatch code then returns from the interrupt by
    calling 'doreti'.

    See /usr/src/sys/platform/pc64/x86_64/exception.S

    The doreti code is in:

        /usr/src/sys/platform/pc64/x86_64/ipl.s

    The doreti code is what handles poping the final stuff off the
    supervisor stack and returning to userland.

    However, this code checks gd_reqflags before returning to userland.  If
    it detects a flag has been set it does not return to userland but instead
    pushes a trap context and calls the trap() function with T_ASTFLT
    (see line 298 of ipl.s).

    The trap() function is in platform/pc64/x86_64/trap.c

    This function's user entry and exit code handles the scheduling issues
    related to the trap.  LWKT reschedule requests are handled simply by
    calling lwkt_switch().  USER reschedule requests are handled by
    releasing the user scheduler's current process and then re-acquiring it
    (which can wind up placing the current process on the user global scheduler
    queue and blocking if the current process is no longer the highest
    priority runnable process).

    This can be a confusing long chain of events but that's basically how it
    works.  We only want to handle user scheduling related events on the
    boundary between userland and kernelland, and not in the middle of some
    random kernelland function executing on behalf of userland.

:I wasn't able to figure out what happens exactly in this case:
:- when a thread has its time quantum expired, it will be asked for a
:reschedule (flags for a reschedule only, it doesn't receive any IPI). That
:thread will remain on the lwkt queue of the CPU was running on? Or will end
:up on the userland scheduler queue again to be subject of a new schedule?.

    The time quantum is driven by the timer interrupt, which ultimately
    calls bsd4_schedulerclock() on every cpu.  At the time this function
    is called the system is, of course, running in the kernel.  That is,
    the timer interrupt interrupted the user program.

    So if this function flags a reschedule the flag will be picked up when
    the timer interrupt tries to return to userland via doreti ... it will
    detect the flag and instead of returning to userland it will generate
    an AST trap to the trap() function.  The standard userenter/userexit
    code run by the trap() function handles the rest.

:This question doesn't include the case when that thread is blocking
:(waiting I/O or something else) - in this case the thread will block in the
:kernel and when will want to return to userland will need to reacquire that
:cpu was running on userland).
:
:I put this question because in the ULE FreeBSD, they are always checking
:for a balanced topology of the process execution. If they detect an
:imbalance they migrate threads accordingly. In our case, this would be
:induced by the fact that processes will end up, at some moment in time, on
:the userland queue and would be subject of rescheduling (here the
:heuristics will take care not to imbalance the topology).

    In our case if the currently running user process loses its current
    cpu due to the time quantum running out, coupled with the fact that
    other user processes want to run at a better or same priority, then
    the currently running user process will wind up back on the bsd4
    global queue.  If other cpus are idle or running lower priority processes
    then the process losing the current cpu will wind up being immediately
    rescheduled on one of the other cpus.

:Another issue is regarding the lwkt scheduler. It is somehow a static
:scheduler, a thread from one cpu can't migrate on the another. We intend to
:leave it that way, and all our SMT heuristics to be implemented in the
:userland scheduler or do you have some ideas that we would gain some
:benefits in the SMT cases?

    A LWKT thread cannot migrate preemptively (this would be similar to
    'pinning' on FreeBSD, except in DragonFly kernel threads are always
    pinned).  A thread can be explicitly migrated.  Threads subject to
    the userland scheduler can migrate between cpus by virtue of being
    placed back on the bsd4 global queue (and descheduled from LWKT).
    Such threads can be pulled off the bsd4 global queue by the bsd4
    userland scheduler from any cpu.

    In DragonFly kernel threads are typically dedicated to particular
    cpus but are also typically replicated across multiple cpus.  Work
    is pre-partitioned and sent to particular threads which removes most
    of the locking requirements.

    In FreeBSD kernel threads run on whatever cpus are available, but
    try to localize heuristically to maintain cache locality.  Kernel
    threads typically pick work off of global queues and tend to need
    heavier mutex use.

    So, e.g. in DragonFly a 'ps ax' you will see several threads which are
    per-cpu, like the crypto threads, the syncer threads, the softclock
    threads, the network protocol threads (netisr threads), and so forth.

    --

    Now there are issues with both ways of doing things.  In DragonFly we
    have problems when a kernel thread needs a lot of cpu... for example,
    if a crypto thread needs a ton of cpu it can starve out other kernel
    threads running on that particular cpu.  Threads with a user context
    will eventually migrate off the cpu with the cpu-hungry kernel thread
    but the mechanism doesn't work as well as it could.

    So, in DragonFly, we might have to revamp the LWKT scheduler somewhat
    to handle these cases and allow kernel threads to un-pin when doing
    certain cpu-intensive operations.  Most kernel threads don't have this
    issue.  Only certain threads (like the crypto threads) are heavy cpu
    users.

    I am *very* leery of using any sort of global scheduler queue for LWKT.
    Very, very leery.  It just doesn't scale well to many-cpu systems.
    For example, on monster.dragonflybsd.org one GigE interface can vector
    8 interrupts to 8 different cpus with a rate limiter on each one...
    but at, say, 10,000hz x 8 that's already 80,000 interrupts/sec globally.
    It's a drop in the bucket when the schedulers are per-cpu but starts to
    hit bottlenecks when the schedulers have a global queue.

    In FreeBSD they have numerous issues with preemptive cpu switching
    of threads running in the kernel due to the mutex model, even with
    the fancy priority inheritance features they've added.  They also
    have to explicitly pin a thread in order to access per-cpu globaldata,
    or depend on an atomic access.  And FreeBSD depends on mutexes in
    critical paths while DragonFly only need critical sections in similar
    paths due to its better pre-partitioning of work.  DragonFly has better
    cpu localization but FreeBSD has better load management.

    However, both OSs tend to run into problems with interactivity from
    edge cases under high cpu loads.

Enjoy reading:),
Mihai Carabas