git: kernel - scheduler adjustments for large ncpus / 48-core monster
Matthew Dillon
dillon at crater.dragonflybsd.org
Sat Dec 18 01:05:28 PST 2010
commit 2a4189307741dbcfbe11b31d6cc51a4fb39a8cde
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date: Sat Dec 18 00:42:52 2010 -0800
kernel - scheduler adjustments for large ncpus / 48-core monster
* Change the LWKT scheduler's token spinning algorithm. It used to
DELAY a short period of time and then simply retry, creating a lot
of contention between cpus trying to acquire a token.
Now the LWKT scheduler uses a FIFO index mechanism to resequence the
contending cpus into 1us retry slots using essentially just
atomic_fetchadd_int(), so it is very cache friendly. The spin-retry
thus generates a bounded amount of cache-coherency traffic regardless
of the number of cpus, and contending cpus will not trip over
each other.
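A minimal user-space sketch of the FIFO idea, under stated assumptions: each contending cpu takes a ticket with a single fetch-and-add (C11 atomic_fetch_add() standing in for atomic_fetchadd_int()), spins while only reading a shared index, and then opens the next slot for the cpu behind it. The struct and function names (spin_fifo, spin_fifo_enter() and so on) are hypothetical and not taken from lwkt_thread.c; the real code may sequence its slots differently.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical FIFO sequencer; names are illustrative, not from lwkt_thread.c. */
struct spin_fifo {
    atomic_uint windex;   /* next ticket to hand out */
    atomic_uint rindex;   /* ticket whose retry slot is currently open */
};

/* Take a place in line: one atomic fetch-and-add per contending cpu. */
static unsigned
spin_fifo_enter(struct spin_fifo *f)
{
    return atomic_fetch_add(&f->windex, 1u);
}

/* Estimated microseconds until this ticket's slot opens. Unsigned
   subtraction keeps the distance correct across counter wrap. */
static unsigned
spin_fifo_wait_us(struct spin_fifo *f, unsigned ticket, unsigned slot_us)
{
    assert(slot_us > 0);
    return (ticket - atomic_load(&f->rindex)) * slot_us;
}

/* Spin until our slot opens. Waiters only read rindex, so the cache
   line stays shared instead of bouncing on every retry. */
static void
spin_fifo_wait(struct spin_fifo *f, unsigned ticket)
{
    while (atomic_load_explicit(&f->rindex, memory_order_acquire) != ticket)
        ;   /* a kernel would use a pause/spin hint here */
}

/* Retry finished: open the next slot for the cpu behind us. */
static void
spin_fifo_leave(struct spin_fifo *f)
{
    atomic_fetch_add_explicit(&f->rindex, 1u, memory_order_release);
}
```

In this sketch each cpu writes rindex exactly once per slot, so coherency traffic grows with the number of slots handed out rather than with the number of cpus retrying at once.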
The new algorithm slightly regresses 4-cpu operation (~5% under heavy
contention) but significantly improves 48-cpu operation. It is also
flexible enough for further work down the road. The old algorithm
simply did not scale very well.
Add three sysctls:
sysctl lwkt.spin_method=1
    0   Allow a user thread to be scheduled on a cpu while kernel
        threads are contended on a token, using the IPI mechanism
        to interrupt the user thread and reschedule on decontention.
        This can potentially result in excessive IPI traffic.
    1   Allow a user thread to be scheduled on a cpu while kernel
        threads are contended on a token, and reschedule on the next
        clock tick (typically 100 Hz). Decontention will NOT generate
        any IPI traffic. DEFAULT.
    2   Do not allow a user thread to be scheduled on a cpu while
        kernel threads are contended. Should not be used normally;
        for debugging only.
sysctl lwkt.spin_delay=1
    Slot time in microseconds, default 1us. Recommended values are
    1 or 2, but not longer.
sysctl lwkt.spin_loops=10
    Number of times the LWKT scheduler loops on contended threads
    before giving up and allowing an idle-thread HLT. To wake a cpu
    from HLT, decontention must send it an IPI, so you do not want
    to set this value too small. Values between 10 and 100 are
    recommended.
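The three tunables can be modelled as a small decision function. This is a hypothetical sketch, not the kernel's logic: the struct, enum and function names are invented, and it assumes that in modes 0 and 1 a runnable user thread is preferred over spinning, and that HLT is only allowed once spin_loops retries have been used up.

```c
#include <assert.h>

/* Hypothetical mirror of the three sysctls; defaults match the commit. */
struct spin_tunables {
    int method;     /* lwkt.spin_method: 0, 1 or 2 */
    int delay_us;   /* lwkt.spin_delay: retry slot time in microseconds */
    int loops;      /* lwkt.spin_loops: retries before allowing HLT */
};

static const struct spin_tunables spin_defaults = { 1, 1, 10 };

enum spin_action {
    SPIN_RUN_USER,  /* schedule the runnable user thread */
    SPIN_RETRY,     /* wait one delay_us slot and retry the token */
    SPIN_HALT       /* give up; HLT until a decontention IPI arrives */
};

/* Decide what a cpu does while its kernel threads are contended
   (assumed policy, see lead-in). */
static enum spin_action
spin_next_action(const struct spin_tunables *t, int loops_done,
                 int user_runnable)
{
    assert(t->method >= 0 && t->method <= 2);
    if (user_runnable && t->method != 2)
        return SPIN_RUN_USER;           /* modes 0 and 1 only */
    if (loops_done < t->loops)
        return SPIN_RETRY;
    return SPIN_HALT;
}

/* Modes 0 and 1 differ in how the user thread is preempted once the
   token frees up: an IPI (mode 0) or the next clock tick (mode 1). */
static int
spin_decontention_uses_ipi(const struct spin_tunables *t)
{
    return t->method == 0;
}
```

With the default mode 1, the user thread may keep running for up to one clock tick after the token frees up, which at 100 Hz is at most 10 ms; mode 0 trades that latency for IPI traffic.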
* Redo the token decontention algorithm. Use a new gd_reqflags flag,
RQF_WAKEUP, coupled with RQF_AST_LWKT_RESCHED in the per-cpu globaldata
structure to determine which cpus actually need to be IPI'd on token
decontention (to wake up their idle threads stuck in HLT).
This requires that all gd_reqflags operations use locked atomic
instructions rather than non-locked instructions.
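One plausible reading of the RQF_WAKEUP scheme, sketched with C11 atomics: a cpu sets RQF_WAKEUP before halting, and the releasing cpu atomically tests-and-clears that bit on every cpu, flags only the halted ones with RQF_AST_LWKT_RESCHED, and collects them into an IPI mask. The bit values, the cpu_gd array and both functions are hypothetical; the real constants live in the sys/cpu/*/include/cpu.h files touched by this commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit values; the real RQF_* constants are in cpu.h. */
#define RQF_AST_LWKT_RESCHED 0x01u
#define RQF_WAKEUP           0x02u
#define SKETCH_MAXCPUS       64

/* Minimal stand-in for the per-cpu globaldata structure. */
struct globaldata {
    atomic_uint gd_reqflags;
};

/* Static storage: zero-initialized, a valid state for atomics. */
static struct globaldata cpu_gd[SKETCH_MAXCPUS];

/* An idle cpu about to HLT advertises that it needs an IPI to wake. */
static void
gd_prepare_halt(struct globaldata *gd)
{
    atomic_fetch_or(&gd->gd_reqflags, RQF_WAKEUP);
}

/* On decontention: test-and-clear RQF_WAKEUP on each cpu, request an
   LWKT reschedule on the halted ones, and return the mask to IPI. */
static uint64_t
token_decontend_mask(int ncpus)
{
    uint64_t mask = 0;

    assert(ncpus > 0 && ncpus <= SKETCH_MAXCPUS);
    for (int i = 0; i < ncpus; ++i) {
        unsigned old = atomic_fetch_and(&cpu_gd[i].gd_reqflags,
                                        ~RQF_WAKEUP);
        if (old & RQF_WAKEUP) {
            atomic_fetch_or(&cpu_gd[i].gd_reqflags, RQF_AST_LWKT_RESCHED);
            mask |= (uint64_t)1 << i;
        }
    }
    return mask;
}
```

Because the releasing cpu now modifies another cpu's gd_reqflags, a plain read-modify-write (a non-locked `flags |= bit`) could silently lose a bit set or cleared concurrently by the owning cpu; that is the hazard the locked-atomic requirement above closes.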
* Decontention IPIs are a last-gasp effort if the LWKT scheduler has spun
too many times. Under normal conditions, even under heavy contention,
actual IPIing should be minimal.
Summary of changes:
sys/cpu/i386/include/cpu.h | 24 +-
sys/cpu/x86_64/include/cpu.h | 19 +-
sys/kern/lwkt_thread.c | 342 +++++++++++++++++++--------
sys/kern/lwkt_token.c | 92 +++++++-
sys/platform/pc32/i386/trap.c | 6 +-
sys/platform/pc32/isa/intr_machdep.c | 2 +-
sys/platform/pc32/isa/ipl_funcs.c | 2 +-
sys/platform/pc64/isa/intr_machdep.c | 2 +-
sys/platform/pc64/x86_64/ipl_funcs.c | 2 +-
sys/platform/pc64/x86_64/trap.c | 6 +-
sys/platform/vkernel/i386/trap.c | 2 +-
sys/platform/vkernel/platform/ipl_funcs.c | 2 +-
sys/platform/vkernel/platform/machintr.c | 8 +-
sys/platform/vkernel64/platform/ipl_funcs.c | 2 +-
sys/platform/vkernel64/platform/machintr.c | 8 +-
sys/platform/vkernel64/x86_64/trap.c | 6 +-
sys/vm/vm_fault.c | 26 ++-
sys/vm/vnode_pager.c | 2 +-
18 files changed, 381 insertions(+), 172 deletions(-)
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/2a4189307741dbcfbe11b31d6cc51a4fb39a8cde
--
DragonFly BSD source repository