git: kernel - scheduler adjustments for large ncpus / 48-core monster

Matthew Dillon dillon at
Sat Dec 18 01:05:28 PST 2010

commit 2a4189307741dbcfbe11b31d6cc51a4fb39a8cde
Author: Matthew Dillon <dillon at>
Date:   Sat Dec 18 00:42:52 2010 -0800

    kernel - scheduler adjustments for large ncpus / 48-core monster

    * Change the LWKT scheduler's token spinning algorithm.  It used to
      DELAY a short period of time and then simply retry, creating a lot
      of contention between cpus trying to acquire a token.

      Now the LWKT scheduler uses a FIFO index mechanic to resequence the
      contending cpus into 1uS retry slots using essentially just
      atomic_fetchadd_int(), so it is very cache friendly.  The spin-retry
      thus has a bounded cache management traffic load regardless of
      the number of cpus and contending cpus will not be tripping over
      each other.

      The new algorithm slightly regresses 4-cpu operation (~5% under heavy
      contention) but significantly improves 48-cpu operation.  It is also
      flexible enough for further work down the road.  The old algorithm
      simply did not scale very well.

      Add three sysctls:
      sysctl lwkt.spin_method=1
    	0    Allow a user thread to be scheduled on a cpu while kernel
    	     threads are contended on a token, using the IPI mechanic
    	     to interrupt the user thread and reschedule on decontention.
    	     This can potentially result in excessive IPI traffic.
    	1    Allow a user thread to be scheduled on a cpu while kernel
    	     threads are contended on a token, reschedule on the next clock
    	     tick (100 Hz typically).  Decontention will NOT generate
    	     any IPI traffic.  DEFAULT.
    	2    Do not allow a user thread to be scheduled on a cpu while
    	     kernel threads are contended.  Should not be used normally,
    	     for debugging only.
      sysctl lwkt.spin_delay=1
    	Slot time in microseconds, default 1uS.  Recommended values are
    	1 or 2 but not longer.
      sysctl lwkt.spin_loops=10
    	Number of times the LWKT scheduler loops on contended threads
    	before giving up and allowing an idle-thread HLT.  Waking up from
    	the HLT requires decontention to send an IPI, so you do not want
    	to set this value too small.  Values between 10 and 100 are
    	recommended.
    * Redo the token decontention algorithm.  Use a new gd_reqflags flag,
      RQF_WAKEUP, coupled with RQF_AST_LWKT_RESCHED in the per-cpu globaldata
      structure to determine which cpus actually need to be IPI'd on token
      decontention (to wake up their idle threads stuck in HLT).
      This requires that all gd_reqflags operations use locked atomic
      instructions rather than non-locked instructions.
    * Decontention IPIs are a last-gasp effort if the LWKT scheduler has spun
      too many times.  Under normal conditions, even under heavy contention,
      actual IPIing should be minimal.
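
The FIFO slot idea from the first item can be illustrated in user space. This is only a rough sketch using C11 atomics in place of the kernel's atomic_fetchadd_int(); the names contend_index, claim_slot() and slot_delay_us() are hypothetical and do not come from lwkt_thread.c, and the real scheduler may compute its retry slots differently.

```c
#include <stdatomic.h>

/*
 * Hypothetical shared FIFO index.  Every contending cpu bumps it once,
 * so the only cache line bounced between cpus is this counter.
 */
static atomic_uint contend_index;

/* Claim the next retry slot with a single atomic fetch-add. */
static unsigned claim_slot(void)
{
    return atomic_fetch_add(&contend_index, 1);
}

/*
 * Microseconds the holder of `slot` waits before retrying, given the
 * oldest slot still outstanding (`base`) and the slot width in
 * microseconds (the role played by lwkt.spin_delay).  Unsigned
 * subtraction keeps the distance correct if the index wraps.
 */
static unsigned slot_delay_us(unsigned slot, unsigned base, unsigned slot_us)
{
    return (slot - base) * slot_us;
}
```

With a 1uS slot width, the cpus that claim slots 0, 1 and 2 against a base of 0 would retry after 0, 1 and 2 microseconds respectively, instead of all retrying at the same moment.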
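
The RQF_WAKEUP handshake can be sketched the same way. Everything below is an assumption made for illustration: the bit values, the reqflags[] array standing in for each cpu's gd_reqflags, and the helper names are invented, and the exact way the commit couples RQF_WAKEUP with RQF_AST_LWKT_RESCHED may differ. The point is only that once other cpus modify a cpu's flags word, every update has to be a locked atomic read-modify-write.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS                4      /* illustrative cpu count */
#define RQF_WAKEUP           0x1u   /* hypothetical bit values */
#define RQF_AST_LWKT_RESCHED 0x2u

/* Stand-in for the per-cpu gd_reqflags words. */
static atomic_uint reqflags[NCPUS];

/* Idle thread: advertise that this cpu needs an IPI to leave HLT. */
static void idle_prepare_halt(int cpu)
{
    atomic_fetch_or(&reqflags[cpu], RQF_WAKEUP);
}

/*
 * Token decontention on one cpu: request a reschedule, then atomically
 * test-and-clear RQF_WAKEUP.  Returns true only if that cpu had
 * announced it was halting, i.e. only then is an IPI needed.
 */
static bool decontend_cpu(int cpu)
{
    atomic_fetch_or(&reqflags[cpu], RQF_AST_LWKT_RESCHED);
    unsigned old = atomic_fetch_and(&reqflags[cpu], ~RQF_WAKEUP);
    return (old & RQF_WAKEUP) != 0;
}

/* Walk all cpus and count how many IPIs would actually be sent. */
static int decontend_all(void)
{
    int ipis = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (decontend_cpu(cpu))
            ipis++;
    }
    return ipis;
}
```

In this sketch a cpu that is still spinning never sets RQF_WAKEUP, so decontention costs it no IPI; only cpus that gave up and halted are interrupted.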

Summary of changes:
 sys/cpu/i386/include/cpu.h                  |   24 +-
 sys/cpu/x86_64/include/cpu.h                |   19 +-
 sys/kern/lwkt_thread.c                      |  342 +++++++++++++++++++--------
 sys/kern/lwkt_token.c                       |   92 +++++++-
 sys/platform/pc32/i386/trap.c               |    6 +-
 sys/platform/pc32/isa/intr_machdep.c        |    2 +-
 sys/platform/pc32/isa/ipl_funcs.c           |    2 +-
 sys/platform/pc64/isa/intr_machdep.c        |    2 +-
 sys/platform/pc64/x86_64/ipl_funcs.c        |    2 +-
 sys/platform/pc64/x86_64/trap.c             |    6 +-
 sys/platform/vkernel/i386/trap.c            |    2 +-
 sys/platform/vkernel/platform/ipl_funcs.c   |    2 +-
 sys/platform/vkernel/platform/machintr.c    |    8 +-
 sys/platform/vkernel64/platform/ipl_funcs.c |    2 +-
 sys/platform/vkernel64/platform/machintr.c  |    8 +-
 sys/platform/vkernel64/x86_64/trap.c        |    6 +-
 sys/vm/vm_fault.c                           |   26 ++-
 sys/vm/vnode_pager.c                        |    2 +-
 18 files changed, 381 insertions(+), 172 deletions(-)
