git: kernel - Fix bottlenecks that develop when many processes are running

Sat Aug 12 11:41:34 PDT 2017

commit e6b81333e1c342741087c579a9407da5bf4eca62
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date:   Sat Aug 12 10:26:17 2017 -0700

    kernel - Fix bottlenecks that develop when many processes are running
    
    * When a large number of processes or threads are running (in the tens of
      thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop.
      These bottlenecks do not develop when only a few thousand threads
      are present.
    
      By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured
      or manually set high enough, DFly can now handle hundreds of thousands
      of active processes running, polling, sleeping, whatever.
    
      Tested to around 400,000 discrete processes (no shared VM pages) on
      a 32-thread dual-socket Xeon system.  Each process is placed in a
      1/10 second sleep loop using umtx timeouts:
    
      baseline 		- (before changes), system bottlenecked starting
    			  at around the 30,000 process mark, eating all
    			  available cpu, high IPI rate from hash
    			  collisions, and other unrelated user processes
    			  bogged down due to the scheduling overhead.
    
      200,000 processes	- System settles down to 45% idle, and low IPI
    			  rate.
    
      220,000 processes	- System 30% idle and low IPI rate
    
      250,000 processes	- System 0% idle and low IPI rate
    
      300,000 processes	- System 0% idle and low IPI rate.
    
      400,000 processes	- Scheduler begins to bottleneck again after the
    			  350,000 while the process test is still in its
    			  fork/exec loop.
    
    			  Once all 400,000 processes are settled down,
    			  system behaves fairly well.  0% idle, modest
    			  IPI rate averaging 300 IPI/sec/cpu (due to
    			  hash collisions in the wakeup code).
    
    * More work will be needed to better handle processes with massively
      shared VM pages.
    
      It should also be noted that the system does a *VERY* good job
      allocating and releasing kernel resources during this test using
      discrete processes.  It can kill 400,000 processes in a few seconds
      when I ^C the test.
    
    * Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan.
      This bottleneck does not arise when large numbers of processes are
      running in usermode, because typically only one user process per cpu
      will be scheduled to LWKT.
    
      However, this bottleneck does arise when large numbers of threads
      are woken up in-kernel.  While in-kernel, a thread schedules directly
      to LWKT.  Round-robin operation tends to result in appends to the tail
      of the queue, so this optimization saves an enormous amount of cpu
      time when large numbers of threads are present.
    
    * Limit ncallout to ~5 minutes worth of ring.  The calculation code is
      primarily designed to allocate less space on low-memory machines,
      but will also cause an excessively-sized ring to be allocated on
      large-memory machines.  512MB was observed on a 32-way box.
    
    * Remove vm_map->hint, which had basically stopped functioning in a
      useful manner.  Add a new vm_map hinting mechanism that caches up to
      four (size, align) start addresses for vm_map_findspace().  This cache
      is used to quickly index into the linear vm_map_entry list before
      entering the linear search phase.
    
      This fixes a serious bottleneck that arises due to vm_map_findspace()'s
      linear scan if the vm_map_entry list when the kernel_map becomes
      fragmented, typically when the machine is managing a large number of
      processes or threads (in the tens of thousands or more).
    
      This will also reduce overheads for processes with highly fragmented
      vm_maps.
    
    * Dynamically size the action_hash[] array in vm/vm_page.c.  This array
      is used to record blocked umtx operations.  The limited size of the
      array could result in an excessive number of hash entries when a large
      number of processes/threads are present in the system.  Again, the
      effect is noticed as the number of threads exceeds a few tens of
      thousands.

Summary of changes:
 sys/kern/kern_synch.c  |   6 +-
 sys/kern/lwkt_thread.c |  25 ++++++-
 sys/kern/subr_param.c  |   7 ++
 sys/vm/vm_glue.c       |  10 +--
 sys/vm/vm_map.c        | 188 ++++++++++++++++++++++++++-----------------------
 sys/vm/vm_map.h        |  30 +++++++-
 sys/vm/vm_page.c       |  70 ++++++++++++------
 7 files changed, 213 insertions(+), 123 deletions(-)

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/e6b81333e1c342741087c579a9407da5bf4eca62


-- 
DragonFly BSD source repository