git: kernel - Another huge HUGE VM performance improvement for many-cores

Matthew Dillon dillon at crater.dragonflybsd.org
Fri Oct 28 09:49:04 PDT 2011


commit 027193eb611cf4258b84198529d8c88cb733d884
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date:   Fri Oct 28 09:32:51 2011 -0700

    kernel - Another huge HUGE VM performance improvement for many-cores
    
    This requires a bit of explanation.  The last single-point spinlocks in the
    VM system were the spinlocks for the inactive and active page queues.  Even
    though these two spinlocks are only held for a very short period of time,
    they can create a major point of contention when one has (e.g.) 48 cores
    all trying to run a VM fault at the same time.  This is an issue with
    multi-socket/many-cores systems and not so much an issue with single-socket
    systems.
    
    On many-cores systems the global VM fault rate was limited to around
    200-250K zfod faults per second prior to this commit on our 48-core
    opteron test box.  Since any single compiler process can run ~35K zfod
    faults per second, the maximum concurrency topped out at around 7
    concurrent processes.
    
    With this commit the global VM fault rate was measured at almost 900K zfod
    faults per second.  That's 900,000 page faults per second (about 3.5 GBytes
    per second).  Typical operation was consistently above 750K zfod faults per
    second.  Maximum concurrency at a 35K fault rate per process is thus
    increased from 7 processes to over 25 processes, and is probably approaching
    the physical memory bus limit, considering that one also has to take into
    account generic page-fault overhead above and beyond the memory impact on
    the page itself.
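
    (For reference, the bandwidth figure follows directly from the fault rate:
    assuming the standard 4 KiB page size, 900,000 zero-fill faults per second
    touch 900,000 x 4,096 bytes, i.e. roughly 3.7 GB (~3.4 GiB) per second,
    consistent with the ~3.5 GBytes/sec quoted above.)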
    
    I can't stress enough how important it is to avoid contention entirely when
    possible on a many-cores system.  In this case, even though the VM page
    queue spinlocks are only held for a very short period of time, the cache
    coherency traffic generated between physical cpu sockets when all the cores
    need to use the spinlock still created an enormous bottleneck.  Fixing this
    one spinlock easily doubled concurrent compiler performance on our 48-core
    opteron.
    
    * Fan out the PQ_INACTIVE and PQ_ACTIVE page queues from 1 queue to
      256 queues, each with its own spinlock (a rough sketch of the idea
      follows this list).
    
    * This removes the last major contention point in the VM system.
    
    * -j48 buildkernel test on monster (48-core opteron) now runs in 55 seconds.
      It was originally 167 seconds, and 101 seconds just prior to this commit.
    
      Concurrent compiles are now three times faster (a +200% improvement) on
      a many-cores box, with virtually no contention at all.
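
    To make the fan-out concrete, here is a minimal userland sketch of the
    technique.  It is illustrative only, not the actual sys/vm code: the names
    pagequeue, pq_select and pq_insert, and the pindex-based hash, are all
    hypothetical.  The point is that one globally locked queue serializes
    every insertion across all cpus, while hashing each page onto one of 256
    independently locked queues lets unrelated faults proceed in parallel and
    keeps any single lock's cache line from bouncing between sockets.

    #include <pthread.h>

    #define NQUEUES 256                     /* power of two, as in the commit */

    struct page {
            struct page *next;
            unsigned long pindex;           /* stand-in for the page's index */
    };

    struct pagequeue {
            pthread_spinlock_t spin;        /* one lock per queue */
            struct page *head;
            int count;
    } __attribute__((aligned(64)));         /* one cache line per queue */

    static struct pagequeue inactiveq[NQUEUES];

    static void
    pq_init(void)
    {
            for (int i = 0; i < NQUEUES; ++i) {
                    pthread_spin_init(&inactiveq[i].spin,
                                      PTHREAD_PROCESS_PRIVATE);
            }
    }

    /* Pick a queue from the page itself; no shared state is read here. */
    static struct pagequeue *
    pq_select(struct page *pg)
    {
            return (&inactiveq[pg->pindex & (NQUEUES - 1)]);
    }

    static void
    pq_insert(struct page *pg)
    {
            struct pagequeue *pq = pq_select(pg);

            pthread_spin_lock(&pq->spin);
            pg->next = pq->head;            /* touches only this queue */
            pq->head = pg;
            ++pq->count;
            pthread_spin_unlock(&pq->spin);
    }

    Two details carry most of the benefit: each queue is padded to its own
    cache line so the per-queue spinlocks never false-share, and the queue is
    chosen purely from the page itself so nothing shared has to be read just
    to pick one.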

Summary of changes:
 sys/vm/vm_contig.c    |   16 ++-
 sys/vm/vm_page.c      |   47 ++++---
 sys/vm/vm_page.h      |   10 +-
 sys/vm/vm_pageout.c   |  406 ++++++++++++++++++++++++++++++-------------------
 sys/vm/vm_swap.c      |  112 ++++++++------
 sys/vm/vm_swapcache.c |   81 ++++++----
 6 files changed, 400 insertions(+), 272 deletions(-)

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/027193eb611cf4258b84198529d8c88cb733d884


-- 
DragonFly BSD source repository