Sun Dec 4 09:21:45 2016 -0800

commit 3536c341ffda90bfdcc8310ef91231f18c81db52
Author: Matthew Dillon <dillon at>
Date:   Sun Dec 4 09:21:45 2016 -0800

    kernel - Overhaul namecache operations to reduce SMP contention
    * Overhaul the namecache code to remove a significant amount of cacheline
      ping-ponging from the namecache paths.  This primarily affects
      multi-socket systems but also improves multi-core single-socket systems.
      Cacheline ping-ponging in the critical path can constrict a multi-core
      system to roughly 1-2M operations per second running through that path.
      For example, even if looking up different paths or stating different
      files, even something as simple as a non-atomic ++global_counter
      seriously derates performance when it is being executed on all cores at
      once.
      In the simple non-conflicting single-component stat() case, this improves
      performance from ~2.5M/second to ~25M/second on a 4-socket 48-core opteron
      and has a similar improvement on a 2-socket 32-thread xeon, as well as
      significantly improving namecache perf on single-socket multi-core systems.
    * Remove the vfs.cache.numcalls and vfs.cache.numchecks debugging counters.
      These global counters caused significant cache ping-ponging and were only
      being used for debugging.
    * Implement a poor-man's referenced-structure pcpu cache for struct mount
      and struct namecache.  This allows atomic ops on the ref-count for these
      structures to be avoided in certain critical path cases.  For now limit
      to ncdir and nrdir (nrdir particularly, which is usually the same across
      nearly all processes in the system).  Eventually we will want to expand
      this cache to handle more cases.
      Because we are holding refs persistently, add a bit of infrastructure to
      clear the cache as necessary (e.g. when doing an unmount).
    * Shift the 'cachedvnodes' global to a per-cpu accumulator, then roll-up
      the counter back to the global approximately once per second.  The code
      critical paths adjust only the per-cpu accumulator, removing another
      global cache ping-pong from nearly all vnode and nlookup paths.
    * The nlookup structure now 'borrows' the ucred reference from td->td_ucred
      instead of crhold()ing it, removing another global ref/unref from all
      nlookup paths.
    * We have a large hash table of spinlocks for nchash; add a little pad
      from 24 to 32 bytes.  It's ok that two spin locks share the same cache
      line (it's a huge table); adding the pad cleans up cacheline crossings.
    * Add a bit of pad to put mount->mnt_refs on its own cache-line versus
      prior fields which are accessed shared.  But don't bother isolating it
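
The per-cpu accumulator with periodic roll-up described above for 'cachedvnodes' can be illustrated with a minimal userland sketch. This is not the kernel code from this commit: the names `pcpu_accum`, `pcpu_adjust`, `pcpu_rollup`, `NCPUS`, and the 64-byte `CACHE_LINE` are all hypothetical. It is single-threaded for clarity; a real kernel would need each slot to be touched only by its own CPU and would fold the deltas in with appropriate atomics.

```c
#include <assert.h>

#define NCPUS      4    /* hypothetical CPU count for this sketch */
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One accumulator per CPU, aligned so no two CPUs share a cache line. */
struct pcpu_accum {
    _Alignas(CACHE_LINE) long delta;
};

static struct pcpu_accum pcpu[NCPUS];
static long global_count;   /* stand-in for a global such as cachedvnodes */

/* Hot path: adjust only the calling CPU's slot, never the global. */
static void
pcpu_adjust(int cpu, long n)
{
    assert(cpu >= 0 && cpu < NCPUS);
    pcpu[cpu].delta += n;
}

/* Slow path (e.g. about once per second): fold all deltas into the global. */
static long
pcpu_rollup(void)
{
    for (int i = 0; i < NCPUS; ++i) {
        global_count += pcpu[i].delta;
        pcpu[i].delta = 0;
    }
    return global_count;
}
```

Between roll-ups the global value is stale by at most the sum of the outstanding per-cpu deltas, which is the trade-off accepted in exchange for keeping the hot path off a shared cache line.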

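The effect of padding the nchash spinlock entries from 24 to 32 bytes can be checked with simple arithmetic. The sketch below assumes a 64-byte cache line and a table whose base is cache-line aligned; `entry_crosses_line` and `count_crossings` are hypothetical helper names, not functions from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* True if the entry at 'index' straddles a cache-line boundary. */
static bool
entry_crosses_line(size_t index, size_t entry_size)
{
    size_t first = index * entry_size;      /* offset of first byte */
    size_t last = first + entry_size - 1;   /* offset of last byte */

    return (first / CACHE_LINE) != (last / CACHE_LINE);
}

/* Count how many of the first 'nentries' entries straddle a boundary. */
static size_t
count_crossings(size_t nentries, size_t entry_size)
{
    size_t n = 0;

    assert(entry_size > 0);
    for (size_t i = 0; i < nentries; ++i) {
        if (entry_crosses_line(i, entry_size))
            ++n;
    }
    return n;
}
```

With 24-byte entries, every entry whose span includes a multiple of 64 is split across two lines; with 32-byte entries, exactly two entries fit per line and none are split.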
Summary of changes:
 sys/kern/vfs_cache.c    | 216 ++++++++++++++++++++++++++++++++++++++----------
 sys/kern/vfs_lock.c     |  36 ++++++--
 sys/kern/vfs_mount.c    |  12 ++-
 sys/kern/vfs_nlookup.c  |  20 +++--
 sys/kern/vfs_syscalls.c |   8 ++
 sys/sys/globaldata.h    |   3 +-
 sys/sys/mount.h         |   1 +
 sys/sys/namecache.h     |   2 +
 sys/sys/nchstats.h      |   4 +-
 sys/sys/nlookup.h       |   3 +-
 sys/sys/vnode.h         |   2 +-
 11 files changed, 240 insertions(+), 67 deletions(-)

