git: kernel - Overhaul namecache operations to reduce SMP contention
Matthew Dillon
dillon at crater.dragonflybsd.org
Sun Dec 4 09:41:15 PST 2016
commit 3536c341ffda90bfdcc8310ef91231f18c81db52
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date: Sun Dec 4 09:21:45 2016 -0800
kernel - Overhaul namecache operations to reduce SMP contention
* Overhaul the namecache code to remove a significant amount of cacheline
ping-ponging from the namecache paths. This primarily affects
multi-socket systems but also improves multi-core single-socket systems.
Cacheline ping-ponging in the critical path can constrict a multi-core
system to roughly 1-2M operations per second running through that path.
For example, even when looking up different paths or stat()ing different
files, something as simple as a non-atomic ++global_counter seriously
derates performance when it is executed on all cores at once.
In the simple non-conflicting single-component stat() case, this improves
performance from ~2.5M/second to ~25M/second on a 4-socket 48-core Opteron,
with a similar improvement on a 2-socket 32-thread Xeon, and it also
significantly improves namecache performance on single-socket multi-core
systems.
* Remove the vfs.cache.numcalls and vfs.cache.numchecks debugging counters.
These global counters caused significant cache ping-ponging and were only
being used for debugging.
* Implement a poor-man's referenced-structure pcpu cache for struct mount
and struct namecache. This allows atomic ops on the ref-count for these
structures to be avoided in certain critical path cases. For now, limit
the cache to ncdir and nrdir (nrdir particularly, which is usually the
same across nearly all processes in the system). Eventually we will want
to expand this cache to handle more cases.
Because we are holding refs persistently, add a bit of infrastructure to
clear the cache as necessary (e.g. when doing an unmount).
* Shift the 'cachedvnodes' global to a per-cpu accumulator, then roll the
counter up into the global approximately once per second. The critical
paths adjust only the per-cpu accumulator, removing another global cache
ping-pong from nearly all vnode and nlookup paths.
* The nlookup structure now borrows the ucred reference from td->td_ucred
instead of crhold()ing it, removing another global ref/unref from all
nlookup paths.
* We have a large hash table of spinlocks for nchash; add a little pad,
growing each entry from 24 to 32 bytes. It's OK for two spinlocks to
share the same cache line (it's a huge table); adding the pad cleans up
cacheline-crossing cases.
* Add a bit of pad to put mount->mnt_refs on its own cache line versus
the prior fields, which are accessed shared. But don't bother isolating
it completely.
Summary of changes:
sys/kern/vfs_cache.c | 216 ++++++++++++++++++++++++++++++++++++++----------
sys/kern/vfs_lock.c | 36 ++++++--
sys/kern/vfs_mount.c | 12 ++-
sys/kern/vfs_nlookup.c | 20 +++--
sys/kern/vfs_syscalls.c | 8 ++
sys/sys/globaldata.h | 3 +-
sys/sys/mount.h | 1 +
sys/sys/namecache.h | 2 +
sys/sys/nchstats.h | 4 +-
sys/sys/nlookup.h | 3 +-
sys/sys/vnode.h | 2 +-
11 files changed, 240 insertions(+), 67 deletions(-)
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/3536c341ffda90bfdcc8310ef91231f18c81db52
--
DragonFly BSD source repository
More information about the Commits mailing list