Update on recent SMP contention work

Wed Oct 16 23:34:44 PDT 2013

    A whole lot more work on reducing SMP contention has gone into master
    recently and will be in the upcoming release:

	Name cache shared lock fix.  Most concurrent path lookups are now
	non-contending through the entire code stack.

	More use of shared spinlocks in the pmap code (+ fixes).  Most
	concurrent VM faults are now non-contending through the entire
	code stack.

	Filesystem syncer improvements.  The syncer now tracks vnodes with
	dirty inodes and with possibly dirty VM pages (via mmap), in
	addition to vnodes with dirty buffer cache buffers.  nfs, tmpfs,
	and hammer now support a mechanism to scan the tracked vnodes instead
	of scanning all vnodes.  This makes 'sync' and the automatic
	filesystem syncer much more efficient.

	Fork and fork/exec code paths are now vastly more efficient due to
	greatly reduced lock contention.  This is primarily driven by avoiding
	unnecessary tracking of VM shadow chains on the terminal vnode (which
	is inevitably the executable binary), allowing shared locks to be
	used for terminal vnodes during a fork or exec.

	The per-cpu process reaper (handles exit/wait) now uses a per-cpu
	token rather than a global token.

	Various pid-related improvements, such as removing the totally
	unnecessary acquisition of a global token when looking up your
	own process pid.
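
    To illustrate why scanning only tracked vnodes helps the syncer, here
    is a minimal user-space sketch in C.  It is not the DragonFly code:
    toy_vnode, toy_mount, toy_mark_dirty, toy_sync_tracked, and
    toy_sync_scan_all are hypothetical names invented for this example,
    and locking is omitted entirely.  The only point is that the tracked
    pass does work proportional to the number of dirty vnodes, while the
    full scan does work proportional to the total number of vnodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified vnode; not the kernel's struct vnode. */
struct toy_vnode {
    bool dirty;                   /* dirty inode, buffers, or mmap'd pages */
    bool on_dirty_list;           /* already linked into the tracked list? */
    struct toy_vnode *dirty_next; /* next vnode on the tracked list */
};

/* Hypothetical mount: every vnode, plus a list of the tracked ones. */
struct toy_mount {
    struct toy_vnode *vnodes;     /* array of all vnodes on the mount */
    size_t nvnodes;
    struct toy_vnode *dirty_head; /* head of the tracked (dirty) list */
};

/* Mark a vnode dirty and link it into the tracked list at most once. */
static void toy_mark_dirty(struct toy_mount *mp, struct toy_vnode *vp)
{
    vp->dirty = true;
    if (!vp->on_dirty_list) {
        vp->on_dirty_list = true;
        vp->dirty_next = mp->dirty_head;
        mp->dirty_head = vp;
    }
}

/*
 * Sync by walking only the tracked list.  Returns how many vnodes were
 * examined; *flushed receives how many were actually dirty.
 */
static size_t toy_sync_tracked(struct toy_mount *mp, size_t *flushed)
{
    size_t examined = 0;

    *flushed = 0;
    while (mp->dirty_head != NULL) {
        struct toy_vnode *vp = mp->dirty_head;

        assert(vp->on_dirty_list);
        mp->dirty_head = vp->dirty_next;
        vp->dirty_next = NULL;
        vp->on_dirty_list = false;
        examined++;
        if (vp->dirty) {          /* may already be clean after a full scan */
            vp->dirty = false;    /* stand-in for the real flush */
            (*flushed)++;
        }
    }
    return examined;
}

/* Sync by scanning every vnode on the mount, dirty or not. */
static size_t toy_sync_scan_all(struct toy_mount *mp, size_t *flushed)
{
    size_t examined = 0;

    *flushed = 0;
    for (size_t i = 0; i < mp->nvnodes; i++) {
        examined++;
        if (mp->vnodes[i].dirty) {
            mp->vnodes[i].dirty = false;
            (*flushed)++;
        }
    }
    return examined;
}
```

    With 1000 vnodes of which 2 are dirty, the tracked pass examines 2
    vnodes and the full scan examines all 1000, even though both flush
    the same data.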

    The gist of this work is that there is now virtually no
    contention for most process-related activities, including heavy use
    of fork and fork/exec in 'make', '/bin/sh', and other utilities.
    Anything which forks and/or execs a lot (scripts, bulk builds, service
    daemons, etc) will now run as close to optimally as it is possible to
    run on a multi-core box.

    In particular with the last change to the namecache code, our bulk
    ports builds look pretty insane on monster (our 48-core Opteron box).
    Now during a bulk dports build, the load can pop up to 300 with concurrent
    compiles and of that 300 there will be 295 non-contending "R"un state
    processes and only 5 contending "D" state processes.  And it all happens
    with virtually *NO* IPI traffic between cpus.

    I consider this a fairly major milestone for the project.  We aren't
    finished, but this is a major leap in our ability to fully utilize the
    resources on larger multi-core systems.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>