NVMe performance improvements in master
Matthew Dillon
dillon at apollo.backplane.com
Sat Jul 16 23:47:45 PDT 2016
I've made significant progress on NVMe performance. On a brand-new
server (2 x Xeon 2620-v4, 16-core/32-thread, 128GB RAM) with PCIe-3
slots, testing two Samsung NVMe cards and one Intel NVMe card, I was
able to achieve 931,227+ IOPS with highly parallelized 4K random reads
from a urandom-filled partition (i.e. no compression, no dummy I/O full
of zeros). And the system is 75% idle while it's running.
>>> yes, you heard me, that's 931K IOPS <<<
I've compiled some before-and-after statistics here:
http://apollo.backplane.com/DFlyMisc/nvme_sys03.txt
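For a feel of the kind of load involved, here is a rough sketch of a
parallel 4K random-read test against a raw partition: many threads,
each doing 4K preads at random offsets. The device path, thread count,
and iteration counts below are placeholders, not the exact harness
behind the numbers above.

/* Sketch of a parallel 4K random-read benchmark (placeholder values) */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS  64                    /* degree of parallelism (placeholder) */
#define BLKSIZE   4096                  /* 4K random reads */
#define NREADS    100000                /* reads per thread (placeholder) */
#define PARTSIZE  (64ULL << 30)         /* assumed 64GB test partition */

static const char *devpath = "/dev/da0s1d";   /* placeholder raw partition */

static void *
worker(void *arg)
{
    char *buf;
    int fd = open(devpath, O_RDONLY);
    int i;

    if (fd < 0) {
        perror("open");
        return NULL;
    }
    if (posix_memalign((void **)&buf, BLKSIZE, BLKSIZE) != 0)
        return NULL;
    srandom((unsigned)(uintptr_t)arg);
    for (i = 0; i < NREADS; ++i) {
        /* random 4K-aligned offset within the partition */
        off_t off = (off_t)(random() % (PARTSIZE / BLKSIZE)) * BLKSIZE;

        if (pread(fd, buf, BLKSIZE, off) != BLKSIZE)
            perror("pread");
    }
    free(buf);
    close(fd);
    return NULL;
}

int
main(void)
{
    pthread_t td[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&td[i], NULL, worker, (void *)(uintptr_t)i);
    for (i = 0; i < NTHREADS; ++i)
        pthread_join(td[i], NULL);
    /* IOPS ~= NTHREADS * NREADS / wall-clock seconds (time(1) the run) */
    return 0;
}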
Progress has been made in the pbuf subsystem (used by physio) and in
the MMU page invalidation subsystem. Additional work will be needed to
achieve these results through a filesystem. The remaining roadblocks
to getting this stupendously huge level of performance through our
filesystems are as follows:
(1) Filesystem data check, de-duplication, and compression overheads.
(2) Kernel_pmap updates requiring SMP invalidations (an IPI to all cpus).
(3) Lock contention in the filesystem and buffer cache path.
(4) Hardware-level cache coherency load from atomic ops (see the sketch
    just below).
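To give a feel for (4), here is a trivial, purely hypothetical
user-space microbenchmark (not kernel code): many threads bumping a
single shared counter with atomic ops bounce its cache line between
cpus, while padded per-thread counters stay local.

/* Hypothetical illustration of cache-line bouncing from atomic ops */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 16
#define NLOOPS   10000000L

static atomic_long shared;              /* every cpu fights over this cache line */
static long local[NTHREADS][8];         /* padded: one cache line per thread */

static void *
worker(void *arg)
{
    long id = (long)(intptr_t)arg;
    long i;

    for (i = 0; i < NLOOPS; ++i) {
        atomic_fetch_add(&shared, 1);   /* coherency traffic on every increment */
        local[id][0]++;                 /* stays in this cpu's cache */
    }
    return NULL;
}

int
main(void)
{
    pthread_t td[NTHREADS];
    struct timespec t0, t1;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NTHREADS; ++i)
        pthread_create(&td[i], NULL, worker, (void *)(intptr_t)i);
    for (i = 0; i < NTHREADS; ++i)
        pthread_join(td[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%.2f sec\n", (t1.tv_sec - t0.tv_sec) +
                         (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

Comment out the atomic_fetch_add() line and re-run; the difference in
wall time is essentially the coherency load item (4) refers to.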
Though, in fact, the filesystem will generally not be doing 4K I/Os.
Most of these roadblocks, all except (1), drop away with 32K and 64K
I/Os.
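Some back-of-the-envelope arithmetic on why, using the 931K figure
above:

    931,227 IOPS x 4096 bytes    ~= 3.8 GBytes/sec
    3.8 GBytes/sec / 65536 bytes ~= 58,000 IOPS at 64K

The same bandwidth takes roughly 1/16 the number of I/O transactions,
so the per-I/O costs in (2), (3), and (4) are paid 16 times less often.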
-Matt
Matthew Dillon
<dillon at backplane.com>