segment pmap optimizations implemented for x86-64

Matthew Dillon dillon at
Wed Sep 12 18:44:26 PDT 2012

    Experimental pmap optimizations are now in master and can be enabled
    with a sysctl.

    These optimizations affect ANY shared RW or RO mmap() or sysv shared
    memory attachment which is a multiple of the segment size and which
    is segment-aligned, regardless of whether the processes involved are
    threaded, forked, or separately exec'd.

    mmap and sysv_shm now also segment-align conforming mappings
    automatically.  The segment size on x86-64 is 2MB.

    Essentially what this does is cause the page table pages across all
    the mappings to be shared.  The page table pages, NOT the terminal pages.
    The actual page tables themselves will be selectively shared.

    This is NOT using 2MB physical pages, at least not yet.  This solves
    the problem particularly with postgres databases in another fashion
    that happens to be generally useful throughout the system.  We might
    implement 2MB physical pages later on, leveraging the new infrastructure,
    but it isn't on my personal list for now.

    This is currently considered VERY experimental.  The feature is disabled
    by default but can be turned on at any time with the sysctl
    machdep.pmap_mmu_optimize.
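    For reference, toggling it from the shell looks like this (sysctl name
    taken from the commit message below):

```shell
# enable the experimental segment pmap optimization (off by default)
sysctl machdep.pmap_mmu_optimize=1

# disable it again
sysctl machdep.pmap_mmu_optimize=0
```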


Commit message below:

commit 921c891ecf560602acfc7540df7a760f171e389e

    kernel - Implement segment pmap optimizations for x86-64
    * Implement 2MB segment optimizations for x86-64.  Any shared read-only
      or read-write VM object mapped into memory, including physical objects
      (so both sysv_shm and mmap), which is a multiple of the segment size
      and segment-aligned can be optimized.
    * Enable with sysctl machdep.pmap_mmu_optimize=1
      Default is off for now.  This is an experimental feature.
    * It works as follows:  A VM object which is large enough will, when VM
      faults are generated, store a truncated pmap (PD, PT, and PTEs) in the
      VM object itself.
      VM faults whose vm_map_entry's can be optimized will cause the PTE, PT,
      and also the PD (for now) to be stored in a pmap embedded in the VM_OBJECT
      instead of in the process pmap.
      The process pmap then creates a PT entry in its PD page table that points
      to the PT page table page stored in the VM_OBJECT's pmap.
    * This removes nearly all page table overhead from fork()'d processes or
      even unrelated processes which massively share data via mmap() or sysv_shm.
      We still recommend using sysctl kern.ipc.shm_use_phys=1 (which is now
      the default), which also removes the PV entries associated with the
      shared pmap.  However, with this optimization PV entries are no longer
      a big issue since they will not be replicated in each process, only in
      the common pmap stored in the VM_OBJECT.
    * Features of this optimization:
      * Number of PV entries is reduced to approximately the number of live
        pages and no longer multiplied by the number of processes separately
        mapping the shared memory.
      * One process faulting in a page naturally makes the PTE available to
        all other processes mapping the same shared memory.  The other processes
        do not have to fault that same page in.
      * Page tables survive process exit and restart.
      * Once page tables are populated and cached, any new process that maps
        the shared memory will take far fewer faults because each fault will
        bring in an ENTIRE page table.  With Postgres and 64 clients, the VM
        fault rate was observed to drop from 1M faults/sec to fewer than 500
        at startup, and during the run the fault rate went from a steady
        decline through the hundreds of thousands to an almost instant drop
        to virtually zero VM faults.
      * We no longer have to depend on sysv_shm to optimize the MMU.
      * CPU caches will do a better job caching page tables since most of
        them are now themselves shared.  Even when we invltlb, more of the
        page tables will be in the L1, L2, and L3 caches.

					Matthew Dillon 
					<dillon at>
