Recent MMU, Cached I/O, and Scheduler work now in master
dillon at apollo.backplane.com
Tue Sep 18 17:07:32 PDT 2012
One of our wonderful GSOC projects this summer resulted in Mihai
Carabas adding a cpu topology awareness framework to DragonFly and
doing work on the scheduler to make it more topology aware.
This started several of us on a performance benchmarking binge, with
Francois Tigeot in particular running postgres/pgbench tests on a 12x2
(24 thread) Xeon box and me running tests on a smaller 4x2 (8 thread)
Xeon box and our larger 48-core Opteron box.
In the last month the master branch has gone through some radical
changes. All the work is in, but some of it is still experimental and
requires a sysctl to turn on.
* PMAP MMU optimizations for 64-bit systems. We noticed that when
postgres servers are used with very large shared memory areas,
either with SYSV SHM or MMAP, each postgres server process (these are
forked rather than threaded) has to fault-in tens of thousands of pages.
When you multiply by a potentially large number of postgres server
processes this turns into millions of faults. In addition, each
process is maintaining its own complete copy of the page table.
This optimization works for SYSV SHM as well as any large shared or
read-only mmap of anonymous or file-backed data. The optimization
causes the actual page table pages themselves to be cached in the
backing VM object (thus not subject to destruction when processes
using the mappings fork() or exit), and the individual MMU maps for
each process actually share the page tables by mapping shared page
table pages. This removes nearly ALL page faults from a warmed-up
postgres server, even if there are hundreds of postgres server
processes forked and even when it does fresh fork()s. In addition,
most of the page tables for these processes are now shared (even
though they were forked and not threaded), thus making far better
use of cpu memory caches.
* Read shortcut through the VM system (integrated w/HAMMER for now).
This doubles the performance of read() system calls from the cache
which would otherwise cause the buffer cache to cycle (when the VM
page cache is big enough to cache the data set but the buffer cache
is not). In this situation the cycling of the buffer cache causes
a large number of SMP MMU invalidations due to the constant adjusting
of VM page mappings in kernel memory.
With this shortcut, cached file data read with read() is copied out
using the DMAP instead of the buffer cache, not only improving read()
performance but also significantly improving all activities on
multi-core systems due to the reduced kernel page smashing.
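
One way to exercise the cached read() path from userland is a small timing
loop like the sketch below. It is only a harness, not a reproduction of our
tests: the function name and the file size, chunk size, and pass count are
placeholders. To actually hit the case described above, the file would have
to be larger than the buffer cache but small enough to stay in the VM page
cache.

```python
import os
import tempfile
import time


def timed_cached_reads(size=4 * 1024 * 1024, chunk=64 * 1024, passes=4):
    """Write a temp file, then re-read it with plain read() calls.

    Returns (total_bytes_read, elapsed_seconds). After the write, the
    data is normally still cached, so the reads should be served from
    memory rather than from disk.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"x" * size)
        total = 0
        start = time.perf_counter()
        for _ in range(passes):
            rfd = os.open(path, os.O_RDONLY)
            try:
                while True:
                    buf = os.read(rfd, chunk)  # one read() system call
                    if not buf:
                        break
                    total += len(buf)
            finally:
                os.close(rfd)
        elapsed = time.perf_counter() - start
        return total, elapsed
    finally:
        os.unlink(path)


if __name__ == "__main__":
    nbytes, secs = timed_cached_reads()
    print(f"read {nbytes} bytes in {secs:.4f}s")
```

Running several copies at once on a multi-core box is closer to the
situation where the reduced MMU invalidation traffic should matter.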
* Scheduler rewrite. Mihai Carabas made large strides in scheduler
performance on larger servers with his cpu topology awareness framework
and his work on our user thread scheduler. However, there were still
significant limitations in the scheduler due to its original design.
The original scheduler was essentially single-threaded, using a global
spinlock to protect a single global scheduling run queue. This led
to a number of SMP-related bottlenecks in the scheduler and also
complicated the algorithms.
I have now completed a rewrite of the scheduler that incorporates
Mihai's cpu topology infrastructure and rewrites the algorithms to
utilize the new scheduler framework.
The new scheduler utilizes per-cpu queues and fine-grained per-cpu
spinlocks. There are no global spinlocks, removing that bottleneck.
The new scheduler rewrites the cpu topology algorithms to implement
a top-down (whole-machine -> socket -> core -> hyperthread) scheduling
implementation, performing three major algorithmic actions:
(1) It generates a load factor at all levels and load-balances the
assignment of processes to cpus in a topology-aware framework.
This means that if you have 4 processes running in a 4x2 (8-thread)
environment, they will be scheduled to cores and not to competing
hyperthreads. If you have two cpu sockets and two processes, one
will be scheduled to each socket to make best use of their caches.
(2) It will try to avoid migrating processes when possible, and when
not possible it will try to keep them nearby from a
(3) It will detect process block/wakeup events which e.g. tie two
processes together, and will try to move the process pairs closer
to each other using that information.
For example, if you have many postgres clients and servers on a
large server, enough to load down all cores, the client and
server pairs will be localized to the same socket, thus making
use of chip caches to facilitate communications between the two.
(the scheduler changes are now the default in master)
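
The top-down placement in (1) can be sketched as a toy model. This is not
the kernel code: it assumes the load factor is just a count of runnable
processes per cpu, breaks ties by lowest index, and ignores the migration
avoidance and pairing logic from (2) and (3). All function names here are
hypothetical.

```python
from itertools import product


def make_topology(sockets, cores, threads):
    """Map each (socket, core, thread) cpu to its load, initially 0."""
    return {cpu: 0 for cpu in product(range(sockets), range(cores),
                                      range(threads))}


def pick_cpu(load):
    """Walk socket -> core -> hyperthread, taking the least loaded
    choice at each level (ties go to the lowest index)."""
    def total(prefix):
        n = len(prefix)
        return sum(v for cpu, v in load.items() if cpu[:n] == prefix)

    sockets = sorted({cpu[0] for cpu in load})
    sock = min(sockets, key=lambda s: total((s,)))
    cores = sorted({cpu[1] for cpu in load if cpu[0] == sock})
    core = min(cores, key=lambda c: total((sock, c)))
    threads = sorted(cpu[2] for cpu in load if cpu[:2] == (sock, core))
    thread = min(threads, key=lambda t: load[(sock, core, t)])
    return (sock, core, thread)


def place(load, nprocs):
    """Assign nprocs processes one at a time, updating the load."""
    placed = []
    for _ in range(nprocs):
        cpu = pick_cpu(load)
        load[cpu] += 1
        placed.append(cpu)
    return placed


# 4 processes on a single-socket 4x2 (8-thread) machine.
four = place(make_topology(1, 4, 2), 4)
# 2 processes on a two-socket machine.
two = place(make_topology(2, 4, 2), 2)
print(four)
print(two)
```

In this model the four processes land on four different cores rather than
on sibling hyperthreads, and the two processes land on different sockets,
which matches the behavior described in (1).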
* Finally, default values for many things on 64-bit machines have been
adjusted upward significantly to make proper use of available
resources. There were numerous caps that had been inherited from the
32-bit code that are now gone or greatly raised, particularly with
regard to SYSV shared memory and the buffer cache.
The result is an IMMENSE improvement in postgres benchmarks as well as
across-the-board improvements in performance under load. We pretty
much outstrip the other BSDs now and we get fairly close to (though do
not quite beat) the higher-end Linux benchmarks.
In addition, the new scheduler algorithms affect many other system
activities, such as source code builds (which make heavy use of pipes),
web servers, and even interactive vs batch processing.
Francois will post updated graphs today or tomorrow showing the immense
progress we've made.
<dillon at backplane.com>