Malloc scaling/speedup work, request for testers/reviews

Sun Apr 18 15:15:26 PDT 2010

Hiya,

I've been working for some time on improving the libc malloc in
DragonFly, specifically improving its thread scaling and reducing the
number of mmap/munmap system calls it issues.

To address the issue of scaling, I added per-thread magazines for
allocations < 8K (ones that would hit the current slab zones); the
magazine logic is based straight on Bonwick and Adams's 2001
'Magazines and Vmem' paper, about the Solaris kernel allocator and
libumem. To address the number of mmaps the allocator was making, I
added a single magazine between the slab layer and mmap - it caches up
to 64 zones and will reuse them rather than requesting/releasing to
the system. In addition, I made the first request for a zone allocate
not one, but 8 zones, the second will allocate 6, so on and on, till
we have stabilized at allocating one-at-a-time. This logic was meant
to deal with programs issuing requests for different-sized objects
early on in their life.

Some benchmark results so far:

sh6bench =============================
To quote Jason Evans, 'is a quirky malloc benchmark that has been used
in some of the published malloc literature, so I include it here.
sh6bench does cycles of allocating groups of objects, where the size
of objects in each group is random. Sometimes the objects are held for
a while before being freed.' The test is available at:
http://m-net.arbornet.org/~sv5679/sh6bench.c

When run on DragonFly, with 50000 calls for objects between 1 and 1512
bytes, nmalloc (the current libc allocator) takes 85sec to finish the
test; nmalloc-1.33 takes 58s, spending nearly 20 sec less in system
time.

When tested with 2500 calls, for 1...1512 byte objects on FreeBSD 8,
nmalloc, nmalloc 1.33, and jemalloc (the FreeBSD libc allocator) turn
in times very close to one another.
Here are the total memory uses and mmap call counts:
(nmalloc 'g' is nmalloc with the per-thread caches disabled).

                         mmaps / munmaps		total space requested/release     diff
nmalloc		1582 / 438		                107,598,464 b / 29,220,560 b
78,377,904 b
nmalloc 1.33	1154 / 9		                81,261,328 b / 852,752 b		80,408,576 b
nmalloc 1.33 'g'	1148 / 9		                81,130,256 b / 852,752
b		80,277,504 b
jemalloc	         45 / 4		       	        82,718,411 b / 1,609,424
b		81,108,987 b

I also graphed the allocation structure using Jason Evans's mtrgraph
tool (nmalloc 1.33 can generate utrace events).
nmalloc: http://acm.jhu.edu/~me/heap/sh6bench_nmalloc.png
nmalloc 1.33: http://acm.jhu.edu/~me/heap/sh6bench_nmalloc133.png
jemalloc: http://acm.jhu.edu/~me/heap/sh6bench_jemalloc.png

The horizontal axis of this graph is time in terms of allocation/free
events. The vertical is
address space, split into buckets. Similar traces were generated when
jemalloc was in development:
http://people.freebsd.org/~jasone/jemalloc/plots/