Some libc malloc work...
Venkatesh Srinivas
me at endeavour.zapto.org
Wed Mar 17 08:27:41 PDT 2010
Hi world,
I've spent some time in the last few weeks working on the libc malloc,
this is just a note about the work I've done.
The existing nmalloc handles multithreaded applications by having four
SLGlobalData structures, each of which controls an independent set of
allocation zones. Each SLGlobaldata structure has a spinlock, protecting
all of the associated zones. On allocation, a thread attempts to lock the
last SLGlobalData structure it allocated from; if the lock is already
held, it moves on to the next global data structure and uses _SPINLOCK
rather than _SPINTRYLOCK this time.
Each SLGlobalData structure also contains a small (4-element) cache of
free zones; this is meant to avoid having to call _vmem_alloc (ultimately
mmap) when recently free zones are available. Since this cache is
per-SLGD, its possible that there will be mmap requests while other SLGDs
still have free zones, but there is a strict bound on this sort of
behavior. The free zone list is only populated when a zone becomes free; a
burst of different-sized allocations will end up bypassing it and calling
mmap a number of times.
I tried to improve these two areas; my work is available at
http://endeavour.zapto.org/dfly/nmalloc.c. (the diff is also there, at
nmalloc.c.diff, if you prefer).
Here's what I've changed:
I implemented a magazine layer, in the style of Solaris's libumem; each
thread has pointers to two magazines, 'loaded' and 'previous', per zone
size (struct thr_mags is the structure; it is accessed via an __thread
variable, thread_mags). In addition to the __thread variable, there is a
pthread_key associated with thread_mags; we don't use the
pthread_get_specific mechanism to access the thread_mags, but we do have
this key so that a destructor, mtmagazine_destructor(), is called on
thread exit, to drain all thread-specific resources.
There are per-Zone Depot structures, which are lists of full and empty
magazines associated with a size class, and a pthread_spin_lock. I use a
pthread_spin_lock rather than a spinlock_t since there is a static number
of spinlocks available, and using spinlock_t here will use many of them
(it works with libthread_xu here, but I did development on FreeBSD 6.3
with libpthread, which runs out). Unlike the Solaris libumem, there is no
contention counter or counts on the size of the free/full lists; I haven't
implemented magazine scaling, so they weren't necessary.
I removed the existing MT strategy; there is only one SLGlobalData
structure.
The multithreaded path for allocation now looks like:
_slaballoc calls mtmagazine_alloc(); if there is a loaded magazine for the
current size class (via zoneindex), it allocates the first object from
that magazine and returns. If that magazine is empty, it swaps in a full
prev magazine and retries; It then attempts to lock and allocate a full
magazine from the depot for a given size; if it cannot, it gives
_slaballoc a chance to handle the allocation as it always has. (For more
details on why the two magazines, see the Solaris libumem paper).
_slabfree calls mtmagazine_free, as expected. Its path is pretty much
symettric with mtmagazine_alloc.
To try to reduce the number of mmaps, I added one more magazine, a cache
of zones; when a zone becomes free, I now insert it into zone_magazine,
from zone_free, and zone_alloc attempts to pull from this cache before
hitting mmap. The zone magazine has a flag, M_BURST, and another,
M_BURST_EARLY, set; when M_BURST is set and a magazine fails to allocate
from itself, it calls the underlying allocator (_vmem_alloc here) to
allocate a number of objects, initially 8. M_BURST_EARLY causes that 8 to
then be scaled down by NSCALE each time, till it hits one. This behavior
was meant to help with program startup and bursts of uneven allocation
sizes that aren't connected with the steady-state behavior of a program.
The zone magazine, when full, sheds load as well, releasing via _vmem_free
until low_factor (currently 48) is reached. All of this code is in
zone_alloc / zone_free.
A few quick tests - I captured all of the allocation traffic during the
startup of gnome-terminal and the epiphany browser; these are available in
gt.c and ep.c in my home directory on leaf; warning, compiling ep.c causes
gcc to eat ~4.6GB of RAM, so be careful.
Gnome-terminal makes ~85000 allocations/frees on startup; when run with
the existing nmalloc, this results in 69 mmaps, 54 for zone structures,
and 11 munmaps. When my patched version is used, there are 66 mmaps, 11
munmaps; not a major difference. When M_BURST is raise to 16 and the
allocator scales by subtraction rather than division (16, 14....), we drop
to 22 mmaps and 9 munmaps (for comparison, FreeBSD's jemalloc calls mmap
10 times and munmap 4). I'll post the Epiphany numbers a little later
today; also the virt/res sizes for each.
I've not done any particular MT scaling measurements; I used the
malloc-test tool, from Lever & Boreham (referenced from the jemalloc
paper) and got an excellent upper bound on scalability, but that test is a
fairly cheesy one. Any suggestions for good things to try would be pretty
cool.
Any comments on the code or ideas would rock...
Thanks,
-- vs
More information about the Kernel
mailing list