Some libc malloc work...

Venkatesh Srinivas me at endeavour.zapto.org
Wed Mar 17 08:27:41 PDT 2010


Hi world,

I've spent some time in the last few weeks working on the libc malloc, 
this is just a note about the work I've done.

The existing nmalloc handles multithreaded applications by having four 
SLGlobalData structures, each of which controls an independent set of 
allocation zones. Each SLGlobaldata structure has a spinlock, protecting 
all of the associated zones. On allocation, a thread attempts to lock the 
last SLGlobalData structure it allocated from; if the lock is already 
held, it moves on to the next global data structure and uses _SPINLOCK 
rather than _SPINTRYLOCK this time.

Each SLGlobalData structure also contains a small (4-element) cache of 
free zones; this is meant to avoid having to call _vmem_alloc (ultimately 
mmap) when recently free zones are available. Since this cache is 
per-SLGD, its possible that there will be mmap requests while other SLGDs 
still have free zones, but there is a strict bound on this sort of 
behavior. The free zone list is only populated when a zone becomes free; a 
burst of different-sized allocations will end up bypassing it and calling 
mmap a number of times.

I tried to improve these two areas; my work is available at 
http://endeavour.zapto.org/dfly/nmalloc.c. (the diff is also there, at 
nmalloc.c.diff, if you prefer).
Here's what I've changed:

I implemented a magazine layer, in the style of Solaris's libumem; each 
thread has pointers to two magazines, 'loaded' and 'previous', per zone 
size (struct thr_mags is the structure; it is accessed via an __thread 
variable, thread_mags). In addition to the __thread variable, there is a 
pthread_key associated with thread_mags; we don't use the 
pthread_get_specific mechanism to access the thread_mags, but we do have 
this key so that a destructor, mtmagazine_destructor(), is called on 
thread exit, to drain all thread-specific resources.

There are per-Zone Depot structures, which are lists of full and empty 
magazines associated with a size class, and a pthread_spin_lock. I use a 
pthread_spin_lock rather than a spinlock_t since there is a static number 
of spinlocks available, and using spinlock_t here will use many of them 
(it works with libthread_xu here, but I did development on FreeBSD 6.3 
with libpthread, which runs out). Unlike the Solaris libumem, there is no 
contention counter or counts on the size of the free/full lists; I haven't 
implemented magazine scaling, so they weren't necessary.

I removed the existing MT strategy; there is only one SLGlobalData 
structure.

The multithreaded path for allocation now looks like:

_slaballoc calls mtmagazine_alloc(); if there is a loaded magazine for the 
current size class (via zoneindex), it allocates the first object from 
that magazine and returns. If that magazine is empty, it swaps in a full 
prev magazine and retries; It then attempts to lock and allocate a full 
magazine from the depot for a given size; if it cannot, it gives 
_slaballoc a chance to handle the allocation as it always has. (For more 
details on why the two magazines, see the Solaris libumem paper).
_slabfree calls mtmagazine_free, as expected. Its path is pretty much 
symettric with mtmagazine_alloc.

To try to reduce the number of mmaps, I added one more magazine, a cache 
of zones; when a zone becomes free, I now insert it into zone_magazine, 
from zone_free, and zone_alloc attempts to pull from this cache before 
hitting mmap. The zone magazine has a flag, M_BURST, and another, 
M_BURST_EARLY, set; when M_BURST is set and a magazine fails to allocate 
from itself, it calls the underlying allocator (_vmem_alloc here) to 
allocate a number of objects, initially 8. M_BURST_EARLY causes that 8 to 
then be scaled down by NSCALE each time, till it hits one. This behavior 
was meant to help with program startup and bursts of uneven allocation 
sizes that aren't connected with the steady-state behavior of a program. 
The zone magazine, when full, sheds load as well, releasing via _vmem_free 
until low_factor (currently 48) is reached. All of this code is in 
zone_alloc / zone_free.

A few quick tests - I captured all of the allocation traffic during the 
startup of gnome-terminal and the epiphany browser; these are available in 
gt.c and ep.c in my home directory on leaf; warning, compiling ep.c causes 
gcc to eat ~4.6GB of RAM, so be careful.

Gnome-terminal makes ~85000 allocations/frees on startup; when run with 
the existing nmalloc, this results in 69 mmaps, 54 for zone structures, 
and 11 munmaps. When my patched version is used, there are 66 mmaps, 11 
munmaps; not a major difference. When M_BURST is raise to 16 and the 
allocator scales by subtraction rather than division (16, 14....), we drop 
to 22 mmaps and 9 munmaps (for comparison, FreeBSD's jemalloc calls mmap 
10 times and munmap 4). I'll post the Epiphany numbers a little later 
today; also the virt/res sizes for each.

I've not done any particular MT scaling measurements; I used the 
malloc-test tool, from Lever & Boreham (referenced from the jemalloc 
paper) and got an excellent upper bound on scalability, but that test is a 
fairly cheesy one. Any suggestions for good things to try would be pretty 
cool.

Any comments on the code or ideas would rock...

Thanks,
-- vs




More information about the Kernel mailing list