Understanding the idlezero code
Tim Darby
t+dfbsd at timdarby.net
Thu Sep 23 08:41:58 PDT 2010
Very interesting, thanks!
Tim
On Wed, Sep 22, 2010 at 10:36 AM, Venkatesh Srinivas <me at endeavour.zapto.org> wrote:
Hi,
A feature added to DragonFly during the 2.8 development cycle was
idle-time page zeroing. Stated simply, the system will use some of its
idle time to zero free pages, possibly saving time when they are
allocated. Walking through the idle zero code is instructive - it provides
a view into a number of DragonFly kernel subsystems.
Some background:
The DragonFly (and FreeBSD) virtual memory systems are organized around a
number of queues, describing all of the page frames in a system. The
queues are:
active := Pages that are actively mapped and in use
inactive := Pages that are dirty; these may be mapped, but will be
reclaimed under memory pressure
cache := Pages that are clean and reusable, but still hold their
contents until needed under pressure
free := Pages not actively holding data, ready for allocation
The cache and free queues are actually divided into a number of
sub-queues, one for each cache color, but they function as single queues.
They are also loosely sorted, with zeroed pages at the tail.
Page allocation requests, for example by a user process zero-fill fault,
need pages of zeroes. The fault handler code will call vm_page_alloc
(found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO flag set,
which will take a page from the tail of the free queue, if available. If
the page was not already zeroed, it will be (by the caller!). Having zeroed
pages around would save that time.
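To make the tail-of-queue behaviour concrete, here is a rough user-space sketch of one free queue. It is not the kernel's data structure: toy_page, toy_freeq, toy_free, toy_alloc_tail and toy_zero_fill are all hypothetical names, and the rule "zeroed pages are freed to the tail, everything else to the head" is my reading of the "loosely sorted" description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a page frame; not the kernel's struct vm_page. */
struct toy_page {
    struct toy_page *prev, *next;
    bool zeroed;                        /* plays the role of PG_ZERO */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_freeq {
    struct toy_page *head, *tail;
};

/* Free a page: zeroed pages go to the tail, all others to the head. */
static void toy_free(struct toy_freeq *q, struct toy_page *p)
{
    assert(q != NULL && p != NULL);
    p->prev = p->next = NULL;
    if (q->head == NULL) {
        q->head = q->tail = p;
    } else if (p->zeroed) {
        p->prev = q->tail;
        q->tail->next = p;
        q->tail = p;
    } else {
        p->next = q->head;
        q->head->prev = p;
        q->head = p;
    }
}

/* Take the page at the tail, where zeroed pages collect.
 * *was_zeroed tells the caller whether it still has to zero the page. */
static struct toy_page *toy_alloc_tail(struct toy_freeq *q, bool *was_zeroed)
{
    struct toy_page *p = q->tail;

    if (p == NULL)
        return NULL;
    q->tail = p->prev;
    if (q->tail != NULL)
        q->tail->next = NULL;
    else
        q->head = NULL;
    p->prev = p->next = NULL;
    *was_zeroed = p->zeroed;
    p->zeroed = false;                  /* the page is about to be used */
    return p;
}

/* Caller side of a zero-fill request.
 * Returns 0 if the page came pre-zeroed, 1 if it had to be zeroed here,
 * and -1 if the queue was empty. */
static int toy_zero_fill(struct toy_freeq *q, struct toy_page **out)
{
    bool was_zeroed = false;
    struct toy_page *p = toy_alloc_tail(q, &was_zeroed);

    *out = p;
    if (p == NULL)
        return -1;
    if (was_zeroed)
        return 0;
    memset(p->data, 0, TOY_PAGE_SIZE);
    return 1;
}
```

A return value of 0 from toy_zero_fill() is the case the idle zero code is trying to make common: the memset has already been paid for during idle time.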
In DragonFly, the idle zero logic runs in its own LWKT, which runs at
system idle time. The LWKT is somewhat atypical - it works pretty hard to
get out of the way, at costs to its own idle zero rate. (In FreeBSD 4.x,
it ran as part of the idle loop; in 6.x+, it runs in its own kernel
thread).
Code time:
The code is in /usr/src/sys/vm/vm_zeroidle.c
(http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
like to follow along.
Typical of such walkthroughs, we will start at the very last line of the
file:
SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);
SYSINIT is a DFly/FBSD kernel macro, which marks a function to be called
during boot. This SYSINIT invocation is saying 'call the function
pagezero_start, when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
point during the VM daemon startup (SI_ORDER_ANY), with NULL args'.
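One way to picture what SYSINIT arranges is a table of (subsystem, order, function, argument) entries that is sorted and then walked at boot. The sketch below is a user-space model under that assumption only; the toy_ names and the numeric values of the constants are hypothetical (TOY_ORDER_ANY is simply given a large value so that it sorts late), and it says nothing about how the kernel actually collects the entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical values; the real constants live in the kernel headers. */
enum { TOY_SUB_EARLY = 1, TOY_SUB_KTHREAD_VM = 2 };
enum { TOY_ORDER_FIRST = 0, TOY_ORDER_ANY = 1000 };

struct toy_sysinit {
    int subsystem;              /* which boot stage the entry belongs to */
    int order;                  /* position within that stage */
    void (*func)(void *);       /* function to call */
    void *arg;                  /* argument handed to func */
};

/* Sort by subsystem first, then by order within the subsystem. */
static int toy_sysinit_cmp(const void *a, const void *b)
{
    const struct toy_sysinit *x = a, *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem > y->subsystem) - (x->subsystem < y->subsystem);
    return (x->order > y->order) - (x->order < y->order);
}

/* Sort the table in place, then call every entry in that order. */
static void toy_run_sysinits(struct toy_sysinit *tab, size_t n)
{
    if (n == 0)
        return;
    assert(tab != NULL);
    qsort(tab, n, sizeof(tab[0]), toy_sysinit_cmp);
    for (size_t i = 0; i < n; i++)
        tab[i].func(tab[i].arg);
}

/* Demo init function: appends the one-character tag it is handed to a log. */
static char toy_log[16];
static size_t toy_log_len;

static void toy_record(void *arg)
{
    if (toy_log_len + 1 < sizeof(toy_log))
        toy_log[toy_log_len++] = *(const char *)arg;
}
```

In this model, an entry like the pagezero one would be { TOY_SUB_KTHREAD_VM, TOY_ORDER_ANY, some_start_function, NULL }: it runs during the VM daemon stage, after entries of earlier stages.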
The pagezero_start function, just above the SYSINIT invocation, looks
like (simplified):
static void pagezero_start(void __unused *arg) {
        struct thread *td;

        idlezero_nocache = bzeront_avail;
        kthread_create(vm_pagezero, NULL, &td, "pagezero");
}
This function captures a flag from the platform specific code - is the
bzeront function available (on SSE2 i386 systems, we use the MOVNTI
instruction to zero pages, avoiding polluting a processor's Data Cache
with lots of zeroes; this flag indicates whether MOVNTI is available). The
function then kicks off an LWKT, named 'pagezero', running the vm_pagezero
function. The LWKT starts up with the MP lock held.
The vm_pagezero() function, lurking just above in this file, is the core
of the idle zero logic. It performs some setup work:
> lwkt_setpri_self(TDPRI_IDLE_WORK);
> lwkt_setcpu_self(globaldata_find(ncpus - 1));
These calls set its priority to just above the idle thread and move it to
the last CPU on the system. It then enters its main loop.
The idle zero main loop is constructed as a state machine, with a few
states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop switches on the current state, executes a small block of code, then transitions states. At each transition, it calls lwkt_yield(), to switch to any ready LWKTs on the current CPU.
The idle state is the state that the logic starts in:
> case STATE_IDLE:
>         tsleep(&zero_state, 0, "pgzero", sleep_time);
>         if (vm_page_zero_check())
>                 npages = idlezero_rate / 10;
>         sleep_time = vm_page_zero_time();
>         if (npages)
>                 state = STATE_GET_PAGE;
>         break;
In the idle state, the idle zero LWKT sleeps for 'sleep_time'; when there are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME', or ten times the system clock; when there are, we sleep for 'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes from its sleep, it calls vm_page_zero_check(), also in this file; vm_page_zero_check() will be described later, but it returns true if we should be zeroing pages. If so, we compute the number of pages to zero, how long to sleep on the next entry to the idle state, and transition to the GET_PAGE state. We break between transitions, to attempt lwkt_yield() again.
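The arithmetic in the idle state can be sketched in user space as follows. This is a simplification with assumptions of my own: 'hz' is taken to be the clock rate in ticks per second, 'ten times' and 'a tenth of' the system clock are read as hz * 10 and hz / 10 ticks, and a single should_zero flag stands in for both vm_page_zero_check() and vm_page_zero_time(). The names toy_idle_plan and toy_idle_step are hypothetical.

```c
#include <assert.h>

/* What the idle state decides before leaving (hypothetical struct). */
struct toy_idle_plan {
    int npages;         /* pages to zero before idling again */
    int sleep_ticks;    /* how long to sleep on the next idle entry */
};

/* Compute the page budget and next sleep interval for one wakeup. */
static struct toy_idle_plan
toy_idle_step(int hz, int idlezero_rate, int should_zero)
{
    /* Nothing to do: no budget, long sleep. */
    struct toy_idle_plan plan = { 0, hz * 10 };

    assert(hz > 0 && idlezero_rate >= 0);
    if (should_zero) {
        plan.npages = idlezero_rate / 10;   /* as in the excerpt above */
        plan.sleep_ticks = hz / 10;         /* short sleep while busy */
    }
    return plan;
}
```

For example, with hz = 100 the short sleep comes out at 10 ticks and the long one at 1000 ticks; with a rate of 80 the budget per wakeup is 8 pages.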
The GET_PAGE state logic looks like:
> case STATE_GET_PAGE:
>         m = vm_page_free_fromq_fast();
>         if (m == NULL) {
>                 state = STATE_IDLE;
>         } else {
>                 state = STATE_ZERO_PAGE;
>                 buf = lwbuf_alloc(m);
>                 pg = (char *)lwbuf_kva(buf);
>         }
>         break;
In GET_PAGE state we attempt to acquire a page to zero, using a relatively new interface, vm_page_free_fromq_fast(). This routine, in vm_page.c, attempts to get a page from one of the free queues. If it fails to get one, we return to the idle state; otherwise, we prepare to enter the ZERO_PAGE state. We allocate an lwbuf and bind it to the page we wish to zero.
In the ZERO_PAGE state, we actually zero the page:
> case STATE_ZERO_PAGE:
>         while (i < PAGE_SIZE) {
>                 if (idlezero_nocache == 1)
>                         bzeront(&pg[i], IDLEZERO_RUN);
>                 else
>                         bzero(&pg[i], IDLEZERO_RUN);
>                 i += IDLEZERO_RUN;
>                 lwkt_yield();
>         }
>         state = STATE_RELEASE_PAGE;
>         break;
We loop across the entire page, zeroing 64 bytes at a time. After each 64-byte run, we lwkt_yield(), if any LWKTs are waiting to run. If the MOVNTI instruction is available, we use it via bzeront(); otherwise, we use bzero(). When we are done zeroing the page, we enter the RELEASE_PAGE state.
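The same chunked loop can be tried outside the kernel. In this sketch memset() stands in for bzero()/bzeront(), and a caller-supplied callback stands in for lwkt_yield(); toy_zero_page, toy_count_yield and the TOY_ constants are hypothetical names, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096  /* assumed page size */
#define TOY_ZERO_RUN  64    /* the 64-byte run described above */

/* Zero one page in TOY_ZERO_RUN-byte runs, calling yield_fn (if any)
 * after each run. Returns the number of runs performed. */
static size_t toy_zero_page(unsigned char *pg, void (*yield_fn)(void))
{
    size_t i = 0, runs = 0;

    assert(pg != NULL);
    while (i < TOY_PAGE_SIZE) {
        memset(&pg[i], 0, TOY_ZERO_RUN);
        i += TOY_ZERO_RUN;
        runs++;
        if (yield_fn != NULL)
            yield_fn();     /* chance for other work to run */
    }
    return runs;
}

/* Demo yield hook: only counts how often it was offered the CPU. */
static size_t toy_yield_count;

static void toy_count_yield(void)
{
    toy_yield_count++;
}
```

With these sizes a page takes 4096 / 64 = 64 runs, so the zeroing thread offers to give up the CPU 64 times per page. That is the trade-off mentioned earlier: it gets out of the way quickly, at some cost to its own zeroing rate.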
In the RELEASE_PAGE state, we tear down the lwbuf and return the page to the free queue:
> case STATE_RELEASE_PAGE:
>         lwbuf_free(buf);
>         vm_page_flag_set(m, PG_ZERO);
>         vm_page_free_toq(m);
>         state = STATE_GET_PAGE;
>         ++idlezero_count;
>         break;
We first release the lwbuf; we then mark the page as zeroed and return it to the free queue. We transition back to the GET_PAGE state, and bump an idlezero counter.
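Putting the four states together, here is a small user-space model of the loop. It is a sketch, not the kernel code: an array of toy_frame structures stands in for the free queues, memset() for the zeroing routines, and the way the per-round budget is decremented in the release state is an assumption of mine, since the excerpts above do not show how npages is consumed. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FRAME_SIZE 256  /* small on purpose; real pages are larger */

enum toy_state { TOY_IDLE, TOY_GET_PAGE, TOY_ZERO_PAGE, TOY_RELEASE_PAGE };

struct toy_frame {
    int zeroed;                         /* plays the role of PG_ZERO */
    unsigned char data[TOY_FRAME_SIZE];
};

struct toy_machine {
    enum toy_state state;
    struct toy_frame *cur;  /* frame currently being zeroed */
    int budget;             /* frames still allowed this round (assumed) */
    int zero_count;         /* plays the role of idlezero_count */
};

/* Find any frame that is not yet zeroed; NULL if there is none. */
static struct toy_frame *toy_get_dirty(struct toy_frame *frames, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!frames[i].zeroed)
            return &frames[i];
    return NULL;
}

/* Execute one state and pick the next. The kernel thread would offer to
 * yield the CPU between calls to this function. */
static void toy_step(struct toy_machine *m, struct toy_frame *frames, size_t n)
{
    assert(m != NULL && frames != NULL);
    switch (m->state) {
    case TOY_IDLE:
        /* The real thread sleeps here; the model only checks the budget. */
        if (m->budget > 0)
            m->state = TOY_GET_PAGE;
        break;
    case TOY_GET_PAGE:
        m->cur = toy_get_dirty(frames, n);
        m->state = (m->cur != NULL) ? TOY_ZERO_PAGE : TOY_IDLE;
        break;
    case TOY_ZERO_PAGE:
        memset(m->cur->data, 0, TOY_FRAME_SIZE);
        m->state = TOY_RELEASE_PAGE;
        break;
    case TOY_RELEASE_PAGE:
        m->cur->zeroed = 1;
        m->cur = NULL;
        m->zero_count++;
        m->state = (--m->budget > 0) ? TOY_GET_PAGE : TOY_IDLE;
        break;
    }
}

/* Start a round with the given budget and run until back in IDLE. */
static void toy_run_round(struct toy_machine *m, struct toy_frame *frames,
                          size_t n, int budget)
{
    m->budget = budget;
    m->state = TOY_IDLE;
    do {
        toy_step(m, frames, n);
    } while (m->state != TOY_IDLE);
}
```

A round ends either because the budget runs out or because no unzeroed frame is left, which mirrors the two ways the real loop falls back to the idle state.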
The operation of the idle zero code can be monitored via sysctls - the sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults which found a zero-filled page waiting for them (vm.stats.vm.v_zfod tracks total zfod faults). The sysctl vm.idlezero_count tracks the total number of pages the idle zero logic has managed to zero-fill.
Hopefully this was interesting,
-- vs