Understanding the idlezero code

Wed Sep 22 10:41:32 PDT 2010

Hi,

A feature added to DragonFly during the 2.8 development cycle was
idle-time page zeroing. Stated simply, the system will use some of its
idle time to zero free pages, possibly saving time when they are
allocated. Walking through the idle zero code is instructive - it provides
a view into a number of DragonFly kernel subsystems.

Some background:

The DragonFly (and FreeBSD) virtual memory systems are organized around a
number of queues, describing all of the page frames in a system. The
queues are:
        active := Pages that are actively mapped and in use
        inactive := Pages that are dirty; these may be mapped, but will be
                    reclaimed under memory pressure
        cache := Pages that are clean and reusable, but still hold their
                 contents until needed under pressure
        free := Pages not actively holding data, ready for allocation

The cache and free queues are actually divided into a number of
sub-queues, one for each cache color, but they function as single queues.
They are also loosely sorted, with zeroed pages at the tail.

Page allocation requests, for example by a user process zero-fill fault,
need pages of zeroes. The fault handler code will call vm_page_alloc
(found in /usr/src/sys/vm/vm_page.c) with the VM_ALLOC_ZERO flag set,
which will take a page from the tail of the free queue, if available. If
the page was not already zeroed, it has to be zeroed at that point (by the
caller!). Having zeroed pages around would save that time.
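
To make the queue behaviour above concrete, here is a rough userland model
of it. This is only a sketch, not the kernel's data structures: toy_page,
toy_freeq, toy_free, toy_alloc, toy_zero_fill and the flag values are all
made up for illustration, and taking non-zero requests from the head of
the queue is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/queue.h>

#define TOY_PAGE_SIZE   4096
#define TOY_PG_ZERO     0x01    /* page contents are known to be zero */
#define TOY_ALLOC_ZERO  0x01    /* caller would like a pre-zeroed page */

struct toy_page {
        TAILQ_ENTRY(toy_page) link;
        int flags;
        char data[TOY_PAGE_SIZE];
};
TAILQ_HEAD(toy_freeq, toy_page);

/* Free a page: zeroed pages go to the tail, everything else to the head. */
static void
toy_free(struct toy_freeq *q, struct toy_page *m)
{
        if (m->flags & TOY_PG_ZERO)
                TAILQ_INSERT_TAIL(q, m, link);
        else
                TAILQ_INSERT_HEAD(q, m, link);
}

/* Allocate: zero requests look at the tail, others at the head (assumed). */
static struct toy_page *
toy_alloc(struct toy_freeq *q, int req)
{
        struct toy_page *m;

        if (req & TOY_ALLOC_ZERO)
                m = TAILQ_LAST(q, toy_freeq);
        else
                m = TAILQ_FIRST(q);
        if (m != NULL)
                TAILQ_REMOVE(q, m, link);
        return (m);
}

/*
 * What the caller does with the page it got: zero it only if it was not
 * already zeroed. Returns 1 if the caller had to do the work itself.
 */
static int
toy_zero_fill(struct toy_page *m)
{
        assert(m != NULL);
        if (m->flags & TOY_PG_ZERO)
                return (0);
        memset(m->data, 0, sizeof(m->data));
        return (1);
}
```

Every toy_zero_fill() call that returns 0 is a page-sized memset the
faulting process did not have to wait for, which is the whole point of
zeroing pages ahead of time.
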
In DragonFly, the idle zero logic runs in its own LWKT, which runs at
system idle time. The LWKT is somewhat atypical - it works pretty hard to
get out of the way, at a cost to its own idle zero rate. (In FreeBSD 4.x,
it ran as part of the idle loop; in 6.x+, it runs in its own kernel
thread.)

Code time:

The code is in /usr/src/sys/vm/vm_zeroidle.c
(http://grok.x12.su/source/xref/dragonfly/sys/vm/vm_zeroidle.c) if you'd
like to follow along.

As is typical of such walkthroughs, we will start at the very last line of
the file:

SYSINIT(pagezero, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, pagezero_start, NULL);

SYSINIT is a DFly/FBSD kernel macro, which marks a function to be called
during boot. This SYSINIT invocation is saying 'call the function
pagezero_start, when starting the VM daemons (SI_SUB_KTHREAD_VM), at any
point during the VM daemon startup (SI_ORDER_ANY), with NULL args'.
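
For readers who have not met SYSINIT before, a toy userland model of the
idea may help: entries are registered with a subsystem and an order, and
boot code sorts them and calls each in turn. This is only a sketch; the
toy_* names and the numeric values are invented, and the real macro
collects its entries at link time rather than in an explicit array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*toy_initfn_t)(void *);

struct toy_sysinit {
        unsigned subsystem;     /* coarse phase, like SI_SUB_KTHREAD_VM */
        unsigned order;         /* position in the phase, like SI_ORDER_ANY */
        toy_initfn_t func;      /* function to call */
        void *arg;              /* argument handed to func */
};

/* Sort by subsystem first, then by order within the subsystem. */
static int
toy_sysinit_cmp(const void *a, const void *b)
{
        const struct toy_sysinit *x = a, *y = b;

        if (x->subsystem != y->subsystem)
                return (x->subsystem < y->subsystem ? -1 : 1);
        if (x->order != y->order)
                return (x->order < y->order ? -1 : 1);
        return (0);
}

/* "Boot": sort the table, then call every entry with its argument. */
static void
toy_run_sysinits(struct toy_sysinit *tab, size_t n)
{
        size_t i;

        qsort(tab, n, sizeof(tab[0]), toy_sysinit_cmp);
        for (i = 0; i < n; i++)
                tab[i].func(tab[i].arg);
}

/* Sample init function: records one character per call, in call order. */
static char toy_boot_log[16];
static size_t toy_boot_len;

static void
toy_note(void *arg)
{
        assert(toy_boot_len < sizeof(toy_boot_log) - 1);
        toy_boot_log[toy_boot_len++] = *(const char *)arg;
}
```

Registering toy_note three times with (subsystem, order) pairs of (20, big),
(10, 1) and (20, 1) runs the (10, 1) entry first and the (20, big) entry
last, regardless of the order they appear in the table.
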
The pagezero_start function, just above the SYSINIT invocation, looks
like (simplified):

static void pagezero_start(void __unused *arg) {
      struct thread *td;
      idlezero_nocache = bzeront_avail;
      kthread_create(vm_pagezero, NULL, &td, "pagezero");
}

This function captures a flag from the platform-specific code: whether the
bzeront function is available. On SSE2 i386 systems, we use the MOVNTI
instruction to zero pages, which avoids polluting the processor's data
cache with lots of zeroes; the flag indicates whether MOVNTI is available.
The function then kicks off an LWKT, named 'pagezero', running the
vm_pagezero function. The LWKT starts up with the MP lock held.

The vm_pagezero() function, lurking just above in this file, is the core
of the idle zero logic. It first performs some setup work, setting its
priority to just above the idle thread and moving itself to the last CPU
on the system:
        > lwkt_setpri_self(TDPRI_IDLE_WORK);
        > lwkt_setcpu_self(globaldata_find(ncpus - 1));
It then enters its main loop.

The idle zero main loop is constructed as a state machine, with a few
states - IDLE, GET_PAGE, ZERO_PAGE, and RELEASE_PAGE. The main loop
switches on the current state, executes a small block of code, then
transitions states. At each transition, it calls lwkt_yield(), to switch
to any ready LWKTs on the current CPU.

The idle state is the state that the logic starts in:
        > case STATE_IDLE:
        >       tsleep(&zero_state, 0, "pgzero", sleep_time);
        >       if (vm_page_zero_check())
        >               npages = idlezero_rate / 10;
        >       sleep_time = vm_page_zero_time();
        >       if (npages)
        >               state = STATE_GET_PAGE;
        >       break;
In the idle state, the idle zero LWKT sleeps for 'sleep_time'; when there 
are no pages to zero, sleep_time is a long interval - 'LONG_SLEEP_TIME', 
or ten times the system clock; when there are, we sleep for
'DEFAULT_SLEEP_TIME', a tenth of the system clock. When the LWKT wakes 
from its sleep, it calls vm_page_zero_check(), also in this file; 
vm_page_zero_check() will be described later, but it returns true if we 
should be zeroing pages. If so, we compute the number of pages to zero, 
how long to sleep on the next entry to the idle state, and transition to 
the GET_PAGE state. We break between transitions, to attempt lwkt_yield() 
again.
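
The sleep interval choice can be written down as a tiny helper. This is a
sketch under two assumptions of mine: that 'the system clock' here means
the kernel tick rate (hz), and that the sleep interval is counted in
ticks. The name toy_sleep_ticks is made up.

```c
#include <assert.h>

/* Hypothetical helper: pick the next sleep interval, in clock ticks. */
static int
toy_sleep_ticks(int hz, int have_work)
{
        assert(hz > 0);
        if (have_work)
                return (hz / 10);       /* short nap: a tenth of the rate */
        return (hz * 10);               /* long nap: ten times the rate */
}
```

With hz = 100, that works out to 10 ticks while there is work and 1000
ticks when there is none; under the assumptions above, a tenth of a second
versus ten seconds.
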

The GET_PAGE state logic looks like:
        > case STATE_GET_PAGE:
        >       m = vm_page_free_fromq_fast();
        >       if (m == NULL) {
        >               state = STATE_IDLE;
        >       } else {
        >               state = STATE_ZERO_PAGE;
        >               buf = lwbuf_alloc(m);
        >               pg = (char *)lwbuf_kva(buf);
        >       }
        >       break;
In GET_PAGE state we attempt to acquire a page to zero, using a relatively 
new interface, vm_page_free_fromq_fast(). This routine, in vm_page.c, 
attempts to get a page from one of the free queues. If it fails to get 
one, we return to the idle state; otherwise, we prepare to enter the
ZERO_PAGE state. We allocate an lwbuf and bind it to the page we wish to 
zero.

In the ZERO_PAGE state, we actually zero the page:
        > case STATE_ZERO_PAGE:
        >       while (i < PAGE_SIZE) {
        >               if (idlezero_nocache == 1)
        >                       bzeront(&pg[i], IDLEZERO_RUN);
        >               else
        >                       bzero(&pg[i], IDLEZERO_RUN);
        >               i += IDLEZERO_RUN;
        >               lwkt_yield();
        >       }
        >       state = STATE_RELEASE_PAGE;
        >       break;
We loop across the entire page, zeroing 64 bytes at a time. After each
64-byte run, we call lwkt_yield(), which switches away if any LWKTs are
waiting to run. If the MOVNTI instruction is available, we use it via
bzeront(); otherwise, we use bzero(). When we are done zeroing the page,
we enter the RELEASE_PAGE state.
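
The zero-a-little-then-yield pattern is easy to try outside the kernel.
Below is a minimal userland sketch of it; toy_zero_page and the TOY_*
constants are made-up names, memset() stands in for bzero()/bzeront(), and
the yield hook is just a function pointer where the kernel would call
lwkt_yield().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096
#define TOY_ZERO_RUN    64      /* bytes zeroed between yields */

/*
 * Zero one page in short runs, offering to yield after each run.
 * Returns the number of runs it took to cover the page.
 */
static size_t
toy_zero_page(char *pg, void (*yield)(void))
{
        size_t i, runs = 0;

        assert(pg != NULL);
        for (i = 0; i < TOY_PAGE_SIZE; i += TOY_ZERO_RUN) {
                memset(&pg[i], 0, TOY_ZERO_RUN);
                if (yield != NULL)
                        yield();        /* let anything more urgent run */
                runs++;
        }
        return (runs);
}
```

A 4096-byte page at 64 bytes per run means 64 chances to get out of the
way per page, which is the trade the idle zero thread is making: it gives
up some zeroing throughput to stay out of everyone else's way.
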

In the RELEASE_PAGE state, we tear down the lwbuf and return the page to 
the free queue:
        > case STATE_RELEASE_PAGE:
        >       lwbuf_free(buf);
        >       vm_page_flag_set(m, PG_ZERO);
        >       vm_page_free_toq(m);
        >       state = STATE_GET_PAGE;
        >       ++idlezero_count;
        >       break;

We first release the lwbuf; we then mark the page as zeroed and return it 
to the free queue. We transition back to the GET_PAGE state, and bump an 
idlezero counter.
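
Putting the four states together, here is a stripped-down userland model
of the transitions only. It is a sketch, not the kernel code: the toy_*
names are invented, pages are reduced to two counters, the sleep, the
lwbuf handling and the per-wakeup page budget are left out, and the caller
is expected to do the yielding between steps.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_IDLE, TOY_GET_PAGE, TOY_ZERO_PAGE, TOY_RELEASE_PAGE };

struct toy_zeroer {
        enum toy_state state;
        int dirty_free;         /* free pages that still need zeroing */
        int zero_free;          /* free pages already zeroed */
        int have_page;          /* 1 while a page is checked out */
        long zeroed_count;      /* pages zeroed so far */
};

/* Run the block for the current state once and return the next state. */
static enum toy_state
toy_step(struct toy_zeroer *z)
{
        assert(z != NULL);
        switch (z->state) {
        case TOY_IDLE:
                /* The real thread sleeps here before checking for work. */
                if (z->dirty_free > 0)
                        z->state = TOY_GET_PAGE;
                break;
        case TOY_GET_PAGE:
                if (z->dirty_free == 0) {
                        z->state = TOY_IDLE;
                } else {
                        z->dirty_free--;
                        z->have_page = 1;
                        z->state = TOY_ZERO_PAGE;
                }
                break;
        case TOY_ZERO_PAGE:
                /* The page would be zeroed here, a short run at a time. */
                z->state = TOY_RELEASE_PAGE;
                break;
        case TOY_RELEASE_PAGE:
                z->have_page = 0;
                z->zero_free++;         /* page goes back, marked zeroed */
                z->zeroed_count++;
                z->state = TOY_GET_PAGE;
                break;
        }
        return (z->state);
}
```

Starting in TOY_IDLE with two unzeroed pages, repeated toy_step() calls
walk GET_PAGE, ZERO_PAGE, RELEASE_PAGE twice over and then settle back in
TOY_IDLE with both pages counted as zeroed.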

The operation of the idle zero code can be monitored via sysctls - the 
sysctl vm.stats.vm.v_ozfod tracks the total number of zero-fill faults 
which found a zero-filled page waiting for them (vm.stats.vm.v_zfod tracks 
total zfod faults). The vm.idlezero_count sysctl tracks the total number
of pages the idle zero logic has managed to zero-fill.

Hopefully this was interesting,
-- vs