Update on FS corruption issues

Matthew Dillon dillon at apollo.backplane.com
Tue May 11 21:53:06 PDT 2004


    So far still no luck reproducing the problem, but I have done a major
    review of the VM system and I actually found some races which I am
    addressing tonight.

    The primary problem appears to be a lack of splvm/splbio protection
    on vm_page_lookup() calls.   Interrupts can free pages, removing them
    from the associated VM object.

    A good chunk of the code we inherited from FreeBSD seems to assume
    that it is sufficient to do this:

	m = vm_page_lookup(...);
	if (vm_page_sleep_busy(...))
	    goto try_again;
	vm_page_wire(...);

    The assumption here is that an interrupt can only free a PG_BUSY page
    (as part of the termination of an I/O), so if the page is not PG_BUSY
    it is safe to play with without spl protection.  This is probably true,
    but unfortunately there is still a race between the vm_page_lookup() call
    and the vm_page_sleep_busy() check.  If an interrupt occurs just after
    the lookup completes but before we check PG_BUSY, it can unbusy and
    free the page out from under us and we will be left with a broken page.
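
    To make the window concrete, here is a rough userland model of it,
    not kernel code.  Everything in it is an assumption made up for
    illustration: the *_sim functions, the one-page "object", and the
    interrupt_window() hook that stands in for "an interrupt could fire
    here".  It only shows why holding the spl across both the lookup and
    the busy check closes the race.

```c
#include <assert.h>
#include <stddef.h>

#define PG_BUSY 0x1

struct vm_page {
    int flags;        /* PG_BUSY while I/O is in flight */
    int wire_count;
    int in_object;    /* 1 while the page still belongs to its object */
};

enum outcome { WIRED_VALID, WIRED_FREED, PAGE_GONE };

static struct vm_page the_page;   /* hypothetical one-page "VM object" */
static int spl_level;             /* > 0: simulated interrupts are blocked */
static int irq_pending;           /* an I/O completion waiting to be delivered */

/* Simulated I/O completion: unbusy the page and free it from the object. */
static void io_done_interrupt(void)
{
    assert(spl_level == 0);       /* never delivered while blocked */
    the_page.flags &= ~PG_BUSY;
    the_page.in_object = 0;
}

/* Marks a point where a real interrupt could fire. */
static void interrupt_window(void)
{
    if (irq_pending && spl_level == 0) {
        irq_pending = 0;
        io_done_interrupt();
    }
}

/* Stand-ins for splvm()/splx(): raise the level, then restore it. */
static int splvm_sim(void) { return spl_level++; }
static void splx_sim(int s) { spl_level = s; interrupt_window(); }

static struct vm_page *page_lookup_sim(void)
{
    return the_page.in_object ? &the_page : NULL;
}

/*
 * Look up the page, wait while it is busy, then wire it.
 * use_spl == 0 models the unprotected sequence, use_spl != 0 holds the
 * simulated spl across the lookup and the busy check.
 */
enum outcome grab_page(int use_spl)
{
    /* Start state: page busy with I/O, completion interrupt about to fire. */
    the_page.flags = PG_BUSY;
    the_page.wire_count = 0;
    the_page.in_object = 1;
    spl_level = 0;
    irq_pending = 1;

    for (;;) {
        int s = use_spl ? splvm_sim() : 0;
        struct vm_page *m = page_lookup_sim();

        if (m == NULL) {
            if (use_spl)
                splx_sim(s);
            return PAGE_GONE;     /* caller must allocate or fault it in */
        }

        interrupt_window();       /* the gap between lookup and busy check */

        if (m->flags & PG_BUSY) {
            /* Sleeping lets the interrupt run; then retry the lookup. */
            if (use_spl)
                splx_sim(s);
            else
                interrupt_window();
            continue;
        }

        m->wire_count++;
        enum outcome r = m->in_object ? WIRED_VALID : WIRED_FREED;
        if (use_spl)
            splx_sim(s);
        return r;
    }
}
```

    In this model grab_page(0) wires a page that has already been freed
    (WIRED_FREED), while grab_page(1) sees the page as busy, sleeps,
    retries the lookup and finds the page gone (PAGE_GONE), which the
    caller can handle.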

    This issue has been with us for a very long time, so I do not think there
    is a high probability of it being related to the FS corruption problem
    we are having.  But it's possible that something in the new code is making
    these interrupt windows larger than the one or two instructions they
    were in FreeBSD.

						    -Matt





