Update on FS corruption issues
Matthew Dillon
dillon at apollo.backplane.com
Tue May 11 21:53:06 PDT 2004
So far still no luck reproducing the problem, but I have done a major
review of the VM system and I actually found some races which I am
addressing tonight.
The primary problem appears to be a lack of splvm/splbio protection
on vm_page_lookup() calls. Interrupts can free pages, removing them
from the associated VM object.
A good chunk of the code we inherited from FreeBSD seems to assume
that it is sufficient to do this:
    m = vm_page_lookup(...)
    if (vm_page_sleep_busy(...))
        goto try_again
    vm_page_wire(...);
The assumption here is that an interrupt can only free a PG_BUSY page
(as part of the termination of an I/O), so if the page is not PG_BUSY
it is safe to play with without spl protection. This is probably true,
but unfortunately there is still a race between the vm_page_lookup() call
and the vm_page_sleep_busy() check. If an interrupt occurs just after
the lookup completes but before we check PG_BUSY, it can unbusy and
free the page out from under us and we will be left with a broken page.
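
To make the window concrete, here is a small user-space simulation of the race, not kernel code. Everything in it is a hypothetical stand-in invented for illustration: struct page, sim_splvm(), sim_splx(), sim_lookup(), raise_intr() and the two *_path() functions do not exist in the kernel, and the "interrupt" is just a function call placed where the real one could land. It assumes the interrupt is an I/O completion that clears the busy state and frees the page, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VM page; not the kernel's struct vm_page. */
struct page {
    bool busy;   /* models PG_BUSY */
    bool freed;  /* set once the page has been removed from its object */
};

static struct page the_page;   /* the single page in our toy VM object */
static int spl_level;          /* > 0 means interrupts are masked */
static bool intr_pending;      /* an interrupt arrived while masked */

/* Models an I/O completion interrupt: unbusy the page, then free it. */
static void intr_handler(void)
{
    the_page.busy = false;
    the_page.freed = true;
}

/* Deliver the interrupt now, or latch it if interrupts are masked. */
static void raise_intr(void)
{
    if (spl_level > 0)
        intr_pending = true;
    else
        intr_handler();
}

/* Simulated spl calls: raise the level, return the previous one. */
static int sim_splvm(void)
{
    return spl_level++;
}

/* Restore the saved level and run any interrupt that was held off. */
static void sim_splx(int s)
{
    spl_level = s;
    if (spl_level == 0 && intr_pending) {
        intr_pending = false;
        intr_handler();
    }
}

/* Simulated lookup: a freed page is no longer found in the object. */
static struct page *sim_lookup(void)
{
    return the_page.freed ? NULL : &the_page;
}

static void reset(void)
{
    the_page.busy = true;    /* I/O in progress */
    the_page.freed = false;
    spl_level = 0;
    intr_pending = false;
}

/* Returns true if we end up holding a page that has already been freed. */
bool unprotected_path(void)
{
    reset();
    struct page *m = sim_lookup();
    assert(m != NULL);
    raise_intr();            /* interrupt lands between lookup and check */
    if (m->busy)
        return false;        /* would sleep and retry */
    return m->freed;         /* not busy, so we would wire a freed page */
}

/* Same sequence, but the lookup and the busy check sit inside spl. */
bool protected_path(void)
{
    reset();
    int s = sim_splvm();
    struct page *m = sim_lookup();
    assert(m != NULL);
    raise_intr();            /* held off until sim_splx() */
    bool broken = !m->busy && m->freed;
    sim_splx(s);             /* interrupt runs here, after the check */
    return broken;           /* page was still busy: caller would retry */
}
```

In this model unprotected_path() reports that a freed page would be wired, while protected_path() still sees the page busy, so the caller would sleep and look the page up again after the interrupt has run.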
This issue has been with us for a very long time, so I do not think there
is a high probability of it being related to the FS corruption problem
we are having. But it's possible that something in the new code is making
these interrupt windows larger than the one or two instructions they
were in FreeBSD.
-Matt