HAMMER lockup

Matthew Dillon dillon at apollo.backplane.com
Mon Jun 30 19:44:57 PDT 2008


    Ok, please test with the latest kernel & HAMMER commits.

    (Make backups of any critical data beforehand, just in case)

    The main issue I tracked down was a call path from the pageout
    daemon.  The pageout daemon packages up pages into a UIO_NOCOPY
    VOP_WRITE.  The filesystem then typically accesses the pages
    via the buffer cache and so calls getblk().

    Sometimes the getblk() covers more pages than the pageout daemon
    packaged up, requiring the additional pages to be allocated.
    This occurs more often with HAMMER because it uses a 64K block
    size when writing out large files.
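
    To make the size mismatch concrete, here is a minimal sketch of the
    arithmetic.  The 4K page size and both helper names are assumptions
    for illustration only; neither comes from the HAMMER or kernel source.

```c
#include <stddef.h>

/* Assumed page size: 4 KiB is common, but the post does not state it. */
#define ASSUMED_PAGE_SIZE 4096u

/* Number of pages spanned by one filesystem block, rounded up. */
static size_t pages_per_block(size_t block_size, size_t page_size)
{
    return (block_size + page_size - 1) / page_size;
}

/*
 * Hypothetical helper: pages the buffer still has to allocate when the
 * pageout daemon supplied only `pages_supplied` of the block's pages.
 */
static size_t extra_pages_needed(size_t block_size, size_t page_size,
                                 size_t pages_supplied)
{
    size_t total = pages_per_block(block_size, page_size);

    return pages_supplied >= total ? 0 : total - pages_supplied;
}
```

    Under that assumption a 64K block spans 16 pages, so a pageout that
    packaged up 4 pages leaves 12 more to be allocated at exactly the
    moment memory is short.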

    If the system has insufficient memory and these allocations fail,
    the pageout daemon can deadlock.

    The main fix I made is to allow the allocation of VM pages on
    behalf of the buffer cache to dig into the interrupt reserve,
    and to also attempt to free some VM pages from other clean buffers
    in the buffer cache.  When combined with the bwillwrite() work, which
    tries to guarantee that no more than half the buffers in the
    buffer cache are ever dirty, this *theoretically* should guarantee
    that getblk() calls made by a filesystem will never have to block on
    the VM system when allocating VM pages.
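
    The fallback idea can be sketched as a toy model.  This is not the
    kernel code: the struct, the enum, and the function name are all
    hypothetical, and the order in which the reserve and the clean
    buffers are tried is an assumption, since the post does not say.

```c
#include <stdbool.h>

/* Toy model of the page pools; every field here is hypothetical. */
struct toy_vm {
    int free_pages;        /* pages available to ordinary allocations */
    int interrupt_reserve; /* pages normally held back for interrupts */
    int clean_buf_pages;   /* pages owned by clean buffer cache buffers */
};

/* Where a page came from, or TOY_FAILED if every source was empty. */
enum toy_src { TOY_NORMAL, TOY_RESERVE, TOY_RECLAIMED, TOY_FAILED };

/*
 * Allocate one page on behalf of the buffer cache.  Falls back to the
 * interrupt reserve, then to reclaiming a page from a clean buffer
 * (assumed order), instead of blocking on the VM system.
 */
static enum toy_src toy_alloc_page(struct toy_vm *vm)
{
    if (vm->free_pages > 0) {
        vm->free_pages--;
        return TOY_NORMAL;
    }
    if (vm->interrupt_reserve > 0) {
        vm->interrupt_reserve--;
        return TOY_RESERVE;
    }
    if (vm->clean_buf_pages > 0) {
        vm->clean_buf_pages--;
        return TOY_RECLAIMED;
    }
    return TOY_FAILED;
}
```

    If at most half the buffers are dirty, at least half are clean, so
    in this model the last fallback should normally have something to
    reclaim.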

    With these fixes the console may spew out messages like this:

	"bio_page_alloc: WARNING emergency page allocation"

    This basically means 'I had to undertake the above emergency measures
    to avoid a potential deadlock'.  That's ok, and I will remove the
    message before the release.

    However, if the system spews out this:

	"bio_page_alloc: WARNING emergency page allocation failed"

    It means that my emergency measures failed to prevent the potential
    deadlock.  The code no longer just sleeps forever, though.  It will
    continue to retry, so the system may be able to recover from the
    situation, but my goal is for the above 'blah blah blah failed' message
    to never occur no matter what the situation.
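
    The retry-instead-of-sleep-forever behavior might look roughly like
    the sketch below.  All names are hypothetical, and the attempt limit
    exists only so the example terminates; the post describes the real
    code as simply continuing to retry.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true when a page was obtained. */
typedef bool (*toy_try_fn)(void *ctx);

/*
 * Call try_alloc until it succeeds or max_attempts is used up.
 * Returns the 1-based attempt that succeeded, or -1 if none did.
 */
static int toy_alloc_with_retry(toy_try_fn try_alloc, void *ctx,
                                int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_alloc(ctx))
            return attempt;
        /* A kernel would log the failure and pause briefly here. */
    }
    return -1;
}

/* Example callback: fails until the counter at ctx reaches zero. */
static bool toy_flaky_alloc(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return false;
    }
    return true;
}
```

    With a callback that fails twice, the third attempt succeeds; a
    callback that keeps failing exhausts the limit and yields -1.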

    If I have managed to fix the buffer cache to not block on the VM system,
    it will break the spiral of death that ends in a deadlock.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




