Stability status update 22 Jul 2005

Matthew Dillon dillon at apollo.backplane.com
Fri Jul 22 09:32:39 PDT 2005


    Hello everyone!  Just in case people have been wondering about the
    recent commits, four of us (Peter, David, Tomaz, YONETANI Tomokazu)
    have been making a hard push to stabilize the system on SMP boxes.

    Peter, David, and I have been focusing on an SMP crash that has plagued
    us for over two months (even longer, in fact) because we have had a hard
    time getting core dumps out of the SMP systems involved.  Tomaz and
    YONETANI Tomokazu and I have been working on issues with the IPS driver.

    Most of the commits in the last two days have been related to making
    SMP systems behave better during panics.  There have been many commits
    in the last few weeks to add KTR logging to critical subsystems and
    this has proved *invaluable* in helping us track down the bugs.

    In the last week we have made significant progress on the SMP
    crashes.  There turned out to be three significant bugs: the TCP
    sockbuf issues, bugs in the LWKT token code, and a nasty bug in the
    LWKT IPIQ (IPI messaging) code.

    The token code has been fixed.  The TCP mbuf/sockbuf code is going
    to get a patch commit tonight for stabilization purposes and will then
    be reworked to clean it up.  And, just yesterday, I believe I
    *finally* found the smoking gun related to the IPIQ crashes.  It looks
    like an index comparison bug was resulting in old IPIQ message entries
    being re-executed on the target cpu.  You can imagine the absolute
    havoc that this would cause on a system(!).
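
    To illustrate the general class of bug described above, here is a
    minimal sketch of a software FIFO with free-running read and write
    indices.  It is NOT the actual LWKT IPIQ code and it does not claim to
    reproduce the real bug or its load-dependent trigger; every name in it
    (struct ipiq, ipiq_send, ipiq_drain, ipiq_drain_buggy, count_hit,
    FIFO_SIZE) is hypothetical.  It only shows how a single wrong index
    comparison in the drain loop can run a stale slot a second time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_SIZE 8u                    /* hypothetical size, power of two */
#define FIFO_MASK (FIFO_SIZE - 1u)

typedef void (*ipi_func_t)(void *arg);

/* Hypothetical per-cpu message queue, not the real DragonFly structure. */
struct ipiq {
    uint32_t   windex;                  /* free-running, advanced by sender */
    uint32_t   rindex;                  /* free-running, advanced by target */
    ipi_func_t func[FIFO_SIZE];
    void      *arg[FIFO_SIZE];
};

/* Example handler: bumps the int counter that arg points to. */
void count_hit(void *arg)
{
    (*(int *)arg)++;
}

/* Queue one message.  Returns 0 on success, -1 if the FIFO is full. */
int ipiq_send(struct ipiq *q, ipi_func_t f, void *arg)
{
    assert(f != NULL);
    if (q->windex - q->rindex >= FIFO_SIZE)
        return -1;
    q->func[q->windex & FIFO_MASK] = f;
    q->arg[q->windex & FIFO_MASK] = arg;
    q->windex++;
    return 0;
}

/* Correct drain: stop as soon as the read index catches the write index. */
int ipiq_drain(struct ipiq *q)
{
    int ran = 0;
    while (q->rindex != q->windex) {
        uint32_t slot = q->rindex & FIFO_MASK;
        q->func[slot](q->arg[slot]);
        q->rindex++;
        ran++;
    }
    return ran;
}

/*
 * Buggy drain (illustrative assumption): "<=" instead of "!=" runs one
 * slot past the write index.  Once the ring has been used, that slot
 * still holds an old message, which gets executed again.  It also
 * leaves rindex ahead of windex, so the queue state is corrupt.
 */
int ipiq_drain_buggy(struct ipiq *q)
{
    int ran = 0;
    while (q->rindex <= q->windex) {
        uint32_t slot = q->rindex & FIFO_MASK;
        q->func[slot](q->arg[slot]);
        q->rindex++;
        ran++;
    }
    return ran;
}
```

    In this sketch, filling the FIFO, draining it correctly, and then
    sending one more message leaves exactly one message pending, yet
    ipiq_drain_buggy() runs two handlers: the new message and a stale one
    left over in the next slot.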

    It takes a good week's worth of testing to detect that particular bug
    because it can only occur in certain heavily-loaded cases when the
    IPIQ's software FIFO fills up, so we won't know if we've nailed it for
    sure for a few days.  However, I will be committing the fix for it
    tonight anyway.

    This isn't to say that the kernel is bug-free, and in fact Tomaz just
    located another (hopefully unrelated) bug.  But once these main line 
    items are fixed I believe we will be well positioned to move forward
    with new work.  The kernel is far better instrumented now than it was
    a month ago, that's for sure!

    If all goes well the 'preview' tag will also be moved early next week.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>