DragonFly 1.2.6 diskless kernel hang

Matthew Dillon dillon at apollo.backplane.com
Thu Nov 10 12:25:05 PST 2005


:Howdy!	
:
:	I have a FreeBSD 4.10 based webserver cluster/farm thang
:on Intel SMP hardware and am looking for a migration path to a threaded
:kernel. We are considering a DragonFly solution when it becomes production
:ready and would like to participate in its development/testing.
:
:	So .. I built a DragonFly 1.2.6 diskless webserver that
:boots off of a DragonFly Boot server and put it into our front end
:server pool. It NFS mounts all data readonly from FreeBSD 4.10 file
:servers.
:
:
:	It ran fine for 2 weeks then it hung and became unresponsive.
:
:
:	In the log I saw this beginning about 1 day before the server
:hang.
:
:
:
:Nov  8 14:42:32 <1.2> 10.1.6.2 kernel: [diagnostic] cache_lock: blocked on 0xd3949bf8 "wedding_pic1.jpg"
:
:	[snip] (every few minutes)
:
:Nov  9 13:29:29 <1.2> 10.1.6.2 kernel: [diagnostic] cache_lock: blocked on 0xd3949bf8 "wedding_pic1.jpg"
:
:	then 
:
:Nov  9 15:46:08 <1.2> 10.1.6.2 kernel: All mbuf clusters exhausted, please see tuning(7).
:Nov  9 15:47:32 <1.2> 10.1.6.2 kernel: All mbuf clusters exhausted, please see tuning(7).
:
:	etc...
:
:
:
:	I have seen some logging of 
:
:
:Nov  9 13:29:29 <1.2> 10.1.6.2 kernel: [diagnostic] cache_lock: blocked on 0x??????? "some_file"
:
:	occasionally, during the previous weeks, but it was always
:followed by a matching "unblocked" log entry. This time the file is
:never "unblocked" and the server spirals down.

    Yah, you figured it right.  If there is an unblocked entry everything
    is fine, otherwise there's a problem.
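
A rough way to apply that rule to a saved syslog file is sketched below in Python. The "blocked" pattern is taken from the excerpt quoted above. The exact text of the "unblocked" entry is not shown in this thread, so the sketch assumes it contains "cache_lock: unblocked" and ends with the file name, optionally in double quotes. The function names are hypothetical.

```python
import re

# Shape of the "blocked" entry, as it appears in the quoted log excerpt.
BLOCKED = re.compile(r'cache_lock: blocked on (0x[0-9a-fA-F]+) "([^"]*)"')

# ASSUMPTION: the "unblocked" entry is not shown in the thread. We assume it
# contains this marker and that the file name is the last token on the line.
UNBLOCKED_MARKER = "cache_lock: unblocked"


def unblocked_name(line):
    """Return the file name from an assumed 'unblocked' entry, or None."""
    idx = line.find(UNBLOCKED_MARKER)
    if idx < 0:
        return None
    rest = line[idx + len(UNBLOCKED_MARKER):].strip()
    if not rest:
        return None
    return rest.split()[-1].strip('"')


def still_blocked(lines):
    """Map each name whose latest 'blocked' entry has no later
    'unblocked' entry to that latest 'blocked' log line."""
    pending = {}
    for line in lines:
        m = BLOCKED.search(line)
        if m:
            pending[m.group(2)] = line.rstrip("\n")
            continue
        name = unblocked_name(line)
        if name is not None:
            pending.pop(name, None)
    return pending
```

Feeding it the lines of the log file (for example `still_blocked(open(path))`, with `path` pointing at wherever the syslog output is kept) leaves only the names that were blocked and never released.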

    In this case it looks like the system ran out of mbufs, which probably
    stalled the NFS session and prevented the vnode lock from being released.

    The question is... why did the system run out of mbufs?  The most likely
    cause is an mbuf leak in the networking code, though it could be
    something else.

    It is also possible that the system simply does not have enough mbufs
    configured for the number of parallel connections being handled.  If this
    is the cause then increasing the number of mbuf clusters should solve
    the problem.
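
One hedged way to tell the two cases apart is to sample mbuf cluster usage over time and look at the quiet-period low points. With a leak the low points keep climbing; with an undersized pool they return to roughly the same level once load drops. The Python sketch below assumes the samples have already been collected as (seconds, clusters_in_use) pairs, for example by periodically recording the figure reported by `netstat -m`. Collection and parsing are not shown, and the function names and thresholds are hypothetical.

```python
def window_minimums(samples, window):
    """Return the lowest clusters-in-use value seen in each consecutive
    time window. `samples` is a list of (seconds, clusters_in_use) pairs
    sorted by time; `window` is the window length in seconds."""
    if not samples:
        return []
    start = samples[0][0]
    mins = {}
    for t, used in samples:
        k = int((t - start) // window)
        mins[k] = min(used, mins.get(k, used))
    return [mins[k] for k in sorted(mins)]


def looks_like_leak(samples, window, min_windows=3):
    """Heuristic only: report True when the per-window low points rise
    strictly from one window to the next across at least `min_windows`
    windows. This is a hint for further investigation, not proof."""
    lows = window_minimums(samples, window)
    return len(lows) >= min_windows and all(
        later > earlier for earlier, later in zip(lows, lows[1:])
    )
```

A window of about a day suits a web front end with a daily traffic cycle. If the heuristic stays False while the exhaustion messages continue, raising the number of mbuf clusters is the more plausible fix.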


:	The server is still up though inaccessible. I have compiled
:in a debugger and have serial console access. If there is any useful
:information that can be gained/salvaged from this situation I could
:get it with some instruction.
:
:	Or if there are any ideas about DragonFly versions to test
:I would consider them. Eventually, as a first step I would like
:to run DFly with Apache 2.x. on some of our front ends. These are
:completely diskless machines with 3 network interfaces.
:
:	Is DFly ready yet for this type of testing? I know the
:filesystem/disk stuff is still quite new, but what about in
:a simple diskless readonly webserver with 3 NICs? We are
:looking for greater performance than FBSD in an SMP context.
:
:
:	thanx - steve

    I would like to see DragonFly tested in these sorts of situations.
    Being able to run diskless, reliably, is very important.

    Is it possible to hook up a local IDE disk *just* so you can get a
    crash dump?  It might be possible to track down the leak from that.

    It would also be beneficial to try running the latest HEAD (meaning a
    complete reinstall).  A great deal of work has been done in HEAD.  I'm
    hoping that the leak has been fixed and we simply forgot to MFC it.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>




