DragonFly 1.2.6 diskless kernel hang

Thu Nov 10 12:59:12 PST 2005

On Thu, Nov 10, 2005 at 12:15:36PM -0800, Matthew Dillon wrote:
> :
> :	occasionally, during the previous weeks, but it was always
> :followed by a matching "unblocked" log entry. This time the file is
> :never "unblocked" and the server spirals down.
> 
>     Yah, you figured it right.  If there is an unblocked entry everything
>     is fine, otherwise there's a problem.
> 
>     In this case it looks like the system ran out of mbufs, which probably
>     stalled the NFS session and prevented the vnode lock from being released.

	The "cache_block: blocked on" messages are logged for about a day.
Then the mbuf exhaustion appears. From that point on, there are no
more "cache_block: " references.

	So I'm figuring that mbuf exhaustion is a symptom, not a cause.
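
	A rough way to check the log for "blocked on" entries that never
get a matching "unblocked" is sketched below. The exact message layout
is an assumption (a name following "blocked on" and "unblocked"), and
the function name and regexes are hypothetical, so adjust them to the
real kernel messages.

```python
import re

# Assumed (hypothetical) line shapes; adjust to the real kernel messages:
#   ... cache_block: blocked on <name>
#   ... cache_block: unblocked <name>
BLOCKED = re.compile(r"cache_block: blocked on\s+(\S+)")
UNBLOCKED = re.compile(r"cache_block: unblocked\s+(\S+)")


def unmatched_blocks(lines):
    """Return {name: line number} for 'blocked on' entries never unblocked."""
    pending = {}
    for lineno, line in enumerate(lines, 1):
        m = BLOCKED.search(line)
        if m:
            # Remember the first time this name was reported as blocked.
            pending.setdefault(m.group(1), lineno)
            continue
        m = UNBLOCKED.search(line)
        if m:
            # A matching "unblocked" entry clears the pending record.
            pending.pop(m.group(1), None)
    return pending


if __name__ == "__main__":
    import sys
    # Usage: python unmatched.py < /var/log/messages
    for name, lineno in unmatched_blocks(sys.stdin).items():
        print(f"line {lineno}: {name} blocked, never unblocked")
```

	Anything it prints was blocked and never released, which is the
case to compare against the time the mbuf exhaustion first shows up.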

> 
>     The question is... why did the system run out of mbufs?  It could be an
>     mbuf leak in the networking code or something else.  The most likely
>     cause is an mbuf leak in the networking code.
> 
>     It is also possible that the system simply does not have enough mbufs
>     configured for the number of parallel connections being handled.  If this
>     is the cause then increasing the number of mbuf clusters should solve
>     the problem.


	There are 32k mbuf clusters and 131072 mbufs, the same as on
the FBSD 4.10 machines in the pool. The pool has about 150 days
uptime and the FBSD machines have about 20k peak mbufs/mbuf clusters.
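
	For comparing peaks against the configured limits across machines,
a small parser for `netstat -m` output could look like the sketch below.
It assumes the FreeBSD 4.x style lines of the form
"N/N/N mbufs in use (current/peak/max):"; the function name is
hypothetical and the sample numbers in the comment are made up.

```python
import re

# Assumed FreeBSD 4.x style `netstat -m` lines, e.g. (made-up numbers):
#   258/1056/131072 mbufs in use (current/peak/max):
#   256/552/32768 mbuf clusters in use (current/peak/max)
USAGE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+)\s+(mbufs|mbuf clusters) in use", re.M
)


def parse_mbuf_usage(text):
    """Return current/peak/max counts per kind, plus peak as a % of max."""
    usage = {}
    for cur, peak, limit, kind in USAGE.findall(text):
        cur, peak, limit = int(cur), int(peak), int(limit)
        usage[kind] = {
            "current": cur,
            "peak": peak,
            "max": limit,
            "peak_pct": 100.0 * peak / limit,
        }
    return usage


if __name__ == "__main__":
    import subprocess
    # Runs the real command; only works where `netstat -m` is available.
    out = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True
    ).stdout
    for kind, u in parse_mbuf_usage(out).items():
        print(f"{kind}: peak {u['peak']} of {u['max']} ({u['peak_pct']:.1f}%)")
```

	Logged periodically, this would show whether usage climbs steadily
(which would point at a leak) or only spikes under load.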

> 
>     I would like to see DragonFly tested in these sorts of situations.
>     Being able to run diskless, reliably, is very important.
> 
>     Is it possible to hook up a local IDE disk *just* so you can get a
>     crash dump?  It might be possible to track down the leak.
> 


	No, this machine is only big enough for a motherboard and
power supply. But if I switch to HEAD, would it be better to use
a machine with a disk just for crash dumps?


>     It would also be beneficial to try running the latest HEAD (meaning a
>     complete reinstall).  A great deal of work has been done in HEAD.  I'm
>     hoping that the leak has been fixed and we simply forgot to MFC it.
> 

	Seems running HEAD is the preferable next step.

	I'm assuming that I can build HEAD on 1.2.6, run HEAD on the
webserver, and boot/run from a 1.2.6 bootserver with no problem. The
NFS-mounted webserver base system is independent of the bootserver's base.

	Is there any particular point in HEAD's devel that I should
get, or is the latest ok?


	-steve