URGENT: [diagnostic] cache_lock: blocked on 0xda5265a8 ""

YONETANI Tomokazu qhwt+dfly at les.ath.cx
Sat Aug 18 20:09:33 PDT 2007


On Sun, Aug 19, 2007 at 10:29:18AM +1000, elekktretterr at exemail.com.au wrote:
>  UID   PID  PPID CPU PRI  NI   VSZ  RSS WCHAN  STAT  TT       TIME COMMAND
>     0   601     1   0 152   0  3844 2220 select SLs   ??    1:53.42 sendmail: accepting connections (sendmail)
>    25   605     1   2 153   0  3736 1940 pause  ILs   ??    0:01.79 sendmail: Queue runner at 00:30:00 for /var/spool/clientmqueue (sendmail)
>     0   625     1   0 152   0  1240  784 nanslp ILs   ??    0:11.96 /usr/sbin/cron
				:
>     0 98882     1   0 152   0  3296 1756 select SLs   ??    4:02.07 /usr/pkg/libexec/postfix/master
>  1004 98884 98882   0 152   0  3432 1908 select SL    ??    0:59.52 qmgr -l -t fifo -u

Probably unrelated to the `cache blocked' problem, but do you run
sendmail and postfix at the same time?

> >      To really debug the problem you need to generate a kernel panic and
> >      kernel core so we can track down where the problem occurred.  This is
> >      usually accomplished by dropping it into the debugger and manually
> >      panicking the system.  Swap has to be at least as large as main memory
> >      and dumpdev has to be set in /etc/rc.conf, e.g.:
> >
> >      dumpdev="/dev/ad0s1b"
> 
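
The two preconditions quoted above can be checked before relying on a
dump.  This is only a rough sketch: the helper names swap_fits_dump and
rc_dumpdev are made up for illustration, and both sizes are assumed to
be given in bytes.

```shell
# Sketch only: helper names are hypothetical, not part of any system tool.

# Succeeds if swap (bytes, $2) is at least as large as main memory (bytes, $1).
swap_fits_dump() {
    physmem=$1
    swap=$2
    [ "$swap" -ge "$physmem" ]
}

# Prints the last dumpdev value set in an rc.conf-style file ($1, defaulting
# to /etc/rc.conf); prints nothing if dumpdev is not set there.
rc_dumpdev() {
    conf=${1:-/etc/rc.conf}
    sed -n 's/^dumpdev="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$conf" | tail -n 1
}
```

For example, `swap_fits_dump "$(sysctl -n hw.physmem)" "$swap_bytes"`,
where swap_bytes is whatever your swap listing reports, converted to
bytes.
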
> Yes, but will ssh not die if I drop into the debugger and manually panic
> the system? In other words, if I panic it and ssh dies, is there any way
> to restart the system while not at the physical location of the server? I'm
> most afraid of the system hanging on syncing disks while doing a
> soft reboot, as it has done previously when this problem happened, and then
> someone has to manually push the button.

I wrote a patch some time ago to dump core without panicking:
  http://les.ath.cx/DragonFly/dumpnow.diff.gz

It starts dumping when you set `sysctl debug.dumpnow=1'.  The network and
other things become unresponsive during the dump (as they do during a
panic).  If dumping takes more than several minutes, you will probably
need to make a new ssh connection to the server.

Of course you're advised to test this patch on a spare machine with
identical equipment (same controller, mainboard, ...) before using it on
the production machine if it's a big problem for you to have the power
button pressed.

BTW this feature doesn't work if securelevel > 0.  You can, however, make
it work by removing CTLFLAG_SECURE from the following chunk in the patch:

+SYSCTL_PROC(_debug, OID_AUTO, dumpnow,
+           CTLTYPE_INT | CTLFLAG_WR | CTLFLAG_SECURE, &dumpnow, 0,
+           sysctl_debug_dumpnow, "I", "call dumpsys() now");
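
To avoid a refused write on a machine running at a raised securelevel, the
check can be made before triggering the dump.  A minimal sketch, assuming
the patch above is applied (debug.dumpnow does not exist otherwise); the
function names are made up for illustration:

```shell
# Sketch only: function names are hypothetical, and debug.dumpnow is
# available only on a kernel built with the dumpnow patch.

# Succeeds if the given securelevel ($1) still permits writing a sysctl
# marked CTLFLAG_SECURE, i.e. the level is 0 or lower.
securelevel_allows_dump() {
    [ "$1" -le 0 ]
}

# Trigger the dump only when the current securelevel permits it;
# otherwise print the reason to stderr and fail.
dump_now() {
    level=$(sysctl -n kern.securelevel) || return 1
    if securelevel_allows_dump "$level"; then
        sysctl debug.dumpnow=1
    else
        echo "kern.securelevel is $level (> 0); debug.dumpnow is refused" >&2
        return 1
    fi
}
```

Running `dump_now' as root then either starts the dump or tells you why
it cannot.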

Cheers.




