dillon at apollo.backplane.com
Tue Nov 10 08:46:45 PST 2009
:ever since the last time I had CRC problems on my router box, I've
:developed the habit of doing a daily 'hammer -f /dev/ad4s1d show |& grep
:"^B"' to see if any new errors crept up, and today I found:
:yoyodyne# hammer -f /dev/ad4s1d show |& grep "^B"
:B dataoff=a00000714d120000/65536 crc=7e4f7545
:B dataoff=a000007171380000/65536 crc=616b1cc1
The question is whether it is real or not. If the filesystem is
mounted live then the show command could be catching things in
:Console log for the recent days is:
:Nov 7 03:15:19 <kern.crit> yoyodyne kernel: HAMMER: Warning: rebalance
:caught race against propagate
None of those are serious. Basically just debug messages that will
be removed soon. The emergency page allocation for BIO is unrelated
to the filesystem code. It's also actually just a warning (telling me
that something is eating too many free VM pages).
:So my question is: What are my next steps in order to help resolve this
:issue? Is there any way to get e.g. to the names of the files affected
:by this problem from the data which is output by 'hammer show'?
:So far the only thing I've done is to disable nightly hammer cleanup
:because DragonFly, upon encountering a CRC error, will unfortunately
:simply drop to the debugger without panicking, so this doesn't get caught
:by DDB_UNATTENDED as far as I can tell (Matt, are there any plans to
:change this unpleasant behavior?). And I won't be near that box until
I fixed the behavior in current. There is now a sysctl which
controls whether it drops into the debugger or not (and it does not
by default). Though it doesn't panic... maybe the sysctl should be
modified to give it the ability to panic instead of propagating an
error code up the call chain. The filesystem still drops into
read-only mode if an error is encountered.
What you want to do now is run 'hammer -f ... show | less -B' and
search for B, as in '/^B'. less -B uses a fixed buffer so if you
scroll down you basically cannot scroll back up (by much), which allows
you to pipe gigabytes and gigabytes of text through it without it
malloc()ing itself into oblivion. You want to try to find the problem
area and get more context out of it, such as the object id, and also
determine whether the problem area is real or not.
Again the filesystem has to be idle and it would be even better if it
were offline entirely.
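
If paging through the output by hand gets tedious, the same search can
be sketched as a small filter. This is only an illustration, not part of
hammer: the script name (bscan.py), the function names, and the 20-line
context window are all made up for the example. The only things it
assumes about the 'hammer show' output are what the lines quoted above
show, namely that flagged records start with "B" in column 0 and carry
dataoff=<hex>/<bytes> and crc=<hex> fields. Like less -B it keeps only a
small fixed window of earlier lines, so memory use stays flat no matter
how much text goes through it.

```python
import re
import sys
from collections import deque

# Assumption: flagged records start with "B" in column 0, as in the
# grep "^B" output quoted above. Nothing else about the layout of the
# 'hammer show' output is interpreted here.
FLAGGED = re.compile(r"^B")
# Fields as they appear in the quoted lines: dataoff=<hex>/<bytes> crc=<hex>
FIELDS = re.compile(r"dataoff=([0-9a-f]+)/(\d+)\s+crc=([0-9a-f]+)")


def scan(lines, context=20):
    """Yield (line_number, preceding_lines, flagged_line, fields).

    Only the last `context` lines are held in memory, so the input can
    be arbitrarily large. `fields` is (dataoff, length, crc) as strings,
    or None if the flagged line does not contain those fields.
    """
    recent = deque(maxlen=context)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if FLAGGED.match(line):
            m = FIELDS.search(line)
            yield number, list(recent), line, (m.groups() if m else None)
        recent.append(line)


def main(argv):
    # Hypothetical command line: a file name, or "-" to read stdin.
    if len(argv) != 2:
        print("usage: bscan.py FILE|-", file=sys.stderr)
        return
    stream = sys.stdin if argv[1] == "-" else open(argv[1])
    with stream:
        for number, before, line, fields in scan(stream):
            print(f"--- flagged record at line {number} ---")
            for prev in before:
                print("  " + prev)
            print("> " + line)
            if fields:
                dataoff, length, crc = fields
                print(f"  dataoff={dataoff} length={length} crc={crc}")


if __name__ == "__main__":
    main(sys.argv)
```

It would be run as 'hammer -f ... show | python3 bscan.py -'. The lines
printed ahead of each flagged record are where to look for context such
as the object id. The same caveat applies: on a live mount a hit may not
be real.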
<dillon at backplane.com>