
From: Matthew Dillon <dillon at>
Subject: Re: I/O errors on Hammer volume
Date: Fri, 16 Apr 2010 15:02:54 -0700 (PDT)

    I think I found the smoking gun, but I can't be sure until I see
    the hammer show output from Francois.

    I noticed that the bad CRC'd records typically fell around element
    30-32 in the B-Tree node (out of ~64 elements).  That is, the middle
    of the node.

    This implies a race between the reblocker/rebalancer and a node split
    during an insertion, or a race between the reblocker and the rebalancer.

    I am testing a fix now and I am not 100% sure that this was the issue,
    but there are a lot of things pointing to it:

    * In both Jan's and Francois's cases the inodes that got corrupted
      were in areas of the filesystem under a heavy write/create/delete
      load.

    * The corrupted records appear to nearly always be in the middle of
      the B-Tree node, which implies a race against an insertion or a
      rebalancing operation while the reblocker is running.

    * And I found a bug in the reblocker that was exposed by recent work
      (the work itself was not buggy, it just exposed the bug that already
      existed) whereby the reblocker may reblock an element after relocking
      the node but without properly checking that the element is still valid.

    Jan, I think you can test this with your psql test, after you reformat
    that volume for real and start fresh.  You should be able to test this
    by running a continuous hammer reblock operation on the data while you
    are running the database test and see if corruption ultimately occurs.

    I have included my proposed patch/fix below but please do not apply
    it yet.  I want to try to reproduce the corruption here to actually
    test whether this fixes the issue or not.

    Once we fix the issue I'll have to work up a procedure to repair any
    broken filesystems.  Locating breakage is really easy: the hammer
    show and hammer checkmap commands can be used.  Fixing it, short
    of copying the data off the filesystem, may be more difficult.

    Jan, I am convinced that it is NOT a problem with the age of the
    hard drive or IDE interface.


diff --git a/sys/vfs/hammer/hammer_reblock.c b/sys/vfs/hammer/hammer_reblock.c
index 76ea6a8..c6cb937 100644
--- a/sys/vfs/hammer/hammer_reblock.c
+++ b/sys/vfs/hammer/hammer_reblock.c
@@ -130,6 +130,7 @@ retry:
 		 * Internal or Leaf node
+		KKASSERT(cursor.index < cursor.node->ondisk->count);
 		elm = &cursor.node->ondisk->elms[cursor.index];
 		reblock->key_cur.obj_id = elm->base.obj_id;
 		reblock->key_cur.localization = elm->base.localization;
@@ -144,6 +145,10 @@ retry:
 		 * If there is insufficient free space it may be due to
 		 * reserved bigblocks, which flushing might fix.
+		 * We must force a retest in case the unlocked cursor is
+		 * moved to the end of the leaf, or moved to an internal
+		 * node.
+		 *
 		 * WARNING: See warnings in hammer_unlock_cursor() function.
 		if (hammer_checkspace(trans->hmp, slop)) {
@@ -152,10 +157,11 @@ retry:
+			cursor.flags |= HAMMER_CURSOR_RETEST;
 			hammer_flusher_wait(trans->hmp, seq);
 			seq = hammer_flusher_async(trans->hmp, NULL);
-			continue;
+			goto skip;
@@ -198,11 +204,10 @@ retry:
 		if (error == 0) {
 			error = hammer_btree_iterate(&cursor);
 	if (error == ENOENT)
 		error = 0;
