
From: Matthew Dillon <dillon at>
Subject: Re: I/O errors on Hammer volume
Date: Fri, 16 Apr 2010 15:02:54 -0700 (PDT)

    I think I found the smoking gun, but I can't be sure until I see
    the hammer show output from Francois.

    I noticed that the bad CRC'd records typically fell around element
    30-32 in the B-Tree node (out of ~64 elements).  That is, the middle
    of the node.

    This implies a race between the reblocker/rebalancer and a node split
    during an insertion, or a race between the reblocker and the rebalancer.

    I am testing a fix now and I am not 100% sure that this was the issue,
    but there are a lot of things pointing to it:

    * In both Jan's and Francois's cases the inodes that got corrupted
      were in areas of the filesystem under a heavy write/create/delete
      load.

    * The corrupted records appear to nearly always be in the middle of
      the B-Tree node, which implies a race against an insertion or a
      rebalancing operation while the reblocker is running.

    * And I found a bug in the reblocker that was exposed by recent work
      (the work itself was not buggy, it just exposed the bug that already
      existed) whereby the reblocker may reblock an element after relocking
      the node but without properly checking that the element is still valid.

    Jan, I think you can test this with your psql test, after you reformat
    that volume for real and start fresh.  You should be able to test this
    by running a continuous hammer reblock operation on the data while you
    are running the database test and see if corruption ultimately occurs.

    I have included my proposed patch/fix below but please do not apply
    it yet.  I want to try to reproduce the corruption here to actually
    test whether this fixes the issue or not.

    Once we fix the issue I'll have to work up a procedure to repair any
    broken filesystems.  Locating breakage is really easy: the hammer
    show and hammer checkmap commands can be used.  Fixing it, short
    of copying the data off the filesystem, may be more difficult.

    Jan, I am convinced that it is NOT a problem with the age of the
    hard drive or IDE interface.


diff --git a/sys/vfs/hammer/hammer_reblock.c b/sys/vfs/hammer/hammer_reblock.c
index 76ea6a8..c6cb937 100644
--- a/sys/vfs/hammer/hammer_reblock.c
+++ b/sys/vfs/hammer/hammer_reblock.c
@@ -130,6 +130,7 @@ retry:
 		 * Internal or Leaf node
+		KKASSERT(cursor.index < cursor.node->ondisk->count);
 		elm = &cursor.node->ondisk->elms[cursor.index];
 		reblock->key_cur.obj_id = elm->base.obj_id;
 		reblock->key_cur.localization = elm->base.localization;
@@ -144,6 +145,10 @@ retry:
 		 * If there is insufficient free space it may be due to
 		 * reserved bigblocks, which flushing might fix.
+		 * We must force a retest in case the unlocked cursor is
+		 * moved to the end of the leaf, or moved to an internal
+		 * node.
+		 *
 		 * WARNING: See warnings in hammer_unlock_cursor() function.
 		if (hammer_checkspace(trans->hmp, slop)) {
@@ -152,10 +157,11 @@ retry:
+			cursor.flags |= HAMMER_CURSOR_RETEST;
 			hammer_flusher_wait(trans->hmp, seq);
 			seq = hammer_flusher_async(trans->hmp, NULL);
-			continue;
+			goto skip;
@@ -198,11 +204,10 @@ retry:
 		if (error == 0) {
 			error = hammer_btree_iterate(&cursor);
 	if (error == ENOENT)
 		error = 0;
