GSOC: Device mapper mirror target

Venkatesh Srinivas me at endeavour.zapto.org
Thu Apr 7 09:01:28 PDT 2011


On Thu, Apr 7, 2011 at 10:27 AM, Adam Hoka <adam.hoka at gmail.com> wrote:
> Please see my proposal:
>
> http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/ahoka/1#

Hi!

I'll take a look at your proposal in just a bit. In the meantime, here
are some things you might want to think about when looking at RAID1:
some details about how I planned to do dmirror, and why I think RAID1
is a much more difficult problem than it seems at first glance.

Imagine a RAID1 of two disks, A and B; you have an outstanding set of
I/O operations, buf1, buf2, buf3, buf4*, buf5, buf6, buf7, buf8*. The
BUFs are a mix of READs and WRITEs. At some point, your friendly
neighborhood DragonFly developer walks over and pulls the plug on your
system (you said you were running NetBSD! It's a totally valid excuse!
:))

Each of the write bufs could be totally written, partially written, or
not written at all to each of the disks. More importantly, each disk
could have seen and completed (or not completed) the requests in a
different order. And this reordering can happen after the buf has been
declared done and biodone() has been called (and we've reported
success to userland), because of reordering or coalescing at the drive
controller or in the drive itself, for example.

So in this case, let's say disk A had seen and totally written buf2 and
partially written buf1, while disk B had seen and totally written buf1
and not seen buf2 at all. And we'd already reported success to the
filesystem above.

So when we've chastised the neighborhood DragonFly developer and
powered the system back on, we have a problem: two halves of a RAID
mirror that are not in sync. The simplest way to sync them would be to
declare one of the two disks correct and copy it over the other
(possibly optimizing the copy with a block bitmap, as you suggested
and as Linux's MD raid1 (among many others) implements; block bitmaps
are more difficult than they seem at first [1]).

So let's declare disk A correct and copy it over disk B. Now disk B's
stale copy of buf2's block is overwritten with the correct copy from
disk A, but disk B's correct, up-to-date copy of buf1's block is
overwritten with disk A's partially written, scrambled version. This is
not okay, because we'd already reported success at writing both buf1
and buf2 to the filesystem above.
Oops.

This failure mode has always been possible in single-disk
configurations where write reordering is possible; file systems have
long had a solitary tool to fight the chaos, BUF_CMD_FLUSH. A FLUSH
BUF acts as a barrier: it does not return until all prior requests
have completed and hit media, and it does not allow requests beyond
the FLUSH point to proceed until all requests prior to the barrier are
complete [2]. The problem multi-disk arrays face, however, is that
disks FLUSH independently [3: important sidebar if you run UFS]. A
FLUSH on disk X says nothing about the state of disk Y, and nothing
about which disk to trust after a power cycle.
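
For concreteness, here is roughly what issuing such a FLUSH barrier
against one backing vnode might look like from dmirror; the pbuf usage
and the biowait() signature are assumptions on my part, so treat it as
a shape rather than working code:

/*
 * Sketch only: issue a FLUSH to one backing device and wait for it.
 * Assumes a pbuf is acceptable here and that biowait() takes a bio.
 */
static int
dmirror_issue_flush(struct vnode *vp)
{
	struct buf *bp;
	int error;

	bp = getpbuf(NULL);
	bp->b_cmd = BUF_CMD_FLUSH;	/* barrier: prior writes must hit media */
	bp->b_bcount = 0;
	bp->b_bio1.bio_offset = 0;	/* offset is irrelevant for a flush */
	vn_strategy(vp, &bp->b_bio1);
	error = biowait(&bp->b_bio1, "dmflush");
	relpbuf(bp, NULL);
	return (error);
}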

---

The dmirror design I was working on solved the problem through
overwhelming force -- adding a physical journal and a header sector to
each device. Each device would log, to its journal, all of the blocks
it was going to write. It would then complete a FLUSH request to
ensure the journal blocks had hit disk. Only then would we update the
blocks we'd meant to. After we updated the target blocks, we would
issue another FLUSH. Then we'd update a counter in a special header
sector. [Assumption: writes to single sectors on disk are atomic and
survive DragonFly developers removing power.] Each journal entry would
be tagged with (the value of the header counter)+1 at the time it was
written. To know whether a journal entry itself was correctly written,
each entry would also include a checksum of the update it was going to
carry out.
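
To illustrate, the on-disk journal entry and header might look
something like this -- the field names and sizes are purely
hypothetical, I had not pinned down a layout:

/* Hypothetical on-disk layouts; illustrative only. */
struct dmirror_journal_entry {
	uint64_t	counter;	/* header counter + 1 at the time of the write */
	uint64_t	blkno;		/* target block the payload will overwrite */
	uint32_t	length;		/* length of the data payload in bytes */
	uint32_t	cksum;		/* checksum of the payload, so a torn
					 * journal write is detectable on replay */
	/* data payload follows */
};

struct dmirror_header {
	uint8_t		uuid[16];	/* identity of this mirror member */
	uint64_t	counter;	/* bumped only after data + FLUSH complete */
};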

The recovery path would use the header's counter field to determine
which disk was most current. It would then replay the necessary
journal entries (those with a counter > header->counter) to bring that
device into sync (perhaps it would only replay them into in-memory
overlay blocks; I hadn't decided) and then sync that disk onto all of
the others.
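
The shape of that recovery pass, with struct dmirror_disk,
dmirror_read_header(), dmirror_journal_foreach(), dmirror_replay() and
dmirror_copy_disk() all hypothetical:

/*
 * Sketch of recovery: pick the disk whose header counter is highest,
 * replay any journal entries newer than its header counter, then copy
 * that disk over the others.
 */
static int
dmirror_recover(struct dmirror_dev *dm)
{
	struct dmirror_disk *d, *best = NULL;

	TAILQ_FOREACH(d, &dm->disks, entries) {
		dmirror_read_header(d);				/* hypothetical */
		if (best == NULL || d->header.counter > best->header.counter)
			best = d;
	}

	/* Replay journal entries with counter > header->counter. */
	dmirror_journal_foreach(best, dmirror_replay, best);	/* hypothetical */

	/* Finally, sync the chosen disk onto all of the others. */
	TAILQ_FOREACH(d, &dm->disks, entries) {
		if (d != best)
			dmirror_copy_disk(best, d);		/* hypothetical */
	}
	return (0);
}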

Concretely, from dmirror_strategy:
/*
 * dmirror_strategy()
 *
 *	Initiate I/O on a dmirror VNODE.
 *
 *	READ:  disk_issue_read -> disk_read_bio_done -> (disk_issue_read)
 *
 *	The read-path uses push_bio to get a new BIO structure linked to
 *	the BUF and ties the new BIO to the disk and mirror it is issued
 *	on behalf of. The callback is set to disk_read_bio_done.
 *	In disk_read_bio_done, if the request succeeded, biodone() is called;
 *	if the request failed, the BIO is reinitialized with a new disk
 *	in the mirror and reissued till we get a success or run out of disks.
 *
 *	WRITE: disk_issue_write -> disk_write_bio_done(..) -> disk_write_tx_done
 *	
 *	The write path allocates a write group and transaction structures for
 *	each backing disk. It then sets up each transaction and issues them
 *	to the backing devices. When all of the devices have reported in,
 *	disk_write_tx_done finalizes the original BIO and deallocates the
 *	write group.
 */

A write group was the term for all of the state associated with a
single write to all of the devices. A write transaction was the term
for all of the state associated with a single write cycle to one disk.

Concretely for write groups and write transactions:

enum dmirror_write_tx_state {
	DMIRROR_START,
	DMIRROR_JOURNAL_WRITE,
	DMIRROR_JOURNAL_FLUSH,
	DMIRROR_DATA_WRITE,
	DMIRROR_DATA_FLUSH,
	DMIRROR_SUPER_WRITE,
	DMIRROR_SUPER_FLUSH,
};

A write transaction was guided through a series of states by issuing
I/O via vn_strategy() and transitioning on biodone() calls. At the
DMIRROR_START state, it was not yet issued to the disk, just freshly
allocated. Journal writes were issued and the tx entered the
DMIRROR_JOURNAL_WRITE state. When the journal writes completed, we
entered the JOURNAL_FLUSH state and issued a FLUSH bio. When the flush
completed, we entered the DATA_WRITE state; next the DATA_FLUSH state,
then the SUPER_WRITE and then the SUPER_FLUSH state. When the
superblock flushed, we walked to our parent write group and marked
this disk as successfully completing all of the necessary steps. When
all of the disks had reported, we finished the write group and finally
called biodone() on the original bio.
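
As a sketch, that biodone-driven progression might look like this; the
dmirror_issue_*() helpers are placeholders for the actual
journal/data/superblock/FLUSH submission code:

/*
 * Sketch: advance one write transaction to its next state.  Called once
 * when the tx is first issued and again from each biodone() callback.
 */
static void
dmirror_write_tx_advance(struct dmirror_write_tx *tx)
{
	switch (tx->state) {
	case DMIRROR_START:
		tx->state = DMIRROR_JOURNAL_WRITE;
		dmirror_issue_journal_write(tx);	/* placeholder */
		break;
	case DMIRROR_JOURNAL_WRITE:
		tx->state = DMIRROR_JOURNAL_FLUSH;
		dmirror_issue_flush_bio(tx);		/* placeholder */
		break;
	case DMIRROR_JOURNAL_FLUSH:
		tx->state = DMIRROR_DATA_WRITE;
		dmirror_issue_data_write(tx);		/* placeholder */
		break;
	case DMIRROR_DATA_WRITE:
		tx->state = DMIRROR_DATA_FLUSH;
		dmirror_issue_flush_bio(tx);
		break;
	case DMIRROR_DATA_FLUSH:
		tx->state = DMIRROR_SUPER_WRITE;
		dmirror_issue_super_write(tx);		/* placeholder */
		break;
	case DMIRROR_SUPER_WRITE:
		tx->state = DMIRROR_SUPER_FLUSH;
		dmirror_issue_flush_bio(tx);
		break;
	case DMIRROR_SUPER_FLUSH:
		/* All stages durable; report to the parent write group. */
		disk_write_tx_done(tx);
		break;
	}
}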

struct dmirror_write_tx {
	struct dmirror_write_group *write_group;
	struct bio 			bio;
	enum dmirror_write_tx_state	state;
};

The write_tx_done path was the biodone call for a single write
request. The embedded bio was initialized via initbiobuf().

enum dmirror_wg_state {
	DMIRROR_WRITE_OK,
	DMIRROR_WRITE_FAIL
};

struct dmirror_write_group {
	struct lock			lk;
	struct bio			*obio;
	struct dmirror_dev		*dmcfg; /* Parent dmirror */
	struct kref			ref;
	/* some kind of per-mirror linkages */
	/* some kind of per-disk linkages */
};

The write group tracked the state of a write to all of the devices;
the embedded lockmgr lock prevented concurrent write_tx_done()s from
operating. The obio pointer referred to the original write request.
The ref (kref no longer exists, so this would be a plain counter now)
was the number of outstanding devices. The per-mirror and per-disk
linkages allowed a fault on any I/O operation to a disk in the mirror
to prevent any future I/O from being issued to that disk; on a fault,
the code would walk all of that disk's outstanding requests and act as
though each write TX had finished with a B_ERROR buffer.
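
Roughly, the finalization might look like the sketch below; the
group's state field and the plain reference counter are assumptions
(per the kref note above), not settled code:

/*
 * Sketch: one disk's write tx has finished all of its stages (or
 * faulted).  When the last disk reports in, finish the original BIO.
 */
static void
disk_write_tx_done(struct dmirror_write_tx *tx)
{
	struct dmirror_write_group *wg = tx->write_group;
	int last;

	lockmgr(&wg->lk, LK_EXCLUSIVE);
	if (tx->bio.bio_buf->b_flags & B_ERROR)
		wg->state = DMIRROR_WRITE_FAIL;	/* assumed state field; fault this disk */
	last = (--wg->ref == 0);		/* assuming ref is a plain counter */
	lockmgr(&wg->lk, LK_RELEASE);

	if (last) {
		if (wg->state == DMIRROR_WRITE_FAIL)
			wg->obio->bio_buf->b_flags |= B_ERROR;
		biodone(wg->obio);		/* report to the filesystem above */
		/* free the per-disk txs and the write group itself */
	}
}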

The disk read path was simpler -- a single disk in the mirror was
selected and vn_strategy() called. The biodone callback checked
whether there was a read error; if so, we faulted the disk and kept
selecting mirror members to issue to until we found one that worked.
Each faulted disk had its outstanding I/Os killed.
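
A sketch of that retry loop as it might sit in disk_read_bio_done();
the disk linkage via bio_caller_info1 and the fault/next-disk helpers
are assumptions:

/*
 * Sketch: biodone callback for a read issued to one mirror member.
 * On error, fault that disk and reissue the read to the next member.
 */
static void
disk_read_bio_done(struct bio *bio)
{
	struct dmirror_disk *disk = bio->bio_caller_info1.ptr;	/* assumed linkage */
	struct bio *obio = bio->bio_prev;	/* the bio we pushed from */

	if ((bio->bio_buf->b_flags & B_ERROR) == 0) {
		biodone(obio);			/* success: hand the data back */
		return;
	}

	dmirror_fault_disk(disk);		/* placeholder: kill outstanding I/O */
	disk = dmirror_next_disk(disk);		/* placeholder: pick another member */
	if (disk == NULL) {
		obio->bio_buf->b_flags |= B_ERROR;	/* ran out of disks */
		biodone(obio);
		return;
	}
	bio->bio_buf->b_flags &= ~B_ERROR;	/* reset error state before reissue */
	vn_strategy(disk->vp, bio);		/* reissue to the next mirror member */
}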

I had not given thought to what to do when a mirror was running in a
degraded configuration or with an unsynced disk trying to catch up;
the latter requires care in that the unsynced disk can accept writes
but must not serve reads. Nor had I thought about how to live-remove a
disk, or how to track all of the disks in a mirror. (It'd be nice to
have each disk know all the other mirror components via UUID or
something, and to record the last counter value it knew about for each
other disk. That would prevent disasters where the disks of a mirror
are each run independently in degraded setups and then brought back
together.)

AFAIK, no RAID1 implementation is this paranoid (sample set: Linux MD,
gmirror, ccd). And it is a terrible design from a performance
perspective -- three FLUSH BIOs for every set of block writes. But it
does give you a hope of correctly recovering your RAID1 in the event
of a power cycle, crash, or disk failure...

Please tell me if this sounds crazy, overkill, or is just wrong! Or if
you want to work on this or would like to work on a classic bitmap +
straight mirror RAID1.

-- vs

[1]: A block bitmap of touched blocks requires care because you must
be sure that, before any block is touched, the bitmap has that block
marked -- sure in the sense that the bitmap block update has hit media.

[2]: I've never seen spelled out exactly what you can assume about
BUF_CMD_FLUSH (or BIO_FLUSH, as it might be known in other BSDs)...
this is a strong set of assumptions; I'd love to hear if I'm wrong.

[3]: UFS in DragonFly and in FreeBSD does not issue any FLUSH
requests. I have no idea how this can be correct... I'm pretty sure it
is not.




