GSOC: Device mapper mirror target

Thu Apr 7 09:48:14 PDT 2011

On Thu, 7 Apr 2011 11:57:37 -0400
Venkatesh Srinivas <me at endeavour.zapto.org> wrote:

> On Thu, Apr 7, 2011 at 10:27 AM, Adam Hoka <adam.hoka at gmail.com> wrote:
> > Please see my proposal:
> >
> > http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/ahoka/1#
> 
> Hi!
> 
> I'll take a look at your proposal in just a bit. Here are some things
> you might want to think about when looking at RAID1 though...
> 
> Here are some details about how I planned to do dmirror and why I
> think RAID1 is a much more difficult problem than it seems at first
> glance.
> 
> Imagine a RAID1 of two disks, A and B; you have an outstanding set of
> I/O operations, buf1, buf2, buf3, buf4*, buf5, buf6, buf7, buf8*. The
> BUFs are a mix of READ and WRITEs. At some point, your friendly
> neighborhood DragonFly developer walks over and pulls the plug on your
> system (you said you were running NetBSD! It's a totally valid excuse!
> :))

Yes, this is indeed an important design aspect.

> Each of the write bufs could be totally written, partially written, or
> not written at all to each of the disks. More importantly, each disk
> could have seen and completed (or not completed) the requests in a
> different order. And this reorder can happen after the buf has been
> declared done and biodone() has been called (and we've reported
> success to userland). This could be because of reordering or
> coalescing at the drive controller or in the drive, for example.

One thing I learned from writing the flash block driver: never biodone
I/O that is not really finished. You can do it, but it's a PITA to handle.
Also, it's simplest to do the writes synchronously; we can report success
when the fastest write is done, or wait for the slower one.

> So in this case, let's say disk A had seen and totally written buf2 and
> partially written buf1. Disk B had seen and totally written buf1 and
> not seen buf2. And we'd reported success to the filesystem above
> already.
> 
> So when we've chastised the neighborhood DragonFly developer and
> powered on the system, we have a problem. We have two halves of a RAID
> mirror that are not in sync. The simplest way to sync them would be to
> declare one of the two disks correct and copy one over the other
> (possibly optimizing the copy with a block bitmap, as you suggested
> and as Linux's MD raid1 (among many others) implement; block bitmaps
> are more difficult than they seem at first [1]).
> 
> So let's declare disk A as correct and copy it over disk B. Now, disk
> B's old copy of buf2->block is overwritten with the correct copy from
> disk A and disk B's correct, up-to-date copy of buf1->block is
> overwritten with a scrambled version of buf1->. This is not okay,
> because we'd already reported success at writing both buf1 and buf2 to
> the filesystem above.
> Oops.
> 
> This failure mode has always been possible in single-disk
> configurations where write reordering is possible; file systems have
> long had a solitary tool to fight the chaos, BUF_CMD_FLUSH. A FLUSH
> BUF acts as a barrier: it does not return until all prior requests
> have completed and hit media and does not allow requests from beyond
> the FLUSH point to proceed until all requests prior to the barrier are
> complete [2]. However the problem multi-disk arrays face is that disks
> FLUSH independently. [3: important sidebar if you run UFS]. A FLUSH on
> disk X says nothing about the state of disk Y and says nothing about
> selecting disk Y after power cycling.

NAND bad block tables are versioned much like what you describe in the
following: each copy carries a version number, which is incremented on
update. The increment only happens after a successful update. On
startup, you look for the highest version. A CRC could be useful too:
if the CRC doesn't match, it's a dirty block.
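
To make the version-plus-CRC idea concrete, here is a minimal userland
sketch of picking the newest valid copy on startup. The struct layout,
the toy checksum (standing in for a real CRC) and the function names are
all made up for illustration; this is not taken from any NAND driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-media table copy; field names are illustrative only. */
struct table_copy {
	uint32_t version;	/* bumped only after a successful update */
	uint32_t crc;		/* checksum over payload[] */
	uint8_t  payload[16];
};

/* Toy checksum standing in for a real CRC (assumption: any checksum works). */
static uint32_t
toy_crc(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + p[i];
	return sum;
}

/* Return the index of the valid copy with the highest version, or -1. */
static int
pick_newest_valid(const struct table_copy *copies, int ncopies)
{
	int best = -1;

	assert(ncopies >= 0);
	for (int i = 0; i < ncopies; i++) {
		if (toy_crc(copies[i].payload, sizeof(copies[i].payload)) !=
		    copies[i].crc)
			continue;	/* dirty copy: checksum mismatch */
		if (best < 0 || copies[i].version > copies[best].version)
			best = i;
	}
	return best;
}
```

A copy with a higher version but a bad checksum is skipped, so an
interrupted update falls back to the previous good copy.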

> ---
> 
> The dmirror design I was working on solved the problem through
> overwhelming force -- adding a physical journal and a header sector to
> each device. Each device would log all of the blocks it was going to
> write to the journal. It would then complete a FLUSH request to ensure
> the blocks had hit disk. Only then would we update the blocks we'd
> meant to. After we updated the target blocks, we would issue another
> FLUSH command. Then we'd update a counter in a special header sector.
> [assumption: writes to single sectors on disk are atomic and survive
> DragonFly developers removing power]. Each journal entry would contain
> (the value of the counter)+1 before the operations were complete. To
> know if a journal entry was correctly written, each entry would also
> include a checksum of the update it was going to carry out.
> 
> The recovery path would use the header's counter field to determine
> which disk was most current. It would then replay the necessary
> journal entries (entries with a counter > the header->counter) to
> bring that device into sync (perhaps it would only replay these into
> memory into overlay blocks, I'd not decided) and then sync that disk
> onto all of the others.
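
If I read the recovery path right, the two decisions it makes could be
sketched roughly like this in userland C. The struct layouts and
function names are hypothetical (nothing here comes from the dmirror
code), and the recomputed checksum is just passed in as a field to keep
the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical header sector contents. */
struct dm_header_sketch {
	uint64_t counter;	/* updated after data write + FLUSH */
};

/* Hypothetical journal entry. */
struct dm_jentry_sketch {
	uint64_t counter;	/* header counter + 1 when the entry was logged */
	uint32_t stored_sum;	/* checksum recorded in the entry */
	uint32_t actual_sum;	/* checksum recomputed over the logged data */
};

/* Pick the disk whose header counter is highest (ties: lowest index). */
static int
most_current_disk(const struct dm_header_sketch *hdrs, int ndisks)
{
	int best = 0;

	assert(ndisks > 0);
	for (int i = 1; i < ndisks; i++) {
		if (hdrs[i].counter > hdrs[best].counter)
			best = i;
	}
	return best;
}

/* An entry is replayed only if it is newer than the header and intact. */
static bool
needs_replay(const struct dm_header_sketch *hdr,
    const struct dm_jentry_sketch *e)
{
	return e->counter > hdr->counter && e->stored_sum == e->actual_sum;
}
```

An entry with a bad checksum is treated as never written, which matches
the "either the journal entry is complete or it does not count" idea.
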
> 
> Concretely, from dmirror_strategy:
> /*
>  * dmirror_strategy()
>  *
>  *	Initiate I/O on a dmirror VNODE.
>  *
>  *	READ:  disk_issue_read -> disk_read_bio_done -> (disk_issue_read)
>  *
>  *	The read-path uses push_bio to get a new BIO structure linked to
>  *	the BUF and ties the new BIO to the disk and mirror it is issued
>  *	on behalf of. The callback is set to disk_read_bio_done.
>  *	In disk_read_bio_done, if the request succeeded, biodone() is called;
>  *	if the request failed, the BIO is reinitialized with a new disk
>  *	in the mirror and reissued till we get a success or run out of disks.
>  *
>  *	WRITE: disk_issue_write -> disk_write_bio_done(..) -> disk_write_tx_done
>  *	
>  *	The write path allocates a write group and transaction structures for
>  *	each backing disk. It then sets up each transaction and issues them
>  *	to the backing devices. When all of the devices have reported in,
>  *	disk_write_tx_done finalizes the original BIO and deallocates the
>  *	write group.
>  */
> 
> A write group was the term for all of the state associated with a
> single write to all of the devices. A write transaction was the term
> for all of the state associated with a single write cycle to one disk.
> 
> Concretely for write groups and write transactions:
> 
> enum dmirror_write_tx_state {
> 	DMIRROR_START,
> 	DMIRROR_JOURNAL_WRITE,
> 	DMIRROR_JOURNAL_FLUSH,
> 	DMIRROR_DATA_WRITE,
> 	DMIRROR_DATA_FLUSH,
> 	DMIRROR_SUPER_WRITE,
> 	DMIRROR_SUPER_FLUSH,
> };
> 
> A write transaction was guided through a series of states by issuing
> I/O via vn_strategy() and transitioning on biodone() calls. At the
> DMIRROR_START state, it was not yet issued to the disk, just freshly
> allocated. Journal writes were issued and the tx entered the
> DMIRROR_JOURNAL_WRITE state. When the journal writes completed, we
> entered the JOURNAL_FLUSH state and issued a FLUSH bio. When the flush
> completed, we entered the DATA_WRITE state; next the DATA_FLUSH state,
> then the SUPER_WRITE and then the SUPER_FLUSH state. When the
> superblock flushed, we walked to our parent write group and marked
> this disk as successfully completing all of the necessary steps. When
> all of the disks had reported, we finished the write group and finally
> called biodone() on the original bio.
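
The state walk described above can be sketched as a plain transition
function. The enum is copied from the mail; dmirror_tx_next() and its
"finished" flag are hypothetical names of mine, and the real code would
issue I/O via vn_strategy() between the steps instead of just returning
the next state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dmirror_write_tx_state {
	DMIRROR_START,
	DMIRROR_JOURNAL_WRITE,
	DMIRROR_JOURNAL_FLUSH,
	DMIRROR_DATA_WRITE,
	DMIRROR_DATA_FLUSH,
	DMIRROR_SUPER_WRITE,
	DMIRROR_SUPER_FLUSH,
};

/*
 * Hypothetical helper: called when the I/O for state 'cur' has completed.
 * Returns the next state; *finished is set once the superblock flush is
 * done, i.e. when the tx should report to its parent write group.
 */
static enum dmirror_write_tx_state
dmirror_tx_next(enum dmirror_write_tx_state cur, bool *finished)
{
	assert(finished != NULL);
	*finished = false;

	switch (cur) {
	case DMIRROR_START:		return DMIRROR_JOURNAL_WRITE;
	case DMIRROR_JOURNAL_WRITE:	return DMIRROR_JOURNAL_FLUSH;
	case DMIRROR_JOURNAL_FLUSH:	return DMIRROR_DATA_WRITE;
	case DMIRROR_DATA_WRITE:	return DMIRROR_DATA_FLUSH;
	case DMIRROR_DATA_FLUSH:	return DMIRROR_SUPER_WRITE;
	case DMIRROR_SUPER_WRITE:	return DMIRROR_SUPER_FLUSH;
	case DMIRROR_SUPER_FLUSH:
		*finished = true;	/* all steps done for this disk */
		return DMIRROR_SUPER_FLUSH;
	}
	return cur;
}
```
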
> 
> struct dmirror_write_tx {
> 	struct dmirror_write_group *write_group;
> 	struct bio 			bio;
> 	enum dmirror_write_tx_state	state;
> };
> 
> The write_tx_done path was the biodone call for a single write
> request. The embedded bio was initialized via initbiobuf().
> 
> enum dmirror_wg_state {
> 	DMIRROR_WRITE_OK,
> 	DMIRROR_WRITE_FAIL
> };
> 
> struct dmirror_write_group {
> 	struct lock			lk;
> 	struct bio			*obio;
> 	struct dmirror_dev		*dmcfg; /* Parent dmirror */
> 	struct kref			ref;
> 	/* some kind of per-mirror linkages */
> 	/* some kind of per-disk linkages */
> };
> 
> The write group tracked the state of a write to all of the devices;
> the embedded lockmgr lock prevented concurrent write_tx_done()s from
> operating. The bio ptr was to the original write request. The ref
> (kref no longer exists, so this would be a counter now) was the number
> of outstanding devices. The per-mirror and per-disk linkages allowed a
> fault on any I/O operation to a disk in the mirror to prevent any
> future I/O from being issued to that disk; the code on fault would
> walk all of the requests and act as though that particular write TX
> finished with a B_ERROR buffer.
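
With kref gone, the "number of outstanding devices" bookkeeping might
look something like the sketch below. It is single-threaded userland C
(the lockmgr lock is left out), the struct and function names are
hypothetical, and the policy that a write group succeeds if at least one
disk succeeded is my assumption; the mail does not say.

```c
#include <assert.h>
#include <stdbool.h>

enum dmirror_wg_state {
	DMIRROR_WRITE_OK,
	DMIRROR_WRITE_FAIL
};

/* Hypothetical, stripped-down write group: counters only. */
struct wg_sketch {
	int			outstanding;	/* disks not yet reported */
	int			ok_count;	/* disks that completed */
	enum dmirror_wg_state	state;
};

static void
wg_init(struct wg_sketch *wg, int ndisks)
{
	assert(ndisks > 0);
	wg->outstanding = ndisks;
	wg->ok_count = 0;
	wg->state = DMIRROR_WRITE_FAIL;
}

/*
 * Called once per disk when its write tx finishes (or is killed with an
 * error because the disk faulted). Returns true when the last disk has
 * reported, i.e. when the caller should biodone() the original bio.
 */
static bool
wg_tx_done(struct wg_sketch *wg, bool io_error)
{
	assert(wg->outstanding > 0);
	if (!io_error)
		wg->ok_count++;
	if (--wg->outstanding > 0)
		return false;
	/* Assumption: one surviving copy is enough to report success. */
	wg->state = (wg->ok_count > 0) ? DMIRROR_WRITE_OK : DMIRROR_WRITE_FAIL;
	return true;
}
```
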
> 
> The disk read path was simpler -- a single disk in the mirror was
> selected and vn_strategy() called. The biodone callback checked if
> there was a read error; if so, we faulted the disk and continued
> selecting mirrors to issue to until we found one that worked. Each
> faulted disk had outstanding I/Os killed.
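
The read retry loop, as I understand it, reduces to something like this.
It is a synchronous userland sketch with hypothetical names; the real
path is asynchronous and reissues from the biodone callback, and killing
the outstanding I/Os of a faulted disk is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-disk state for the sketch. */
struct disk_sketch {
	bool faulted;		/* no further I/O is issued to this disk */
	bool read_fails;	/* simulated outcome of a read on this disk */
};

/*
 * Try each non-faulted disk in turn. A disk that returns a read error is
 * faulted and the read is reissued to the next one. Returns the index of
 * the disk that served the read, or -1 if we ran out of disks.
 */
static int
mirror_read(struct disk_sketch *disks, int ndisks)
{
	assert(ndisks >= 0);
	for (int i = 0; i < ndisks; i++) {
		if (disks[i].faulted)
			continue;
		if (!disks[i].read_fails)
			return i;
		disks[i].faulted = true;
	}
	return -1;
}
```
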
> 
> I had not given thought as to what to do when a mirror was running in
> a degraded configuration or with an unsynced disk trying to catch up;
> the latter requires care in that the unsynced disk can serve reads but
> not writes. Nor had I thought about how to live-remove a disk, or how
> to track all of the disks in a mirror. (It'd be nice to have each disk
> know all the other mirror components via UUID or something and to
> record the last counter val it knew about for the other disk. This
> will prevent disasters where each disk in a mirror is run
> independently in a degraded setup and then brought back together.)
> 
> AFAIK, no RAID1 is this paranoid (sample set: Linux MD, Gmirror, ccd).
> And it is a terrible design from a performance perspective -- 3 FLUSH
> BIOs for every set of block writes. But it does give you a hope of
> correctly recovering your RAID1 in the event of a powercycle, crash,
> or disk failure...
> 
> Please tell me if this sounds crazy, overkill, or is just wrong! Or if
> you want to work on this or would like to work on a classic bitmap +
> straight mirror RAID1.

IMHO this is overkill, as you want to guarantee things that are not
the job of this layer of abstraction. :-)

> -- vs
> 
> [1]: A block bitmap of touched blocks requires care because you must
> be sure that before any block is touched, the bitmap has that block
> marked. Sure in the sense that the bitmap block update has hit media.
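
The ordering rule in [1] can be modelled in a few lines. This is an
in-memory sketch with hypothetical names: the "flush" is just a copy
plus a counter, standing in for writing the bitmap block and waiting
for a FLUSH to complete, and the bitmap is limited to 64 blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dirty-block bitmap covering 64 blocks. */
struct bitmap_sketch {
	uint64_t in_memory;	/* bits set in RAM */
	uint64_t on_media;	/* bits known to have hit media */
	int      flushes;	/* simulated bitmap write + FLUSH cycles */
};

/*
 * Must be called before the data write to 'block' is issued. When it
 * returns, the block's bit is durable, so a crash during the data write
 * still leaves the block marked for resync.
 */
static void
bitmap_mark_before_write(struct bitmap_sketch *bm, unsigned block)
{
	uint64_t bit;

	assert(block < 64);
	bit = (uint64_t)1 << block;
	bm->in_memory |= bit;
	if ((bm->on_media & bit) == 0) {
		/* Stand-in for: write bitmap block, issue FLUSH, wait. */
		bm->on_media = bm->in_memory;
		bm->flushes++;
	}
}
```

Repeated writes to an already-marked block skip the flush, which is
where a bitmap saves work compared to journaling every write.
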
> 
> [2]: I've never seen exactly what you can assume about
> BUF_CMD_FLUSH (or BIO_FLUSH as it might be known as in other BSDs)...
> this is a strong set of assumptions, I'd love to hear if I'm wrong.
> 
> [3]: UFS in DragonFly and in FreeBSD does not issue any FLUSH
> requests. I have no idea how this can be correct... I'm pretty sure it
> is not.

-- 
NetBSD - Simplicity is prerequisite for reliability