Improving I/O efficiency and resilience

Matthew Dillon dillon at backplane.com
Wed Jun 16 10:13:06 PDT 2021


There are a lot of potential failure points; it's a long chain of software
and hardware from the application's behavior all the way down to the
storage device's behavior in a failure.  Failure paths tend not to be
well-tested.  Reliability guarantees are... kinda just a pile of nonsense,
really; there are just too many moving parts, so all systems in the world
rely on stability first (i.e. not crashing, not failing in the first
place).  Redundancy mechanisms improve matters up to a point, but they also
introduce further complexities.

This should be readily apparent to everyone, since nearly every service in
existence sees regular glitches.  Be it Google (GMail and Google Docs, for
example, glitch out all the time), brokerage, bank, ATMs, whatever.
Fail-over subsystems can twist themselves into knots when just the wrong
sequence of events occurs.  There is a limit to just how reliable one can
make something.

For an application, ultimately the best guarantee is to have an
application-specific remote log that can be replayed to restore corrupted
state.  That is, to not entirely rely on localized fail-over, storage, or
other redundancy mechanisms.  One then relies on the near impossibility of
the dedicated remote log machine crashing and burning at exactly the same
time the primary servers crash and burn.
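To make that concrete, here is a minimal userland sketch of what such a
replayable remote log might look like.  This is hypothetical, not any
particular product's API; the record format, the socket fd, and the
apply_record() callback are all placeholders for whatever the application
actually needs:

    /*
     * Hypothetical sketch only: the record format, the socket fd and
     * apply_record() are placeholders, not a real product's API.
     */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    struct logrec {
        uint32_t len;                /* payload length */
        uint32_t type;               /* application-defined op type */
        char     payload[512];
    };

    /* Append one record to the remote log; fd is a TCP connection to
     * the dedicated log machine.  A real version would also handle
     * short writes and wait for an ack from the far side. */
    static int
    log_append(int fd, uint32_t type, const void *data, uint32_t len)
    {
        struct logrec r;
        const size_t hdrlen = sizeof(r.len) + sizeof(r.type);

        if (len > sizeof(r.payload))
            return -1;
        r.len = len;
        r.type = type;
        memcpy(r.payload, data, len);
        if (write(fd, &r, hdrlen + len) != (ssize_t)(hdrlen + len))
            return -1;
        return 0;
    }

    /* Recovery: read the records back in order and re-apply them on
     * top of the last known-good snapshot via the application's own
     * idempotent replay hook. */
    static void
    log_replay(int fd,
               void (*apply_record)(uint32_t, const void *, uint32_t))
    {
        struct logrec r;
        const size_t hdrlen = sizeof(r.len) + sizeof(r.type);

        while (read(fd, &r, hdrlen) == (ssize_t)hdrlen &&
               r.len <= sizeof(r.payload)) {
            if (read(fd, r.payload, r.len) != (ssize_t)r.len)
                break;
            apply_record(r.type, r.payload, r.len);
        }
    }

The point is simply that the log lives on a different machine, is
append-only, and can be replayed in order on top of a known-good snapshot
to reconstruct state.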

For HAMMER2, well... our failure paths are not well tested, as with most
other filesystems.  Usually I/O failures are simulated for testing, but
actual storage system failures can present different, misleading
(false-flag) behaviors.  What HAMMER2 does is flush in two stages.  In the
first stage it asynchronously writes all dirty blocks except the volume
header (it is a block copy-on-write filesystem, so writing dirty blocks
does not modify the originals).  Then it waits for those asynchronous
writes to complete.  Then it issues a device flush.  And finally it writes
out an updated volume header.  Any system crash occurring prior to the
writing out of the updated volume header simply restores the filesystem to
its pre-flush state upon reboot, because the old volume header does not
directly or indirectly point to any of the new blocks.
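As a userland analogy of that ordering (this is not the HAMMER2 kernel
code; fsync() stands in for the device cache flush and the "volhdr" struct
for the volume header at offset 0):

    /*
     * Userland analogy of the two-stage flush described above, not
     * the HAMMER2 kernel code.
     */
    #include <stdint.h>
    #include <unistd.h>

    struct volhdr {
        uint64_t root_offset;        /* where the current topology lives */
        uint64_t generation;
    };

    static int
    commit(int fd, const void *newblocks, size_t len, off_t newoff,
           uint64_t generation)
    {
        struct volhdr hdr;

        /* Stage 1: write the new blocks to unused space (COW), so the
         * blocks the old header points at are never overwritten. */
        if (pwrite(fd, newblocks, len, newoff) != (ssize_t)len)
            return -1;

        /* Make sure the new blocks are on stable storage before any
         * header can possibly reference them. */
        if (fsync(fd) < 0)
            return -1;

        /* Stage 2: only now point the header at the new blocks.  A
         * crash anywhere before this write leaves the old header
         * referencing the old, fully consistent topology. */
        hdr.root_offset = (uint64_t)newoff;
        hdr.generation = generation;
        if (pwrite(fd, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr))
            return -1;
        return fsync(fd);
    }

The crash-consistency property falls out of the ordering alone: until the
final header write reaches the media, everything reachable from the old
header still references only old, unmodified blocks.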

And for DFly, an async block write failure leaves the buffer marked dirty,
so the filesystem data and meta-data state remains consistent on the live
system (even if it cannot be flushed).  This is a choice taken from a list
of bad choices, because leaving a block dirty means that dirty blocks can
build up in RAM until you run out of RAM.  But it is better than the
alternative (presenting stale data to a filesystem and/or to an application,
which then causes a chain reaction of corruption on a running system).
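A toy version of that policy (again just a sketch, not the actual
DragonFly buffer cache code; devwrite() is a placeholder for the driver's
strategy routine):

    /*
     * Toy illustration of the policy, not the DragonFly buffer cache:
     * a failed write leaves the buffer dirty and requeued instead of
     * invalidating it.
     */
    #include <stdbool.h>
    #include <stddef.h>

    struct buf {
        void       *data;
        size_t      len;
        long long   offset;
        bool        dirty;
        struct buf *next;            /* dirty-queue linkage */
    };

    static void
    flush_one(struct buf **dirtyq, struct buf *bp,
              int (*devwrite)(const struct buf *))
    {
        if (devwrite(bp) == 0) {
            bp->dirty = false;       /* made it to the device */
        } else {
            /*
             * Write failed: keep the in-memory copy as the
             * authoritative one and put the buffer back on the dirty
             * queue.  Dirty data can pile up in RAM, but the live
             * system never sees stale on-disk contents instead.
             */
            bp->dirty = true;
            bp->next = *dirtyq;
            *dirtyq = bp;
        }
    }

The failed buffer goes back onto the dirty queue; the live system keeps
reading the correct in-memory copy while a retry (or the admin) deals with
the device.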

But realistically, even the most sophisticated fault-tolerant systems hit
situations which require manual intervention.  There are just too many
moving parts in a modern system that depend on a multitude of behaviors
that are specified by standards but not necessarily followed at every
stage.  So, ultimately, the best protection remains having
application-level redundancy via a replayable remote log (versus kernel,
filesystem, or block-level redundancy).  Other forms of redundancy can
reduce error rates but cannot eliminate them, and ultimately reach a point
where the new potential failure conditions introduced by the added
sophistication exceed the failure conditions that are being protected
against.

Also, redundancies can introduce points of attack.  If you want to crater
the performance of a competitor through hacking, the redundancy subsystems
offer a tempting target.

Almost universally, even commercial systems rely on stability, and the
added redundancies are only able to deal with a subset of 'common'
problems on a live system.  And then they fall back to replaying logs to
restore otherwise unrecoverably corrupted state.

-Matt