Notes on disk flush commands and why they are important.

Sat Jan 9 14:46:20 PST 2010

    I am going to use this opportunity to comment a bit on why issuing the
    disk flush command is so important for meta-data updates on filesystems
    which implement instant recovery after a crash.  FreeBSD does not
    issue the command virtually anywhere.  Not on UFS fsync(), not with
    softupdates(), and apparently not with the new journaling code which
    I am happy to see is making progress.

    Rather than use the thread on the FreeBSD lists to comment, which
    would surely lead to a flame war and devalue the excellent work being
    done in FreeBSD to add journaling, I am going to describe the issue
    here.

    Writes to disk can be broken down into two categories:

    (1) Meta-data and other structural media writes.

    (2) Log writes in general.  Forward log logical operations,
	such as write()+fsync(), which do not (necessarily) have to
	involve meta-data updates on-media in order to be flushable,
	also known as an intent log.

	And also UNDO (rollback) log operations and journaling (forward
	log operations for meta-data, similar to an intent log but
	not operating as a logical layer).

	And other methods.
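
    To make the forward (intent) log idea in (2) a little more concrete,
    here is a minimal sketch.  It is not taken from HAMMER or UFS; the
    record layout and the names encode_record() and replay() are invented
    for illustration.  Each logical write is stored as a self-checking
    record, and recovery applies records in order and stops at the first
    torn or corrupt one, so an incomplete tail looks as if it was never
    written.

```python
import struct
import zlib

# Hypothetical record header: file offset, payload length, CRC32 of payload.
# A real log would also protect the header itself and carry sequence numbers.
HEADER = struct.Struct("<QII")

def encode_record(offset, payload):
    """Pack one logical write as a self-checking log record."""
    return HEADER.pack(offset, len(payload), zlib.crc32(payload)) + payload

def replay(log, file_image):
    """Apply intact records in order; stop at the first torn or corrupt one.

    `log` is bytes, `file_image` is a bytearray standing in for the file.
    Returns the number of records applied.
    """
    pos = 0
    applied = 0
    while pos + HEADER.size <= len(log):
        offset, length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # everything from here on is treated as never written
        end = offset + length
        if end > len(file_image):
            file_image.extend(bytes(end - len(file_image)))
        file_image[offset:end] = payload
        pos = start + length
        applied += 1
    return applied

log = encode_record(0, b"hello ") + encode_record(6, b"world")
image = bytearray()
replay(log[:-2], image)  # the last record is torn, so only the first applies
```

    Stopping at the first bad record is what keeps recovery strictly
    ordered: a later record that happens to be intact is still ignored.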

    Meta-data on the physical media presents a recovery burden after a
    crash.  This is why UFS has fsck, this is why softupdates was written,
    why HAMMER has an UNDO FIFO, and why the new UFS journaling code is
    being worked on in FreeBSDland.

    Different structural forms for meta-data have varying degrees of
    fragility.  For example the single-B-Tree mechanic that HAMMER uses
    is more fragile than the cylinder group / blockmap mechanic that UFS
    uses.

    When flushing data to a hard drive it is vastly important to minimize
    the meta-data corruption which might occur due to a crash.  This is
    what softupdates tries to do, for example.  HAMMER's somewhat more
    fragile meta-data structures on-media require us to 100% eliminate
    any possibility of meta-data corruption, and we use disk flush
    commands to actually flush the disk cache onto the media to separate
    the flush stages.  It could be argued that UFS's somewhat less fragile
    format can make do without issuing such flushes, but even so people
    have lost filesystems to softupdates, and in the general case as
    storage gets larger and larger the margin for error gets smaller and
    smaller.  You just CANNOT AFFORD for a filesystem to go bad due to an
    event (crash, powerdown, etc.) on a multi-terabyte filesystem.

    It is considerably LESS important to avoid data loss when the only
    data loss possible is the last few write()'s to a file, as long as
    the entire previous state of the contents of the file can be recovered
    (as if those write()'s did not occur), with no misordered or partial
    writes, and no meta-data is lost.

    This less important data loss case is the one which most BSDs, including
    FreeBSD and DragonFly, use for UFS write()+fsync().  Under UFS an
    fsync() does not issue a media flush; it simply issues the I/O
    and leaves the data sitting in the drive cache.  HAMMER defaults to full
    synchronization semantics (and as I said I will be adding a sysctl
    to allow the particular write()+fsync() case to devolve to just a
    log-write without a full disk sync command).
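
    From the application side, the write()+fsync() case looks roughly
    like the sketch below.  The helper name write_and_sync() is made up
    for illustration.  The point is that a successful fsync() only says
    the kernel has issued the I/O; whether the drive's cache has been
    flushed to the media depends on the OS, the filesystem and the
    device.  macOS exposes a separate F_FULLFSYNC fcntl that also asks
    the drive to flush, so the sketch uses it where Python's fcntl module
    provides it and falls back to plain fsync() elsewhere.

```python
import os

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None

def write_and_sync(path, data):
    """Append data to path, then ask the OS to push it toward stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write() may write fewer bytes than requested
            view = view[os.write(fd, view):]
        full_sync = getattr(fcntl, "F_FULLFSYNC", None)
        if full_sync is not None:
            try:
                fcntl.fcntl(fd, full_sync)  # macOS: also asks the drive to flush
                return
            except OSError:
                pass  # not supported on this filesystem; fall back to fsync()
        os.fsync(fd)  # may leave the data in the drive's volatile write cache
    finally:
        os.close(fd)
```

    Nothing in this sketch can verify that the data actually reached
    the media; that guarantee has to come from the layers underneath.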

					--

    Ok, now my comment on UFS, softupdates, and the new journaling work
    being done in FreeBSD.  Here's my comment:

	"Kudos on the work!  But for god's sake implement proper disk
	synchronization mechanics!".

    Here's why:

    * You can implement the most important mechanics, those for
      database-style write()/fsync() operations, using only your
      journal with relaxed media flush requirements without endangering
      any meta-data.  i.e. anything related to meta-data would always
      use full meta flush mechanics.  In other words, you can have your
      cake and eat it too!  So don't stop with it 90% done.

    * All my comments above on the fragility of meta-data updates in the
      face of out of order commits to disk, which is what you will get,
      apply.  A simple write()/fsync() operation with a forward log
      could use relaxed semantics, but you are playing with fire if you
      try to do that with meta-data updates.

      Softupdates ALREADY assumes ordering between flush groups, and
      this has frankly bitten me on numerous occasions in past years.
      That is, it waits for X parallel I/O's to complete before initiating
      the next block of Y parallel I/O's.  This is ALREADY broken to some
      degree.  This is ALREADY too fragile.  Don't make it *MORE* fragile
      by assuming ordering between journal updates and meta-data updates
      queued by softupdates.  JOURNAL1 -> SOFT1 -> JOURNAL2 -> SOFT2
      could end up being ordered on the disk:  SOFT2 -> SOFT1 -> JOURNAL2 ->
      (partial) JOURNAL1.  In that example, since the journal is strictly
      ordered from a recovery standpoint, the journal will be empty.

      In another example, say the actual order of the writes to the
      media is SOFT2 -> SOFT1 -> JOURNAL1 -> JOURNAL2.  Now your journal
      is trying to undo operations related to SOFT1 that may have already
      been overwritten by SOFT2 for which no journal exists.

      Too much fire.

    * A large number of your installations will be running systems without
      a UPS or without shutdown signaling mechanics.  The enterprise systems
      will not, but these operating systems are not designed JUST for
      enterprise use.  How about the home client or server?  What about
      turnkey systems trying to minimize costs?

    * As drives age and start to use more remapped sectors, write flushes
      take longer.  The longer write flushes take, the higher the probability
      that you will lose data sitting in the drive's write cache.

    * Intermediate caching (iSCSI devices running on UNIX, for example).
      It is impossible to optimize those operations if the targets cannot
      make any assumptions with regards to synchronization mechanics,
      requiring fully synchronized writes for each I/O individually.

    * Port-powered devices.  I'd mention USB but USB doesn't handle the
      disk sync command very well anyway, but there are numerous
      plug'n'play E-SATA devices which while separately powered
      provide a means of quickly disconnecting the device.  Hmm,
      I think E-SATA disk keys exist now too, in fact.

      The easier it is to disconnect the device, the higher the chance of
      the device getting disconnected at a bad time, including power
      (hot-swap), UPS or not.

      Particularly for port-multipliers, and also for SSDs or any
      externalizable device, it is far easier than you might imagine to
      depower a device accidentally.  Human error.

    * Battery-backed RAID systems are nice, and expensive, but that's no
      reason to throw away the more typical installation where the drive
      cache is used directly.

      This is particularly true for people using SSDs.  Sure, a few years
      from now I expect most SSDs will be able to flush unwritten dirty data
      to local flash.  It hasn't happened yet and it doesn't help with
      layered caches in the storage path anyway.

      I will reiterate that when one is playing with multi-terabyte
      filesystems, the margin for error is significantly reduced.
      Power loss events WILL OCCUR.  Firmware crashes WILL OCCUR.
      Power supplies still blow up.  It makes no sense to ignore these
      sources of error.

    * UPSes are great, I have one... but properly powering down systems
      attached to a UPS is actually not trivial.  In all my years using
      UPSes through power failures, systems have only powered down properly
      75% of the time.  The other 25%... those wound up being hard
      power-downs.  Bye bye disk cache (and often bye-bye drive, but
      that's another matter).

    * NFS and other multi-layered filesystems depend on proper
      synchronization mechanics for reboot recovery to work
      properly.  The more layers you have, the more likely something
      will break and all your assumptions will go flying out the
      window.

    * VMs can't cache or optimize I/O's if you are forced to use
      sync-to-media for every I/O because you can't depend on disk
      flushing.  For example, a FreeBSD client running on a Linux
      host.  Goodbye intermediate caching layer if the Linux host
      dies.

      I've run VMs on Windows boxes where the Windows box hardlocks
      and completely destroys the 'drive cache'.  I found out the
      hard way that some VMs ignore the disk synchronization command,
      even!  But I don't expect that to last long as VMs become more
      important.
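
    The JOURNAL1 -> SOFT1 -> JOURNAL2 -> SOFT2 reordering problem from
    the list above can be sketched with a toy model.  The model is an
    assumption, not a description of any real drive: writes sit in a
    volatile cache and may reach the media in any order, and a flush
    forces everything cached so far onto the media before later writes
    are issued.  Partial (torn) writes are not modeled.  The names
    crash_states() and journal_always_first() are invented for
    illustration.

```python
from itertools import combinations

FLUSH = "FLUSH"

def crash_states(ops):
    """Return every set of writes that could be on the media after a crash.

    `ops` is a list of write tags and FLUSH markers, in issue order.
    A crash may happen before any op or after the last one.
    """
    states = set()
    durable = []  # writes known to be on the media
    cached = []   # writes still in the volatile drive cache
    for op in ops + [None]:  # None stands for "crash after the last op"
        # A crash here keeps `durable` plus any subset of `cached`.
        for k in range(len(cached) + 1):
            for survivors in combinations(cached, k):
                states.add(frozenset(durable) | frozenset(survivors))
        if op == FLUSH:
            durable += cached
            cached = []
        elif op is not None:
            cached.append(op)
    return states

def journal_always_first(state):
    """True if every SOFTn on the media has its JOURNALn on the media too."""
    return all("JOURNAL" + w[4:] in state
               for w in state if w.startswith("SOFT"))

unflushed = ["JOURNAL1", "SOFT1", "JOURNAL2", "SOFT2"]
flushed = ["JOURNAL1", FLUSH, "SOFT1", FLUSH,
           "JOURNAL2", FLUSH, "SOFT2", FLUSH]

bad = [s for s in crash_states(unflushed) if not journal_always_first(s)]
ok = all(journal_always_first(s) for s in crash_states(flushed))
```

    Without flushes, `bad` is non-empty: for instance SOFT2 can be on
    the media with no journal record at all.  With a flush separating
    each stage, every reachable state is a prefix of the issue order,
    so `ok` is True.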

    Basically it comes down to (1) Retaining the ability for devices and
    intermediate platforms to properly cache and optimize write I/O,
    (2) The fallacy of the assumption that nothing matters unless caches
    are battery-backed, and (3) In large scale systems the assumptions
    for data integrity have extremely serious consequences if that
    promise of data integrity turns out to be not quite true in all
    circumstances.  Human error guarantees that.

    So I would not-so-humbly suggest that proper media flush semantics
    be implemented for any UFS journaling implementation, particularly
    one done on top of softupdates.  PARTICULARLY if you want to get rid
    of fsck for real.  For meta-data, of course.  write()+fsync()
    operations which can be flushed with a single log entry and no meta-data
    writes could use relaxed semantics (though IMHO not as a default).

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>