Notes on disk flush commands and why they are important.
dillon at apollo.backplane.com
Sat Jan 9 14:46:20 PST 2010
I am going to use this opportunity to comment a bit on why issuing the
disk flush command is so important for meta-data updates on filesystems
which implement instant recovery after a crash. FreeBSD issues the
command virtually nowhere: not on UFS fsync(), not with softupdates,
and apparently not with the new journaling code, which I am happy to
see is making progress.
Rather than use the thread on the FreeBSD lists to comment, which
would surely lead to a flame war and devalue the excellent work being
done in FreeBSD to add journaling, I am going to describe the issue here.
Writes to disk can be broken down into two categories:
(1) Meta-data and other structural media writes.
(2) Log writes in general. This includes forward logs of logical
operations (also known as intent logs), such as write()+fsync(),
which do not necessarily have to involve meta-data updates on-media
in order to be flushable. It also includes UNDO (rollback) logs and
journaling (forward logs for meta-data, similar to an intent log but
not operating as a logical layer), among other methods.
Meta-data on the physical media presents a recovery burden after a
crash. This is why UFS has fsck, this is why softupdates was written,
why HAMMER has an UNDO FIFO, and why the new UFS journaling code is
being worked on in FreeBSDland.
Different structural forms for meta-data have varying degrees of
fragility. For example the single-B-Tree mechanic that HAMMER uses
is more fragile than the cylinder group / blockmap mechanic that UFS uses.
When flushing data to a hard drive it is vastly important to minimize
the meta-data corruption which might occur due to a crash. This is
what softupdates tries to do, for example. HAMMER's somewhat more
fragile meta-data structures on-media require us to 100% eliminate
any possibility of meta-data corruption, and we use disk flush
commands to actually flush the disk cache onto the media to separate
the flush stages. It could be argued that UFS's somewhat less fragile
format can make do without issuing such flushes. Even so, people
have lost filesystems to softupdates, and in the general case the
margin for error gets smaller and smaller as storage gets larger and
larger. You just CANNOT AFFORD for a filesystem to go bad due to an
event (crash, powerdown, etc) on a multi-terabyte filesystem.
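To make the idea of separating flush stages concrete, here is a minimal
sketch in Python. It is a toy model, not HAMMER's or UFS's actual code:
the ToyDrive class, the "log" and "meta" block names, and the 50% survival
chance per cached write are all assumptions made up for illustration.

```python
import random


class ToyDrive:
    """Toy model of a drive with a volatile write cache (illustration only)."""

    def __init__(self):
        self.media = {}   # blocks that have reached stable storage
        self.cache = []   # (block, value) writes still in the volatile cache

    def write(self, block, value):
        # The drive acknowledges the write as soon as it is cached.
        self.cache.append((block, value))

    def flush(self):
        # Flush command: everything cached reaches the media before returning.
        for block, value in self.cache:
            self.media[block] = value
        self.cache.clear()

    def crash(self, rng):
        # Power loss: assume the drive committed an arbitrary subset of the
        # cached writes, in arbitrary order; the rest are lost.
        survivors = [w for w in self.cache if rng.random() < 0.5]
        rng.shuffle(survivors)
        for block, value in survivors:
            self.media[block] = value
        self.cache.clear()


def staged_update(drive, use_flush):
    # Stage 1: record how to recover (an undo or journal record).
    drive.write("log", "undo-record")
    if use_flush:
        drive.flush()     # barrier separating the two stages
    # Stage 2: modify the meta-data in place.
    drive.write("meta", "new")


def unsafe_outcomes(use_flush, trials=1000, seed=0):
    # Count crashes that leave new meta-data on media with no log record.
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        drive = ToyDrive()
        staged_update(drive, use_flush)
        drive.crash(rng)
        if "meta" in drive.media and "log" not in drive.media:
            bad += 1
    return bad
```

In this model unsafe_outcomes(True) is always zero, because the log record
is on the media before the meta-data write is even queued. Without the
flush, some crashes leave the new meta-data on the media with nothing to
recover it from.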
It is considerably LESS important to avoid data loss when the only
data loss possible is the last few write()'s to a file, as long as
the entire previous state of the contents of the file can be recovered
(as if those write()'s did not occur), with no misordered or partial
writes, and no meta-data is lost.
This less important data loss case is the one which most BSDs, including
FreeBSD and DragonFly, use for UFS write()+fsync(). Under UFS an
fsync() does not issue a media flush; it simply issues the I/O
and leaves the data sitting in the drive cache. HAMMER defaults to full
synchronization semantics (and as I said I will be adding a sysctl
to allow the particular write()+fsync() case to devolve to just a
log-write without a full disk sync command).
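For reference, the write()+fsync() pattern under discussion looks roughly
like this from userland. This is only a sketch using Python's os module;
the function name append_and_fsync is made up for illustration. Nothing
in the code can tell whether the drive cache was actually flushed, which
is exactly the distinction drawn above.

```python
import os


def append_and_fsync(path, payload):
    """Append payload to path, then ask the OS to push it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        remaining = memoryview(payload)
        while remaining:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # fsync() asks the OS to write out this file's dirty data. Whether
        # the drive's own write cache is also flushed to the media depends
        # on the operating system and filesystem underneath.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller would use it as append_and_fsync("db.log", b"record\n") and
treat a normal return as "the OS has issued the I/O", not as proof that
the bytes are on the media.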
Ok, now my comment on UFS, softupdates, and the new journaling work
being done in FreeBSD. Here's my comment:
"Kudos on the work! But for God's sake implement proper disk
* You can implement the most important mechanics, those for
database-style write()/fsync() operations, using only your
journal with relaxed media flush requirements without endangering
any meta-data. That is, anything related to meta-data would always
use full meta flush mechanics. In other words, you can bake your
cake and eat it too! So don't stop with it 90% done.
* All my comments above on the fragility of meta-data updates in the
face of out of order commits to disk, which is what you will get,
apply. A simple write()/fsync() operation with a forward log
could use relaxed semantics, but you are playing with fire if you
try to do that with meta-data updates.
Softupdates ALREADY assumes ordering between flush groups, and
this has frankly bitten me on numerous occasions in past years.
That is, it waits for X parallel I/O's to complete before initiating
the next block of Y parallel I/O's. This is ALREADY broken to some
degree. This is ALREADY too fragile. Don't make it *MORE* fragile
by assuming ordering between journal updates and meta-data updates
queued by softupdates. JOURNAL1 -> SOFT1 -> JOURNAL2 -> SOFT2
could end up being ordered on the disk: SOFT2 -> SOFT1 -> JOURNAL2 ->
(partial) JOURNAL1. In that example, since the journal is strictly
ordered from a recovery standpoint, the journal will be empty.
In another example, say the actual order of the writes to the
media is SOFT2 -> SOFT1 -> JOURNAL1 -> JOURNAL2. Now your journal
is trying to undo operations related to SOFT1 that may have already
been overwritten by SOFT2 for which no journal exists.
Too much fire.
* A large number of your installations will be running systems without
a UPS or without shutdown signaling mechanics. The enterprise systems
will not, but these operating systems are not designed JUST for
enterprise use. How about the home client or server? What about
turnkey systems trying to minimize costs?
* As drives age and start to use more remapped sectors, write flushes
take longer. The longer write flushes take the higher the probability
that you will lose data sitting in the drive's write cache.
* Intermediate caching (iSCSI devices running on UNIX, for example).
It is impossible to optimize those operations if the targets cannot
make any assumptions with regards to synchronization mechanics,
requiring fully synchronized writes for each I/O individually.
* Port-powered devices. I'd mention USB but USB doesn't handle the
disk sync command very well anyway, but there are numerous
plug'n'play E-SATA devices which while separately powered
provide a means of quickly disconnecting the device. Hmm,
I think E-SATA disk keys exist now too, in fact.
The easier it is to disconnect the device, the higher the chance of
the device getting disconnected at a bad time, including power
(hot-swap), UPS or not.
Particularly for port-multipliers, and also for SSDs or any
externalizable device, it is far easier than you might imagine to
depower a device accidentally. Human error.
* Battery-backed RAID systems are nice, and expensive, but that's no
reason to throw away the more typical installation where the drive
cache is used directly.
This is particularly true for people using SSDs. Sure, a few years
from now I expect most SSDs will be able to flush unwritten dirty data
to local flash. It hasn't happened yet and it doesn't help with
layered caches in the storage path anyway.
I will reiterate that when one is playing with multi-terabyte
filesystems, the margin for error is significantly reduced.
Power loss events WILL OCCUR. Firmware crashes WILL OCCUR.
Power supplies still blow up. It makes no sense to ignore these
sources of error.
* UPSes are great, I have one... but properly powering down systems
attached to a UPS is actually not trivial. In all my years using
UPSes through power failures, systems have only powered down properly
75% of the time. The other 25%... those wound up being hard
power-downs. Bye bye disk cache (and often bye-bye drive, but
that's another matter).
* NFS and other multi-layered filesystems depend on proper
synchronization mechanics for reboot recovery to work
properly. The more layers you have, the more likely something
will break and all your assumptions will go flying out the window.
* VMs can't cache or optimize I/O's if you are forced to use
sync-to-media for every I/O because you can't depend on disk
flushing. For example, a FreeBSD client running on a Linux
host. Goodbye intermediate caching layer if the Linux host
I've run VMs on Windows boxes where the Windows box hardlocks
and completely destroys the 'drive cache'. I found out the
hard way that some VMs ignore the disk synchronization command,
even! But I don't expect that to last long as VMs become more
Basically it comes down to (1) Retaining the ability for devices and
intermediate platforms to properly cache and optimize write I/O,
(2) The fallacy of the assumption that nothing matters unless caches
are battery-backed, and (3) In large scale systems the assumptions
for data integrity have extremely serious consequences if that
promise of data integrity turns out to be not quite true in all
circumstances. Human error guarantees that.
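The JOURNAL1 -> SOFT1 -> JOURNAL2 -> SOFT2 example from the list above can
be counted out in a few lines of Python. This is a deliberately simplified
sketch, not a model of softupdates or of any real journal. It assumes that
without flushes any subset of the four issued writes may be on the media
at crash time, that with a flush after every write only prefixes of the
issue order are possible, and that a SOFTn write is covered only if
JOURNALn is also on the media. It ignores partial writes and the relative
order of committed writes. All names are made up for illustration.

```python
from itertools import combinations

# Issue order from the example; each SOFTn write depends on JOURNALn.
ISSUED = ["JOURNAL1", "SOFT1", "JOURNAL2", "SOFT2"]
DEPENDS_ON = {"SOFT1": "JOURNAL1", "SOFT2": "JOURNAL2"}


def crash_states(barriers):
    """Return the sets of writes that may be on media at crash time."""
    if barriers:
        # A flush after every write: only prefixes of the issue order.
        return [frozenset(ISSUED[:n]) for n in range(len(ISSUED) + 1)]
    # No flushes: assume the drive may have committed any subset.
    return [
        frozenset(subset)
        for size in range(len(ISSUED) + 1)
        for subset in combinations(ISSUED, size)
    ]


def is_recoverable(state):
    # A meta-data write on media is covered only if its journal entry is too.
    return all(DEPENDS_ON[w] in state for w in state if w in DEPENDS_ON)


def count_unrecoverable(barriers):
    return sum(1 for s in crash_states(barriers) if not is_recoverable(s))
```

Under these assumptions 7 of the 16 possible crash states without flushes
contain a meta-data write with no journal entry behind it, while none of
the 5 states reachable with flushes do.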
So I would not-so-humbly suggest that proper media flush semantics
be implemented for any UFS journaling implementation, particularly
one done on top of softupdates. PARTICULARLY if you want to get rid
of fsck for real. For meta-data, of course. write()+fsync()
operations which can be flushed with a single log entry and no meta-data
writes could use relaxed semantics (though IMHO not as a default).
<dillon at backplane.com>