Notes on disk flush commands and why they are important.
dillon at apollo.backplane.com
Sun Jan 10 12:15:01 PST 2010
When I think about power-down situations I've always been scared
to death of pulling the plug on a normal hard drive that might be
writing (for good reason), but SSDs are an entirely different matter.
Even when testing HAMMER I don't pull power on a real HD. I do it
with a USB stick (Hmm, I should probably buy an SSD now that capacities
are getting reasonable and do depowering tests with it). Testing
HAMMER on a normal HD involves only pulling the SATA port :-).
Pulling USB sticks and HD SATA ports with a live-mounted filesystem
doing heavy writing is rather fun. Even though USB doesn't generally
support media sync, those cheap sticks tend to serialize writes anyway
(having no real cache on-stick), so it is a reasonable simulation.
I don't know if gjournal is filesystem-aware. One of the major issues
with softupdates is that there is no demarcation point that you can
definitively roll back to which guarantees a clean fsck. On the
other hand, even though a journal cannot create proper bulk barriers
without being filesystem-aware, the journal can still enforce
serialization of write I/O (from a post-recovery rollback standpoint),
and that would certainly make a big difference with regard to fsck
not choking on misordered data.
Scott mentioned barriers vs BIO_FLUSH. I was already assuming that
Jeff's journaling code at least used barriers (i.e. waits for all
prior I/O to complete before issuing dependent I/O). That is mandatory
since both NCQ on the device and (potentially) bioqdisksort() (if it
is still being used) will reorder write BIOs in-progress.
In an environment where a very high volume of writes is being pipelined
into a hard drive, the hard drive's own RAM cache will start stalling the
write BIOs and there will be a continuous *SEVERE* reordering of the
data as it gets committed to the media. BIO_FLUSH is the only safe
way to deal with that situation.
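
The distinction between ordering and flushing can be illustrated from
user space. The following is only an analogy, not the kernel BIO path:
it uses POSIX pwrite() and fsync() in place of write BIOs and BIO_FLUSH,
and ordered_commit() is a hypothetical helper name. Whether fsync()
really results in a drive cache flush depends on the OS, the filesystem
and the drive settings, so treat this as a sketch of the intent only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `data`, force it to stable storage, and only
 * then write the commit marker that depends on it.  Without the first
 * fsync() the two writes could reach the media in either order.
 * Returns 0 on success, -1 on error.
 */
int ordered_commit(int fd, const void *data, size_t len, off_t data_off,
                   const void *commit, size_t clen, off_t commit_off)
{
    assert(fd >= 0);

    if (pwrite(fd, data, len, data_off) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)         /* flush point: data must be durable first */
        return -1;
    if (pwrite(fd, commit, clen, commit_off) != (ssize_t)clen)
        return -1;
    return fsync(fd);           /* make the commit marker durable as well */
}
```

Merely waiting for the first pwrite() to return corresponds to the
barrier case: the write has been handed off but may still sit in a
volatile cache. The fsync() between the two writes plays the role the
flush plays in the discussion above.
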
I strongly believe that the use of BIO_FLUSH is mandatory for any
meta-data updates. One can optimize the write()+fsync() path as a
separate journal-only intent log entry which does not require a
BIO_FLUSH (as it would not involve any meta-data updates at
all to satisfy the fsync() requirements), and continue to use proper
BIO_FLUSH semantics for the actual softupdates-related updates.
By default, in HAMMER, fsync() will use BIO_FLUSH anyway, but I'm
working on a more relaxed feature this week which does precisely what
I described: it writes small amounts of write() data directly into the
REDO log to satisfy the fsync() requirements and then worries about
the actual data/meta-data updates later. The BIO_FLUSH for *JUST*
that logical log entry then becomes optional.
I see another issue with the SUJ stuff though it is minor in comparison
to the others. It is not a good idea to depend on a CRC to validate
log records in the nominal recovery case. That is, the CRC should only
be used to detect hard failure cases such as actual data corruption.
What I do with HAMMER's UNDO/REDO log is place a log header with a
sequence number on every single 512-byte boundary, as well as preformat
the log area to guarantee that all lost sector writes are detected and
to guarantee that no stale data will ever be misinterpreted as a log
entry, without having to depend on the CRC (which I also have, of course).
Large UNDO/REDO records are broken down into smaller pieces as necessary
so as not to cross a 512-byte boundary.
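
The layout described above can be sketched roughly as follows. This is
NOT HAMMER's actual on-disk format: struct sector_hdr, its fields, the
magic value, and the log_format()/log_append() names are all made up for
illustration, and the sketch treats the log as a linear in-memory buffer
rather than a circular FIFO on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define LOG_MAGIC   0x4C4F4748u   /* hypothetical marker value */
#define F_FIRST     0x1u          /* first piece of a logical record */
#define F_LAST      0x2u          /* last piece of a logical record */

/* Hypothetical header placed at the start of every 512-byte sector. */
struct sector_hdr {
    uint32_t magic;
    uint32_t seq;     /* 0 = preformatted/empty, live sectors count up from 1 */
    uint16_t len;     /* payload bytes used in this sector */
    uint16_t flags;
    uint32_t crc;     /* unused here; a real log would also fill this in */
};

#define PAYLOAD_MAX (SECTOR_SIZE - sizeof(struct sector_hdr))

/* Preformat the log so every sector starts with a valid, empty header. */
void log_format(uint8_t *log, size_t nsectors)
{
    struct sector_hdr h = { LOG_MAGIC, 0, 0, 0, 0 };

    memset(log, 0, nsectors * SECTOR_SIZE);
    for (size_t i = 0; i < nsectors; i++)
        memcpy(log + i * SECTOR_SIZE, &h, sizeof(h));
}

/*
 * Append one logical record starting at sector `pos`, splitting it so that
 * no piece crosses a sector boundary.  Returns the number of sectors used,
 * or 0 if the record does not fit.
 */
size_t log_append(uint8_t *log, size_t nsectors, size_t pos,
                  uint32_t *next_seq, const void *rec, size_t rec_len)
{
    size_t need = (rec_len + PAYLOAD_MAX - 1) / PAYLOAD_MAX;
    const uint8_t *src = rec;

    assert(rec_len > 0);
    if (pos > nsectors || need > nsectors - pos)
        return 0;

    for (size_t i = 0; i < need; i++) {
        size_t chunk = rec_len < PAYLOAD_MAX ? rec_len : PAYLOAD_MAX;
        struct sector_hdr h = { LOG_MAGIC, (*next_seq)++, (uint16_t)chunk, 0, 0 };
        uint8_t *sec = log + (pos + i) * SECTOR_SIZE;

        if (i == 0)
            h.flags |= F_FIRST;
        if (i == need - 1)
            h.flags |= F_LAST;
        memcpy(sec, &h, sizeof(h));
        memcpy(sec + sizeof(h), src, chunk);
        src += chunk;
        rec_len -= chunk;
    }
    return need;
}
```

With a 16-byte header each sector carries at most 496 bytes of payload,
so a 1000-byte record would occupy three consecutive sectors, each with
its own header and sequence number.
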
If one does not do this then the nominal failure with misordered writes
can lay down the log header in one sector but fail to have laid down
the rest of the log record in one or more other sectors that the record
covers. One must then rely on the CRC to detect the case, which is
dangerous because any mis-parsed information in the journal can destroy
the filesystem faster than normal corruption would, and the prior
contents of the log being overwritten will be partially valid or
patterned and might defeat the CRC. The sequence number check is only
sufficient for scan-detecting the span of the log if the disk offset
in question only EVER contains log headers and not log bodies, i.e.
it is either aligned or part of an atomic sector (aka 512 bytes on-media)
and can never contain data at that particular offset.
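
A scan along those lines might look like the sketch below. It assumes a
hypothetical layout in which every 512-byte sector begins with a small
header carrying a magic value and a sequence number (struct sector_hdr
and log_scan_span() are invented names, not HAMMER or SUJ code). Because
only sector-aligned offsets are ever examined, payload bytes can never
be mistaken for a header, and a lost sector write shows up as a break in
the sequence without consulting the CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define LOG_MAGIC   0x4C4F4748u   /* hypothetical marker value */

/* Hypothetical header found at the start of every log sector. */
struct sector_hdr {
    uint32_t magic;
    uint32_t seq;
    uint16_t len;
    uint16_t flags;
    uint32_t crc;
};

/*
 * Starting at sector `start`, count how many consecutive sectors carry the
 * sequence numbers first_seq, first_seq + 1, ...  The scan stops at the
 * first sector whose header is stale (older sequence number), missing, or
 * out of order.
 */
size_t log_scan_span(const uint8_t *log, size_t nsectors, size_t start,
                     uint32_t first_seq)
{
    size_t n = 0;

    assert(start <= nsectors);
    for (size_t i = start; i < nsectors; i++, n++) {
        struct sector_hdr h;

        memcpy(&h, log + i * SECTOR_SIZE, sizeof(h));
        if (h.magic != LOG_MAGIC || h.seq != first_seq + (uint32_t)n)
            break;
    }
    return n;
}
```

If the sectors hold sequence numbers 7, 8, 9, 4, 11 (the fourth sector's
write was lost and it still holds an older header), a scan starting at
sequence 7 reports a span of three sectors and recovery stops there
rather than trusting the sector tagged 11.
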
From the code documentation the jsegrec (overall) structure appears
to be able to span multiple disk sectors.