Notes on disk flush commands and why they are important.

Matthew Dillon dillon at
Sun Jan 10 12:15:01 PST 2010

    When I think about power-down situations I've always been scared
    to death of pulling the plug on a normal hard drive that might be
    writing (for good reason), but SSDs are an entirely different matter.

    Even when testing HAMMER I don't pull power on a real HD.  I do it
    with a USB stick (Hmm, I should probably buy an SSD now that capacities
    are getting reasonable and do depowering tests with it).  Testing
    HAMMER on a normal HD involves only pulling the SATA port :-).

    Pulling USB sticks and HD SATA ports with a live-mounted filesystem
    doing heavy writing is rather fun.  Even though USB doesn't generally
    support media sync, those cheap sticks tend to serialize writes anyway
    (having no real cache on-stick), so it is a reasonable simulation.

    I don't know if gjournal is filesystem-aware.  One of the major issues
    with softupdates is that there is no demarcation point that you can
    definitively roll back to which guarantees a clean fsck.  On the
    other hand, even though a journal cannot create proper bulk barriers
    without being filesystem-aware, the journal can still enforce
    serialization of write I/O (from a post-recovery rollback standpoint),
    and that would certainly make a big difference with regard to fsck
    not choking on misordered data.

    Scott mentioned barriers vs BIO_FLUSH.  I was already assuming that
    Jeff's journaling code at least used barriers (i.e. waits for all
    prior I/O to complete before issuing dependent I/O).  That is mandatory
    since both NCQ on the device and (potentially) bioqdisksort() (if it
    is still being used) will reorder write BIOs in-progress.

    In an environment where a very high volume of writes is being pipelined
    into a hard drive, the hard drive's own RAM cache will start stalling
    the write BIOs and there will be a continuous *SEVERE* reordering of the
    data as it gets committed to the media.  BIO_FLUSH is the only safe
    way to deal with that situation.
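
    To make the difference concrete, here is a toy model of a drive with
    a volatile write cache.  It is only a sketch: SimulatedDrive, commit()
    and the sector numbers are made up for illustration and are not any
    kernel API.  A write() that returns stands in for a completed BIO,
    and flush() stands in for BIO_FLUSH.

```python
import random


class SimulatedDrive:
    """Toy drive with a volatile write cache (hypothetical, not a real API)."""

    def __init__(self, seed=0):
        self.cache = {}   # sector -> data, acknowledged but not yet on media
        self.media = {}   # sector -> data, survives power loss
        self.rng = random.Random(seed)

    def write(self, sector, data):
        # The drive reports completion as soon as the data is in its cache.
        self.cache[sector] = data

    def flush(self):
        # Stand-in for a cache flush: every acknowledged write reaches media.
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # The drive commits cached sectors in any order it likes, so an
        # arbitrary subset has reached the media when the power drops.
        pending = list(self.cache.items())
        self.rng.shuffle(pending)
        survivors = pending[: self.rng.randrange(len(pending) + 1)]
        self.media.update(survivors)
        self.cache.clear()


def commit(drive, use_flush):
    drive.write(10, "journal record")      # must be durable first
    if use_flush:
        drive.flush()
    drive.write(20, "meta-data update")    # depends on the journal record


def ordering_violated(drive):
    # Bad state: the dependent update is on media but the journal record is not.
    return 20 in drive.media and 10 not in drive.media


def count_violations(use_flush, trials=200):
    bad = 0
    for seed in range(trials):
        drive = SimulatedDrive(seed)
        commit(drive, use_flush)
        drive.power_loss()
        if ordering_violated(drive):
            bad += 1
    return bad
```

    In this model, waiting for the first write to complete before issuing
    the second is not enough, because completion only means the data is
    in the drive's cache.  count_violations(False) finds power-loss cases
    where the dependent update landed without the journal record, while
    count_violations(True) finds none.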

    I strongly believe that the use of BIO_FLUSH is mandatory for any
    meta-data updates.  One can optimize the write()+fsync() path as a
    separate journal-only intent log entry which does not require a
    BIO_FLUSH (as it would not involve any meta-data updates at
    all to satisfy the fsync() requirements), and continue to use proper
    BIO_FLUSH semantics for the actual softupdates-related updates.

    By default, in HAMMER, fsync() will use BIO_FLUSH anyway, but I'm
    working on a more relaxed feature this week which does precisely what
    I described: it writes small amounts of write() data directly into the
    REDO log to satisfy the fsync() requirements and then worries about
    the actual data/meta-data updates later.  The BIO_FLUSH for *JUST*
    that logical log entry then becomes optional.
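
    The general shape of that fast path can be sketched as below.  This
    is not HAMMER code; ToyFilesystem, redo_limit and the in-memory
    dictionaries are assumptions made up for the illustration, and the
    sketch simply treats an append to redo_log as durable.

```python
class ToyFilesystem:
    """Hypothetical model of an fsync() fast path through a REDO log."""

    def __init__(self, redo_limit=4096):
        self.redo_log = []    # stand-in for the on-disk REDO log
        self.dirty = {}       # (path, offset) -> data buffered in memory
        self.stable = {}      # stand-in for data/meta-data on the media
        self.flushes = 0      # number of full sync+flush cycles issued
        self.redo_limit = redo_limit

    def write(self, path, offset, data):
        self.dirty[(path, offset)] = data

    def fsync(self, path):
        mine = {k: v for k, v in self.dirty.items() if k[0] == path}
        size = sum(len(v) for v in mine.values())
        if size <= self.redo_limit:
            # Fast path: log the write() data itself and defer the real
            # data/meta-data update.  The buffers stay dirty for later.
            for (p, offset), data in mine.items():
                self.redo_log.append((p, offset, data))
        else:
            # Too much data for the log: fall back to a full sync.
            self.sync_all()

    def sync_all(self):
        # Toy simplification: syncs every dirty buffer, not just one file.
        self.stable.update(self.dirty)
        self.dirty.clear()
        self.redo_log.clear()
        self.flushes += 1

    def recover(self):
        # After a crash the in-memory buffers are gone; replay the log.
        self.dirty.clear()
        for p, offset, data in self.redo_log:
            self.stable[(p, offset)] = data
        self.redo_log.clear()
```

    A small write followed by fsync() costs no full flush in this model,
    yet the data is still recoverable from the log after a crash.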


    I see another issue with the SUJ stuff, though it is minor in comparison
    to the others.  It is not a good idea to depend on a CRC to validate
    log records in the nominal recovery case.  That is, the CRC should only
    be used to detect hard failure cases such as actual data corruption.
    What I do with HAMMER's UNDO/REDO log is place a log header with a
    sequence number on every single 512-byte boundary, as well as preformat
    the log area to guarantee that all lost sector writes are detected and
    to guarantee that no stale data will ever be misinterpreted as a log
    entry, without having to depend on the CRC (which I also have, of course).
    Large UNDO/REDO records are broken down into smaller pieces as necessary
    so as not to cross a 512-byte boundary.
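
    As a rough illustration of the layout idea (not HAMMER's actual
    on-disk format), the sketch below splits a large record into pieces
    so that every 512-byte sector starts with its own header.  The header
    fields, MAGIC value and pack_record() are invented for the example.

```python
import struct

SECTOR = 512
# Hypothetical header: magic, sequence number, payload length (14 bytes).
HEADER = struct.Struct("<IQH")
MAGIC = 0x4C4F4748
PAYLOAD_MAX = SECTOR - HEADER.size


def pack_record(payload, first_seq):
    """Split payload into sector-sized pieces, each with its own header."""
    sectors = []
    seq = first_seq
    for start in range(0, len(payload), PAYLOAD_MAX):
        piece = payload[start:start + PAYLOAD_MAX]
        header = HEADER.pack(MAGIC, seq, len(piece))
        # Pad to a full sector so no piece ever crosses a 512-byte boundary.
        sectors.append((header + piece).ljust(SECTOR, b"\0"))
        seq += 1
    return sectors
```

    With this layout a 1200-byte record becomes three sectors carrying
    consecutive sequence numbers, and offset 0 of every sector only ever
    holds a header.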

    If one does not do this then the nominal failure with misordered writes
    can lay down the log header in one sector but fail to have laid down
    the rest of the log record in one or more other sectors that the record
    covers.  One must then rely on the CRC to detect the case, which is
    dangerous because any mis-parsed information in the journal can destroy
    the filesystem faster than normal corruption would, and the prior
    contents of the log being overwritten will be partially valid or
    patterned and might defeat the CRC.  The sequence number check is only
    sufficient for scan-detecting the span of the log if the disk offset
    in question only EVER contains log headers and not log bodies, i.e.
    the offset is either sector-aligned or part of an atomic sector (aka
    512 bytes on-media) and can never contain data at that particular
    offset.
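
    The scan itself can be sketched as follows, again under made-up
    assumptions: the header format, the use of sequence number 0 for
    preformatted sectors, and a linear (non-wrapping) log are all
    simplifications for the example and not taken from any real
    filesystem.

```python
import struct

SECTOR = 512
# Hypothetical header: magic, sequence number, payload length.
HEADER = struct.Struct("<IQH")
MAGIC = 0x4C4F4748


def make_sector(seq, payload=b""):
    return (HEADER.pack(MAGIC, seq, len(payload)) + payload).ljust(SECTOR, b"\0")


def preformat(nsectors):
    # Sequence number 0 marks a sector that has never held a log entry.
    return [make_sector(0) for _ in range(nsectors)]


def scan_valid_span(log, first_seq):
    """Count leading sectors whose headers carry first_seq, first_seq+1, ..."""
    expected = first_seq
    count = 0
    for sector in log:
        magic, seq, length = HEADER.unpack_from(sector)
        if magic != MAGIC or seq != expected or length > SECTOR - HEADER.size:
            break   # lost or stale sector: the valid log ends here
        expected += 1
        count += 1
    return count


# Sectors 0, 1 and 3 reached the media; the write to sector 2 was lost.
log = preformat(8)
for i in (0, 1, 3):
    log[i] = make_sector(100 + i, b"data")
span = scan_valid_span(log, 100)
```

    Because every sector boundary holds a header and nothing else, the
    lost sector shows up as a sequence mismatch and the scan stops after
    two sectors, without the CRC ever being consulted.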

    From the code documentation the jsegrec (overall) structure appears
    to be able to span multiple disk sectors.

