Hammer: Transactional file updates

Dmitri Nikulin dnikulin at gmail.com
Tue Aug 19 16:31:01 PDT 2008

On Tue, Aug 19, 2008 at 2:14 AM, Joerg Sonnenberger
<joerg at britannica.bec.de> wrote:
> On Sun, Aug 17, 2008 at 08:40:20PM +1000, Dmitri Nikulin wrote:
>> I personally believe that Unix should have had a transactional file IO
>> API from the start, so that all modern file systems would implement it
>> and atomicity would be the standard, not a rare feature.
> I am not exactly sure what you mean with "atomicity", but can you
> demonstrate even *one* filesystem where writes of two processes are
> atomic relative to each other? I don't know any.

The COW approach in ZFS appears to do exactly that. The block pointer
is not switched to the new copy of a block until that copy has been
fully written. While one process is writing, another process keeps
reading the old block, and only sees the new one once the write
completes and the pointer is updated. That holds regardless of whether
or not the change has been finalised on disk.
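
To make the pointer-swap idea concrete, here is a toy in-memory model
in Python. It is only a sketch of the general copy-on-write technique,
not of how ZFS is actually implemented; the class name CowBlock and its
methods are made up for illustration.

```python
import threading


class CowBlock:
    """Toy copy-on-write block (hypothetical, not ZFS code).

    Readers follow one reference to an immutable buffer, so they see
    either the complete old copy or the complete new copy.
    """

    def __init__(self, data: bytes):
        self._current = data            # plays the role of the block pointer
        self._lock = threading.Lock()   # serialises writers only

    def read(self) -> bytes:
        # No lock needed: the buffer behind the pointer never changes.
        return self._current

    def write(self, offset: int, payload: bytes) -> None:
        with self._lock:
            new_copy = bytearray(self._current)               # copy the old block
            new_copy[offset:offset + len(payload)] = payload  # modify the copy
            self._current = bytes(new_copy)                   # publish with one swap


block = CowBlock(b"aaaa")
before = block.read()     # a reader that started before the write
block.write(1, b"bb")
after = block.read()      # a reader that started after the swap
```

The reader holding `before` still has the untouched old contents, and
no reader can ever observe a half-modified buffer.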

> There are also very good reasons why Unix filesystem IO never was
> transactional. It is way too expensive and complex to allow that.

I never said transactional IO should be the *only* way of doing IO.
If a transactional API were standard, most file systems would
implement it, at least to the minimum the API requires. As it is, most
applications that really need transactions end up implementing them
their own way. That is fine in theory, but on some relatively unsafe
filesystems like UFS and ext2, even a very thorough transactional file
format (e.g. SQLite) can be corrupted, especially if the operation
changes metadata such as the file size, which happens all the time
while accumulating database inserts.
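
As an example of the "do it their own way" approach, a common pattern
is to write a temporary file, fsync it, and rename it over the
original. The Python sketch below assumes a POSIX system, where
rename() within one filesystem is atomic; the function name
atomic_write is hypothetical. Note that it still relies on the
filesystem honouring fsync and rename ordering, which is exactly the
part an application cannot control.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # push file data to disk first
        os.replace(tmp_path, path)      # atomic rename over the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only covers whole-file replacement. Something like a database
appending records in place needs a journal or log on top, which is
where the metadata problems above come in.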

Even on ext3 in writeback (aka fast but unsafe) mode you can have a
situation where the metadata has been updated but the actual file data
hasn't been written yet. If the machine crashes at that point, it's
anyone's guess what's actually in the file on the next boot. The
solution is to use full journaling (very slow) or at least ordered
mode (default). Most filesystems in use today have neither!
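
The difference can be shown with a toy crash simulation in Python.
This is a deliberately simplified model, not ext3 code: the "disk" is
a dict with a size and a data field, the function name crash_after is
made up, and the writeback case assumes the worst-case order (metadata
first), which writeback mode permits but does not mandate.

```python
def crash_after(steps, crash_at):
    """Apply only the first `crash_at` steps, then return what a reader
    would see after reboot in this toy model."""
    disk = {"size": 0, "data": b""}
    for field, value in steps[:crash_at]:
        disk[field] = value
    # The file appears `size` bytes long; bytes that were never written
    # show up as stale garbage, rendered here as "?".
    return disk["data"][:disk["size"]].ljust(disk["size"], b"?")


payload = b"row1"

# Ordered mode: data reaches disk before the metadata that exposes it.
ordered = [("data", payload), ("size", len(payload))]

# Writeback mode, worst case: metadata is committed before the data.
writeback = [("size", len(payload)), ("data", payload)]

ordered_crash = crash_after(ordered, 1)      # crash between the two steps
writeback_crash = crash_after(writeback, 1)
```

In this model the ordered sequence leaves the old, empty but
consistent file after a crash, while the writeback sequence leaves a
file of the new length whose contents were never written.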

Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
