HAMMER: Transactional file updates

Tue Aug 19 19:03:54 PDT 2008

    * Unix in general guarantees atomicity between read() and write()
      operations.  This is done at the OS level, not by the filesystem.
      It doesn't matter how large the write() is.  This guarantee is
      only between processes and does not apply to crash recovery.

    * Unix in general does not guarantee atomicity between mmap-read or
      mmap-written blocks and read() or write() ops.  This is because
      the kernel has no way to know what the user program actually
      wants, or when, since all memory accesses are page-based.

    * No filesystem I know of (except maybe BeOS's?) guarantees
      transactional atomicity for I/O on multiple files or for multiple
      I/O operations on a single file which must be treated atomically.
      That is, the situation where program #1 wants to issue several
      reads and writes as a single transaction and program #2 wants to
      do the same, with each program seeing either all or none of the
      other program's operations.

      Neither ZFS nor HAMMER has such a feature insofar as I know,
      for either the non-crash case or the crash recovery case.

      Theoretically HAMMER can arrange flush groups to allow transactions
      to cover multiple files and operations up to a certain point,
      to handle the crash recovery case.  Lots of additional logic would
      be required to handle both the crash and non-crash cases.

      I'm pretty sure ZFS has no way to do this: its block pointer
      updates are block-oriented and discrete, and not suited for
      high-level transactions.

    * No filesystem I know of guarantees write atomicity across a crash
      on the basis of the I/O size specified in a single write().
      A large write() can exhaust filesystem caches and force a flush
      prior to the write() completing.

    * Many filesystems almost always handle small writes atomically
      across a crash, provided the write fits within the filesystem
      block size (typically 8-64K).  That is, even if the write()
      covers multiple sectors, if it fits within the filesystem block
      abstraction then it can be recovered atomically.

      ZFS and HAMMER would be good examples.  UFS would not.

    * Filesystems such as HAMMER can theoretically guarantee atomicity
      for medium-sized writes, as long as all the operations fit into
      a single flush group.  This would be on the order of a
      several-megabyte write.  Such a feature could also be made to work
      across several files.

      However, HAMMER currently has no API to implement such a guarantee.

    * ZFS updates are block-oriented, I believe, which means that ZFS
      makes no such guarantee.  That is, ZFS can wind up breaking up
      large writes, or even medium-sized writes which cross a filesystem
      block boundary, into separate updates on the disk.

      My understanding is that, as with HAMMER, ZFS could theoretically
      make such guarantees within the context of a single file, by delaying
      the updating of the block pointers (for ZFS) or by arranging things
      in the same flush group (for HAMMER), but probably not within the
      context of multiple files.  I could be wrong here; I don't know
      whether ZFS also implements a forward log and/or UNDO FIFO.
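
    To illustrate the block-size points in the list above, here is a
    minimal sketch of how a single write() maps onto filesystem blocks.
    The helper names and the 16K default block size are assumptions made
    for illustration only; they are not part of any HAMMER or ZFS API,
    and a real filesystem may group or split updates differently.

```python
def split_into_block_updates(offset, length, block_size=16384):
    """Return (block_index, start_in_block, nbytes) for each filesystem
    block touched by a write of `length` bytes at byte `offset`.

    block_size is an assumed value, chosen only for illustration.
    """
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset/length must be >= 0 and block_size > 0")
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        block = pos // block_size
        block_end = (block + 1) * block_size
        nbytes = min(end, block_end) - pos
        pieces.append((block, pos - block * block_size, nbytes))
        pos += nbytes
    return pieces


def fits_in_one_block(offset, length, block_size=16384):
    """True if the write touches at most one filesystem block."""
    return len(split_into_block_updates(offset, length, block_size)) <= 1
```

    A write for which fits_in_one_block() is true is the kind that
    block-oriented filesystems can usually recover atomically.  A write
    that yields several pieces may end up on disk as several independent
    updates, any subset of which could survive a crash.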

    In addition, it should be noted that multiple writes are not
    guaranteed to be flushed to disk in the same order as they were
    written, at least not without issuing an fsync() after each write().
    That can present even worse problems.

    (Neither ZFS nor HAMMER guarantees write() ordering when recovering
    from a crash, insofar as I know.)
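
    The fsync()-after-each-write() approach mentioned above can be
    sketched roughly as follows.  The function name write_in_order() is
    hypothetical.  The sketch only forces each write to be flushed before
    the next one is issued; it does not make any individual write atomic
    across a crash, and it will be slow.

```python
import os


def write_in_order(path, chunks):
    """Append each chunk to `path`, calling fsync() after every write so
    that an earlier chunk is flushed before a later one is issued."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            # os.write() may write fewer bytes than requested, so loop.
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
            # Flush this chunk before starting the next write.
            os.fsync(fd)
    finally:
        os.close(fd)
```

    For example, write_in_order("journal.log", [b"step 1\n", b"step 2\n"])
    should never leave "step 2" on disk without "step 1" after a crash,
    assuming the filesystem and the storage device honor fsync().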

    Making a fully transaction-capable filesystem would require major
    kernel support as well as filesystem support.  It isn't something
    that a filesystem can do without the kernel or the kernel can do
    without the filesystem.  The kernel support would handle the
    operational (non-crash) cases while the filesystem support would be
    required to handle the crash recovery cases.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>