Hammer: Transactional file updates
dillon at apollo.backplane.com
Tue Aug 19 19:03:54 PDT 2008
* Unix in general guarantees atomicity between read() and write()
operations. This is done at the OS level, not by the filesystem.
It doesn't matter how large the write() is. This guarantee is
only between processes and does not apply to crash recovery.
* Unix in general does not guarantee atomicity between mmap-read or
mmap-written blocks and read() or write() ops. This is because
it has no way to know what the user program actually wants, or when,
since all memory accesses are page-based.
* No filesystem I know of (except maybe BeOS's?) guarantees
transactional atomicity for I/O on multiple files or for multiple
I/O operations on a single file which must be treated atomically.
i.e. the situation where program #1 wants to issue several reads
and writes as a single transaction and program #2 wants to issue
several reads and writes as a single transaction, and for each
program to see the other's transaction as all-or-nothing.
Neither ZFS nor HAMMER has such a feature insofar as I know,
for either the non-crash case or the crash recovery case.
Theoretically HAMMER can arrange flush groups to allow transactions
to cover multiple files and operations up to a certain point,
to handle the crash recovery case. Lots of additional logic would
be required to handle both the crash and non-crash cases.
I'm pretty sure ZFS has no way to do this.... block pointer updates
are block oriented and discrete and not suited for high-level
transactions.
* No filesystem I know of guarantees write atomicity across a crash
on the basis of the I/O size specified in a single write().
A large write() can exhaust filesystem caches and force a flush
prior to the write() completing.
* Many filesystems have the characteristic of almost universally
operating atomically for small writes within the filesystem block
size (typically 8-64K), across a crash situation. That is, even
if the write() covers multiple sectors, as long as it fits within the
filesystem block abstraction it can be recovered atomically.
ZFS and HAMMER would be good examples. UFS would not.
* Filesystems such as HAMMER theoretically can guarantee atomicity
for medium-sized writes, as long as all the operations fit into
a single flush group. This would be on the order of a
several-megabyte write. Such a feature could also be made to work
across several files.
However, HAMMER currently has no API to implement such a guarantee.
* ZFS updates are block oriented, I believe, which means that ZFS makes
no such guarantee. That is, ZFS can wind up breaking up large writes,
or even medium-sized writes which cross the filesystem block size
boundary, into separate updates on the disk.
My understanding is that, as with HAMMER, ZFS could theoretically
make such guarantees within the context of a single file, by delaying
the updating of the block pointers (for ZFS), and arranging things
in the same flush group (HAMMER), but probably not within the
context of multiple files. I could be wrong here, I don't know if
ZFS also implements a forward log and/or UNDO FIFO or not.
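
To make the block-size point above concrete, here is a minimal sketch of how a single write() maps onto filesystem blocks. It only does the arithmetic; it does not talk to any filesystem. The names `BLOCK_SIZE`, `split_at_block_boundaries` and `fits_in_one_block` are made up for illustration, and the 16K block size is an assumption picked from the 8-64K range mentioned above.

```python
# Assumed filesystem block size; real filesystems vary (8-64K is typical).
BLOCK_SIZE = 16 * 1024


def split_at_block_boundaries(offset, length, block_size=BLOCK_SIZE):
    """Return (offset, length) pieces of a write, one per block touched."""
    pieces = []
    end = offset + length
    while offset < end:
        # End of the block containing `offset`, capped at the end of the write.
        block_end = min((offset // block_size + 1) * block_size, end)
        pieces.append((offset, block_end - offset))
        offset = block_end
    return pieces


def fits_in_one_block(offset, length, block_size=BLOCK_SIZE):
    """True when the whole write lands inside a single filesystem block."""
    return len(split_at_block_boundaries(offset, length, block_size)) <= 1
```

A write for which `fits_in_one_block()` is true is the "small write" case that a filesystem like HAMMER or ZFS can typically recover atomically. A write that splits into several pieces is the case where a block-oriented filesystem may commit the pieces as separate updates, so a crash can leave only some of them on disk.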
In addition, it should be noted that multiple writes are not
guaranteed to be flushed to disk in the same order as they were
written, at least not without issuing an fsync() after each write().
That can present even worse problems.
(Neither ZFS nor HAMMER guarantees write() ordering when recovering
from a crash, insofar as I know.)
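
A minimal sketch of the fsync()-after-each-write() approach, using only Python's standard `os` module. The helper name `write_in_order` is hypothetical. It only orders the writes relative to each other; it does not make any individual write() atomic across a crash, and it assumes the OS and the drive actually honor fsync().

```python
import os


def write_in_order(path, chunks):
    """Append each chunk and fsync() it before issuing the next write()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:
                # os.write() may write fewer bytes than asked; loop until done.
                written = os.write(fd, view)
                view = view[written:]
            # Push this chunk to stable storage before the next write()
            # starts, so a later chunk cannot reach the disk ahead of it.
            os.fsync(fd)
    finally:
        os.close(fd)
```

For example, `write_in_order("journal.log", [b"begin\n", b"commit\n"])` should not leave `commit` on disk without `begin` after a crash, at the cost of one flush per write.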
To make a fully transaction-capable filesystem would require major
kernel support as well as filesystem support. It isn't something
that a filesystem can do without the kernel or the kernel can do
without the filesystem. The kernel support would handle the operational
cases while the filesystem support would be required to handle the
crash recovery cases.
<dillon at backplane.com>