HAMMER installed base
dillon at apollo.backplane.com
Mon Aug 3 00:18:18 PDT 2009
(I'll throw Mr Beck into the CC since this directly addresses some
of the issues he talked about).
:On Sun, Aug 2, 2009 at 8:03 AM, Matthew
:Dillon<dillon at apollo.backplane.com> wrote:
:>    There has been a lot of interest in HAMMER but so far very few people
:>    with the coding experience needed to actually do a port.
:Bob Beck from OpenBSD looked into it recently.
Yah, I'm very well aware how complex the buffer cache interface
is. It was the best I could do. Perhaps it will seem a bit less
daunting if I explain some of the issues.
Basically the buffer cache is used two ways. First there are
vnode-backed buffers which pretty much work like normal buffer
cache buffers. These are handled via hammer_vnops.c as part
of standard read() and write() VOPS. This first use requires
integration into the OS's buffer cache scheme since the OS's buffer
cache scheme is usually also responsible for mmap vs file I/O
data coherency. Fortunately these buffers are also used almost
exactly the way OSes expect so it isn't a big deal.
Secondly there are the block device backed buffers which are used
for meta-data (and also some forms of file data, read on), and have
some major requirements. This use does not fit into other OS's buffer
cache schemes very well, though Linux's might fit better than e.g.
NetBSD, OpenBSD, or FreeBSD's. This is the tough one, but I did take
it into account and it should be almost entirely encapsulated within
hammer_io.c. So it is possible to implement these sorts of buffers
independently of the vnode-backed buffers.
The second type of buffers would require major modifications to
the other OS's buffer cache code if one wanted to use the system
buffer cache code directly. Any serious port would probably have to
do a roll-your-own implementation that does not use the system
buffer cache for the device backed buffers. There are many reasons for
this and it really isn't possible to simplify HAMMER to make the
device buffers conform well to generic OS buffer cache code.
* Serialization requirements. Certain types of buffers must be flushed
before other types, with an actual disk flush command in between the
two sets. e.g. UNDO buffers must be flushed before the UNDO FIFO
pointers in the volume header can be updated (and volume header
flushed), and that all must occur before buffers related to meta-data
can be flushed.
* Veto requirements for read-only mounts and panic shutdowns. In a panic
the absolute last thing we want to have happen is for the OS to flush
the dirty buffers to disk out of order (or at all, really).
* Aliasing between frontend vnode-backed buffers and backend blockdevice-
backed buffers. The same data can be aliased in both under numerous
circumstances. These aren't critical path issues but things like
the HAMMER mirroring code and reblocking code (particularly the
reblocking code) access data via the block device buffers whereas
the frontend code (aka read() and write()) access data through
vnode-based buffer cache buffers. The vnode is simply not available
to the reblocking code... it operates directly on HAMMER's on-media
* Packed data. HAMMER packs small bits of data with a 64-byte
granularity, so data aliasing will occur for these bits of data
between the two types of buffers. UFS-style fragment overloading
(using a frontend buffer to issue device-backed I/O for the fragment)
does not work.
* I/O sequencing. Buffer I/O strategy calls are asynchronous, and the
callback may have to deal with numerous situations in addition to
the kernel itself requesting buffer flushes (due to normal buffer
* Clean buffers cannot be thrown away ad-hoc by the OS. HAMMER must
be informed of all buffer cache actions made by the OS so it can
keep the proper internal structure associations with the buffer
cache buffers. This applies to the block device backed buffers.
* The filesystem block size is not fixed. It is extent based though
the current implementation only has two extent sizes: 16KB and 64KB.
The buffer cache has to be able to mix the extent sizes.
The frontend buffer cache can get away with the use of 16K buffers
only, with some modifications to write(), but the block device buffer
DragonFly buffer cache buffers use 64 bit byte offsets, not block
numbers, and can accommodate buffers of different sizes fairly easily.
Most BSDs can accommodate buffers of different sizes as long as they
are a multiple of some base buffer block size (which HAMMER's are).
I don't know about Linux.
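
To make the byte-offset point concrete, here is a minimal sketch of device buffers keyed by a 64 bit byte offset with mixed 16KB and 64KB sizes. The struct and function names are made up for illustration and are not HAMMER's or DragonFly's, and treating 16KB as the base buffer block size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BASE_BLOCK_SIZE  16384   /* assumed base block size (the 16KB extent) */
#define LARGE_BLOCK_SIZE 65536   /* the 64KB extent */

/* Hypothetical device buffer descriptor. */
struct dev_extent {
    int64_t offset;  /* byte offset on the device, not a block number */
    int32_t size;    /* length in bytes */
};

/* A size is acceptable if it is a positive multiple of the base block size. */
bool extent_size_ok(int32_t size)
{
    return size > 0 && size % BASE_BLOCK_SIZE == 0;
}

/* True if byte_off falls inside the extent. */
bool extent_contains(struct dev_extent e, int64_t byte_off)
{
    assert(extent_size_ok(e.size));
    return byte_off >= e.offset && byte_off < e.offset + e.size;
}

/* True if two extents share at least one byte, whatever their sizes. */
bool extents_overlap(struct dev_extent a, struct dev_extent b)
{
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}
```

Because lookups compare byte ranges rather than block numbers, a 16KB buffer at offset 0 and a 64KB buffer at offset 16384 can sit side by side in the same cache without any per-size bookkeeping.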
So, lots of issues. For someone interested in doing a port I recommend
rolling your own hammer_io.c to implement the device buffers, and using
the OS's own buffer cache for the frontend (vnode-backed) buffers.
Doing this is fairly straightforward; the only potential complexity
would be in implementing memory pressure feedback to flush/clean-out
I did try very hard to simplify matters but it's just impossible.
The ordering requirements for the device-backed buffers cannot be
worked around. Fortunately the only ordering requirement for
frontend (vnode-backed) buffers is that they all get flushed before
the related meta-data.. a fairly easy thing to do using the OS's
own buffer cache.
I should note that these ordering requirements are not really
buffer-buffer ordering or serialization, but more like buffer class
ordering requirements. i.e. flush all UNDO in parallel, disk sync,
flush volume header, disk sync, construct and flush related meta-data
in parallel. That's a basic 'flush cycle' in HAMMER. The meta-data
flush from the previous flush cycle can run simultaneously with the
UNDO flush for the next flush cycle.
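
The class ordering within one flush cycle can be sketched roughly as below. This is only an illustration of the sequence described above, not HAMMER's actual code: flush_class(), disk_sync() and the event log are hypothetical stand-ins for real I/O, and the overlap between consecutive cycles is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event log standing in for real disk I/O. */
struct flush_log {
    char   events[16];  /* NUL-terminated record, one char per step */
    size_t n;
};

static void log_event(struct flush_log *log, char ev)
{
    assert(log->n + 1 < sizeof(log->events));
    log->events[log->n++] = ev;
    log->events[log->n] = '\0';
}

/*
 * Stand-in: issue every write for one buffer class and wait for all of
 * them.  Writes inside a class may complete in any order.
 */
static void flush_class(struct flush_log *log, char class_tag)
{
    log_event(log, class_tag);
}

/* Stand-in for an actual disk flush command. */
static void disk_sync(struct flush_log *log)
{
    log_event(log, 'S');
}

/* One flush cycle: ordering is enforced between classes, not buffers. */
void flush_cycle(struct flush_log *log)
{
    flush_class(log, 'U');  /* all UNDO buffers, in parallel */
    disk_sync(log);
    flush_class(log, 'V');  /* volume header with the UNDO FIFO pointers */
    disk_sync(log);
    flush_class(log, 'M');  /* related meta-data, in parallel */
}
```

The only hard barriers are the two disk syncs; everything inside a class is free to be issued concurrently.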
The callback/veto features in DragonFly's buffer cache were written
for HAMMER so everything could use the system buffer cache, and they are
very complex. The B_LOCKED stuff, for example, is how HAMMER vetoes
an OS-requested operation. OS wants to clean or flush a buffer, HAMMER
doesn't want to let it, HAMMER sets B_LOCKED. Rolling your own block
device cache (rolling your own hammer_io.c) does away with most of
that OS<->HAMMER complexity... that is, it all becomes internalized
within HAMMER and the OS doesn't need to know about it at all.
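
As a rough sketch of what the internalized veto might look like in a roll-your-own device cache: the filesystem gets a look at each buffer before the cache writes it, and marks the buffer locked to refuse. All names here (fs_state, dev_buf, fs_checkwrite, cache_try_flush) are hypothetical, and the locked flag only plays the role that B_LOCKED plays above; this is not DragonFly's buffer cache API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount state that drives the veto decision. */
struct fs_state {
    bool read_only;
    bool panicked;
};

/* Hypothetical device-backed buffer. */
struct dev_buf {
    bool dirty;
    bool locked;  /* set by the filesystem to veto a flush */
    int  writes;  /* count of simulated writes, for illustration */
};

/* Filesystem hook: refuse writes on read-only mounts and after a panic. */
static void fs_checkwrite(const struct fs_state *fs, struct dev_buf *bp)
{
    if (fs->read_only || fs->panicked)
        bp->locked = true;
}

/*
 * Cache side: called under memory pressure or at sync time.  Returns
 * true only if the buffer was actually written.
 */
bool cache_try_flush(const struct fs_state *fs, struct dev_buf *bp)
{
    assert(fs != NULL && bp != NULL);
    if (!bp->dirty)
        return false;
    fs_checkwrite(fs, bp);
    if (bp->locked)
        return false;        /* vetoed: the buffer stays dirty in memory */
    bp->writes++;            /* stand-in for the real device write */
    bp->dirty = false;
    return true;
}
```

Since both sides live inside the filesystem, the OS never has to learn about the veto at all.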
<dillon at backplane.com>