HAMMER installed base

Mon Aug 3 00:18:18 PDT 2009

    (I'll throw Mr Beck into the CC since this directly addresses some
    of the issues he talked about).

:On Sun, Aug 2, 2009 at 8:03 AM, Matthew
:Dillon<dillon at apollo.backplane.com> wrote:
:>    There has been a lot of interest in HAMMER but so far very few people
:>    with the coding experience needed to actually do a port.
:>
:
:Bob Beck from OpenBSD looked into it recently.
:
:http://marc.info/?l=openbsd-misc&m=124819453921723&w=2
:
:Thanks
:...
:Siju

    Yah, I'm very well aware how complex the buffer cache interface
    is.  It was the best I could do.  Perhaps it will seem a bit less
    daunting if I explain some of the issues.

    Basically the buffer cache is used two ways.  First there are
    vnode-backed buffers which pretty much work like normal buffer
    cache buffers.  These are handled via hammer_vnops.c as part
    of standard read() and write() VOPS.  This first use requires
    integration into the OS's buffer cache scheme since the OS's buffer
    cache scheme is usually also responsible for mmap vs file I/O
    data coherency.  Fortunately these buffers are also used almost
    exactly the way OSes expect so it isn't a big deal.

    Secondly there are the block device backed buffers which are used
    for meta-data (and also some forms of file data, read on), and have
    some major requirements.  This use does not fit into other OS's buffer
    cache schemes very well, though Linux's might fit better than e.g.
    NetBSD, OpenBSD, or FreeBSD's.  This is the tough one, but I did take
    it into account and it should be almost entirely encapsulated within
    hammer_io.c.  So it is possible to implement these sorts of buffers
    independently of the vnode-backed buffers.

    The second type of buffers would require major modifications to
    the other OS's buffer cache code if one wanted to use the system
    buffer cache code directly.  Any serious port would probably have to
    do a roll-your-own implementation that does not use the system
    buffer cache for the device backed buffers.  There are many reasons for
    this and it really isn't possible to simplify HAMMER to make the
    device buffers conform well to generic OS buffer cache code.

    * Serialization requirements.  Certain types of buffers must be flushed
      before other types, with an actual disk flush command in between the
      two sets.  e.g. UNDO buffers must be flushed before the UNDO FIFO
      pointers in the volume header can be updated (and the volume header
      flushed), and all of that must occur before buffers related to
      meta-data can be flushed.

    * Veto requirements for read-only mounts and panic shutdowns.  In a panic
      the absolute last thing we want to have happen is for the OS to flush
      the dirty buffers to disk out of order (or at all, really).

    * Aliasing between frontend vnode-backed buffers and backend blockdevice-
      backed buffers.  The same data can be aliased in both under numerous
      circumstances.  These aren't critical path issues but things like
      the HAMMER mirroring code and reblocking code (particularly the
      reblocking code) access data via the block device buffers whereas
      the frontend code (aka read() and write()) accesses data through
      vnode-based buffer cache buffers.  The vnode is simply not available
      to the reblocking code... it operates directly on HAMMER's on-media
      B-Tree.

    * Packed data.  HAMMER packs small bits of data with a 64-byte
      granularity, so data aliasing will occur for these bits of data
      between the two types of buffers.  UFS-style fragment overloading
      (using a frontend buffer to issue device-backed I/O for the fragment)
      does not work.

    * I/O sequencing.  Buffer I/O strategy calls are asynchronous, and the
      callback may have to deal with numerous situations in addition to
      the kernel itself requesting buffer flushes (due to normal buffer
      cache pressure).

    * Clean buffers cannot be thrown away ad-hoc by the OS.  HAMMER must
      be informed of all buffer cache actions made by the OS so it can
      keep the proper internal structure associations with the buffer
      cache buffers.  This applies to the block device backed buffers.

    * The filesystem block size is not fixed.  It is extent based though
      the current implementation only has two extent sizes:  16KB and 64KB.
      The buffer cache has to be able to mix the extent sizes.

      The frontend buffer cache can get away with the use of 16K buffers
      only, with some modifications to write(), but the block device buffer
      cache cannot.

      DragonFly buffer cache buffers use 64-bit byte offsets, not block
      numbers, and can accommodate buffers of different sizes fairly easily.

      Most BSDs can accommodate buffers of different sizes as long as they
      are a multiple of some base buffer block size (which HAMMER's are).
      I don't know about Linux.
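
    To make the mixed extent size point concrete, here is a minimal
    sketch of the offset arithmetic a byte-offset-indexed device buffer
    cache needs.  It is not taken from hammer_io.c; the names
    SMALL_EXTENT, LARGE_EXTENT, extent_base(), ranges_overlap() and
    offset_to_blkno() are hypothetical, and the only facts assumed from
    the text above are the 16KB and 64KB extent sizes and the use of
    64-bit byte offsets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the two sizes come from the discussion above. */
#define SMALL_EXTENT (16 * 1024)
#define LARGE_EXTENT (64 * 1024)

/*
 * Round a 64-bit byte offset down to the start of the extent that
 * contains it.  Works because both extent sizes are powers of two.
 */
int64_t extent_base(int64_t byte_off, int32_t extent_size)
{
    assert(extent_size == SMALL_EXTENT || extent_size == LARGE_EXTENT);
    return byte_off & ~((int64_t)extent_size - 1);
}

/*
 * Return nonzero if two byte ranges overlap.  A cache that mixes 16KB
 * and 64KB buffers has to check this before instantiating a buffer,
 * since a 64KB buffer can cover the same bytes as a 16KB one.
 */
int ranges_overlap(int64_t off_a, int32_t size_a,
                   int64_t off_b, int32_t size_b)
{
    return off_a < off_b + size_b && off_b < off_a + size_a;
}

/*
 * For an OS that indexes buffers by block number rather than byte
 * offset: convert using a base block size that divides both extents.
 */
int64_t offset_to_blkno(int64_t byte_off, int32_t base_block_size)
{
    assert(SMALL_EXTENT % base_block_size == 0);
    return byte_off / base_block_size;
}
```

    For example, byte offset 70000 falls in the extent starting at 65536
    whether the extent is 16KB or 64KB, which is exactly the case where
    the overlap check matters.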

    So, lots of issues.  For someone interested in doing a port I recommend
    rolling your own hammer_io.c to implement the device buffers, and using
    the OS's own buffer cache for the frontend (vnode-backed) buffers.
    Doing this is fairly straightforward; the only potential complexity
    would be in implementing memory pressure feedback to flush/clean-out
    the cache.

    I did try very hard to simplify matters but it's just impossible.
    The ordering requirements for the device-backed buffers cannot be
    worked around.  Fortunately the only ordering requirement for
    frontend (vnode-backed) buffers is that they all get flushed before
    the related meta-data... a fairly easy thing to do using the OS's
    own buffer cache.

    I should note that these ordering requirements are not really
    buffer-buffer ordering or serialization, but more like buffer class
    ordering requirements.  i.e. flush all UNDO in parallel, disk sync,
    flush volume header, disk sync, construct and flush related meta-data
    in parallel.  That's a basic 'flush cycle' in HAMMER.  The meta-data
    flush from the previous flush cycle can run simultaneously with the
    UNDO flush for the next flush cycle.
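
    The flush cycle described above can be sketched as a fixed sequence
    of class flushes separated by disk syncs.  This is an illustration
    only, not HAMMER's flusher: flush_class(), disk_sync(), flush_cycle()
    and the step trace are hypothetical stand-ins that just record the
    order of operations, and the sketch runs one cycle serially rather
    than overlapping the meta-data flush with the next cycle's UNDO
    flush.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step identifiers, one per operation in a flush cycle. */
enum step_kind {
    STEP_FLUSH_UNDO,
    STEP_FLUSH_VOLHDR,
    STEP_FLUSH_META,
    STEP_DISK_SYNC
};

#define MAX_STEPS 16

/* Records the order in which operations were issued. */
struct flush_trace {
    enum step_kind steps[MAX_STEPS];
    size_t count;
};

void trace_step(struct flush_trace *trace, enum step_kind kind)
{
    assert(trace->count < MAX_STEPS);
    trace->steps[trace->count++] = kind;
}

/*
 * Stand-in for "issue every dirty buffer of one class and wait".
 * Buffers inside a class have no ordering among themselves, so a real
 * implementation could issue them all in parallel.
 */
void flush_class(struct flush_trace *trace, enum step_kind kind)
{
    trace_step(trace, kind);
}

/* Stand-in for an actual disk flush command. */
void disk_sync(struct flush_trace *trace)
{
    trace_step(trace, STEP_DISK_SYNC);
}

/* One flush cycle: the ordering is between classes, not buffers. */
void flush_cycle(struct flush_trace *trace)
{
    flush_class(trace, STEP_FLUSH_UNDO);
    disk_sync(trace);
    flush_class(trace, STEP_FLUSH_VOLHDR);
    disk_sync(trace);
    flush_class(trace, STEP_FLUSH_META);
}
```

    The point of the sketch is that the cache only has to know which
    class a dirty buffer belongs to; it never has to order individual
    buffers against each other.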

    The callback/veto features in DragonFly's buffer cache were written
    for HAMMER so everything could use the system buffer cache, and they
    are very complex.  The B_LOCKED stuff, for example, is how HAMMER vetoes
    an OS-requested operation.  OS wants to clean or flush a buffer, HAMMER
    doesn't want to let it, HAMMER sets B_LOCKED.  Rolling your own block
    device cache (rolling your own hammer_io.c) does away with most of
    that OS<->HAMMER complexity... that is, it all becomes internalized
    within HAMMER and the OS doesn't need to know about it at all.
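
    As a rough sketch of what "internalized" could look like in a
    roll-your-own device buffer cache, the reclaim path below responds
    to memory pressure by dropping only clean, unpinned buffers.  All of
    it is hypothetical (struct io_buf, the pinned flag, cache_reclaim());
    the pinned flag merely plays the role B_LOCKED plays in the
    description above, and the sketch assumes dirty buffers are written
    only by the ordered flush cycle, never by the reclaim path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical private device buffer; not a DragonFly struct buf. */
struct io_buf {
    bool dirty;     /* modified; may only be written by the flush cycle */
    bool pinned;    /* filesystem veto: do not reclaim this buffer */
    bool resident;  /* currently holds cached data */
};

/*
 * Respond to memory pressure by releasing up to `want` buffers.
 * Dirty buffers are skipped so nothing is written out of order, and
 * pinned buffers are skipped because the filesystem vetoed reclaim.
 * Returns the number of buffers actually released.
 */
size_t cache_reclaim(struct io_buf *bufs, size_t nbufs, size_t want)
{
    size_t released = 0;

    for (size_t i = 0; i < nbufs && released < want; i++) {
        struct io_buf *bp = &bufs[i];

        if (!bp->resident || bp->dirty || bp->pinned)
            continue;
        /*
         * A real implementation would also tear down the filesystem's
         * internal structures associated with this buffer here.
         */
        bp->resident = false;
        released++;
    }
    return released;
}
```

    Because the same code owns both the veto decision and the reclaim
    loop, no OS-visible flag or callback is needed, and a read-only
    mount or panic path simply never calls the flush cycle.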

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>