HAMMER filesystem update - design document

Matthew Dillon dillon at apollo.backplane.com
Sat Oct 13 18:05:41 PDT 2007

:Anyone up on ReiserFS ?
:(but still capable of a 'clean room' description :)
:As I recall, according to their docs it seems to have been one of the
:first to use BTrees in the general sense for internal structuring ..
:also as I recall, there were some performance problems in specific areas
:of requiring extra CPU for basic IO (due to having to compute tree
:operations rather than do simple pointer manipulations) and also
:concurrent IO (due to the need for complex tree locking types of things,
:possibly compounded by the extra cpu time)
:this kind of a thing is more for replication than 100% raw speed, but
:in any case.. just some topics for discussion I suppose ..
:I still need to read the more detailed commits.
:looking forward to it.
:- Chris

    I've never looked at the Reiser code though the comments I get from
    friends who use it are on the order of 'extremely reliable but not
    the fastest filesystem in the world'.

    I don't expect HAMMER to be slow.  A B-Tree typically uses a fairly
    small radix in the 8-64 range (HAMMER uses 8 for now).  A standard
    indirect block methodology typically uses a much larger radix, such
    as 512, but is only able to organize information in a very restricted,
    linear way.
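    As a rough illustration of that depth trade-off (a hypothetical sketch,
    not HAMMER code; the key count is made up):

```python
import math

def tree_depth(num_keys: int, radix: int) -> int:
    # Levels needed to index num_keys entries at the given fan-out.
    return max(1, math.ceil(math.log(num_keys, radix)))

# Hypothetical figure: one million keys.
depth_btree    = tree_depth(1_000_000, 8)    # small B-Tree radix
depth_indirect = tree_depth(1_000_000, 512)  # classic indirect-block radix
```

    The smaller radix costs more levels per lookup, which is exactly why the
    physical layout of those levels matters.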

    There are several tricks to making a B-Tree operate efficiently, but
    the main thing you want to do is to issue large I/Os which cover
    multiple B-Tree nodes and then arrange the physical layout of the B-Tree
    such that a linear I/O will cover the most likely path(s), thus
    reducing the actual number of physical I/Os needed.  Locality of
    reference is important.
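    To put numbers on that (the node and I/O sizes here are assumptions for
    illustration, not HAMMER's actual constants):

```python
NODE_SIZE = 4096        # assumed on-media B-Tree node size
IO_SIZE   = 64 * 1024   # one large linear read

# Nodes fetched by a single physical I/O when laid out contiguously.
nodes_per_io = IO_SIZE // NODE_SIZE
```

    With a small radix a lookup path is only a handful of nodes deep, so if
    the likely path is laid out linearly, a single read can pull in the
    entire path.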

    HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree
    operations, because it doesn't need an intact B-Tree for recovery of the
    filesystem.  It can reconstruct the B-Tree for a cluster by scanning
    the records in the cluster, comparing a stored transaction id against the
    transaction id in each record to determine what can be restored and
    what may still have had asynchronous I/O pending and thus cannot be.
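    A minimal sketch of that recovery rule (the field names are
    hypothetical):

```python
def recoverable(records, synced_tid):
    # Keep only records whose transaction id is at or below the
    # cluster's stored (known-synced) transaction id; anything newer
    # may have had asynchronous I/O still in flight at the crash.
    return [r for r in records if r["tid"] <= synced_tid]

records = [{"tid": 5}, {"tid": 9}, {"tid": 12}]
kept = recoverable(records, synced_tid=9)   # the tid-12 record is dropped
```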

    HAMMER will implement one B-Tree per cluster (where a cluster is e.g.
    64MB), and then hook clusters together at B-Tree leaf nodes (B-Tree
    leaf -> root of some other cluster).  This means that HAMMER will be
    able to lock modifying operations cluster-by-cluster at the very
    least and hopefully greatly improve the amount of parallelism supported
    by the filesystem.
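    The concurrency win can be sketched like this (hypothetical structures,
    not kernel code):

```python
import threading

class Cluster:
    # One B-Tree, and therefore one modify lock, per cluster.
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.lock = threading.Lock()

def modify(cluster, op):
    # Only this cluster is serialized; modifying operations on
    # other clusters proceed fully in parallel.
    with cluster.lock:
        return op(cluster)

a, b = Cluster(0), Cluster(1)
result = modify(a, lambda c: c.cluster_id)
```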

    HAMMER uses an index-record-data approach.  Each cluster has three types
    of information in it:  Indexes, records, and data.  The index is a B-Tree
    and B-Tree nodes will replicate most of the contents of the records as
    well as supply a direct pointer to the related data.  B-Tree nodes will
    be localized in typed filesystem buffers (that is, grouped with other
    B-Tree nodes), and B-Tree filesystem buffers will be intermixed with
    data filesystem buffers to a degree, so it should have extremely
    good caching characteristics.  I tried to take into consideration how 
    hard drives cache data (which is typically whole tracks to begin with)
    and incorporate that into the design.
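    The relationship between a B-Tree element and its record might look
    something like this (the field names are illustrative guesses, not the
    on-media format):

```python
from dataclasses import dataclass

@dataclass
class BTreeElement:
    # Replicates the key fields of the record it indexes...
    obj_id: int
    key: int
    tid: int
    # ...and points straight at the data, so a lookup that hits
    # the element usually doesn't need to touch the record at all.
    data_offset: int
    data_len: int

elem = BTreeElement(obj_id=1, key=0, tid=42, data_offset=8192, data_len=512)
```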

    Finally, HAMMER is designed to allow cluster-by-cluster reoptimization
    of the storage layout.  Anything that isn't optimally laid out at the
    time it is created can be re-laid out at some later time, e.g. by
    a continuously running background process or a nightly cron job or
    something of that ilk.  This will allow HAMMER to choose an
    expedient layout instead of an optimal one in its critical path and then
    'fix' the layout later on to make re-accesses optimal.  I've left a ton
    of bytes free in the filesystem buffer headers for records and clusters
    for (future) usage-pattern tracking heuristics.

    The radix tree bitmap allocator, which has been committed so you can
    take a look at it if you want, is extremely sophisticated.  It should
    be able to optimally allocate and free various types of information
    all the way from megabyte+ sized chunks down to a 64-byte boundary,
    in powers of 2.
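    The power-of-2 granularity works out like this (a sketch of the rounding
    rule only, not of the radix tree bitmap itself):

```python
MIN_ALLOC = 64  # smallest allocation boundary

def alloc_size(request: int) -> int:
    # Round a request up to the next power-of-2 boundary,
    # never below the 64-byte minimum.
    size = MIN_ALLOC
    while size < request:
        size *= 2
    return size

sizes = [alloc_size(n) for n in (1, 64, 100, 4096, 3 * 1024 * 1024)]
```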

    My expectation is that this will lead to a fairly fast filesystem.  We
    will know in about a month :-)

					Matthew Dillon 
					<dillon at backplane.com>
