syslink development update, HAMMER/ANVIL filesystem development update

Matthew Dillon dillon at apollo.backplane.com
Mon Apr 9 11:13:39 PDT 2007


    I'm making good progress on the syslink code.  The system call keeps
    morphing but in the next commit it should settle down.

    I ran into a bit of a problem with the 'syslink route node' concept
    when I decided I wanted to be able to attach a UDP broadcast socket
    to a syslink route node.  Originally the concept was to attach 
    individual connections only, but I realized that it made little sense
    not to take advantage of hardware-switched broadcast mechanisms
    since the leaves of most clusters were likely to be operating over
    switched networks.

    Attaching a broadcast socket to a syslink route node requires an
    aligned subset of the route node's address space to be associated with
    the socket (in order to represent the network's subnet space and be able
    to directly map some of the address bits to the IP space).  After
    some fooling around I decided to adapt the SUBR_BLIST code to the task
    and have committed SUBR_ALIST as a result.  This will allow the
    syslink route node to trivially attach mixed-size subnets (down to
    single entities).  I hope to have the syslink mesh infrastructure done
    in about another week or so and will then start working on the protocols
    and kernel device/filesystem namespace interfacing.
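
    As a rough illustration of why the subset has to be aligned, here is
    a minimal sketch in Python.  It is not the syslink code: the function
    names, the 0x300 base, and the 192.168.1.0/24 subnet are all made up
    for the example.  The only point is that when the block's base is
    aligned to the subnet size, the low address bits carry straight over
    to the IP host bits with a mask.

```python
import ipaddress

def map_to_ip(addr, base, net):
    """Map a route-node address inside an aligned block to an IP in net.

    Hypothetical helper for illustration; not part of syslink.
    """
    size = net.num_addresses              # always a power of 2 for a CIDR net
    if base % size != 0:
        raise ValueError("block base must be aligned to the subnet size")
    if not base <= addr < base + size:
        raise ValueError("address is outside the block attached to the socket")
    host_bits = addr & (size - 1)         # low bits map directly to host bits
    return net.network_address + host_bits

def map_from_ip(ip, base, net):
    """Inverse mapping: IP address back to a route-node address."""
    host_bits = int(ip) - int(net.network_address)
    return base | host_bits               # safe because base is aligned

net = ipaddress.ip_network("192.168.1.0/24")   # assumed example subnet
base = 0x300                                   # assumed aligned block of 256
ip = map_to_ip(0x325, base, net)
back = map_from_ip(ip, base, net)
```

    With an unaligned base (say 0x310) the mask no longer yields the
    offset within the block, which is why the sketch rejects that case.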

    --

    SUBR_BLIST is a bitmap allocator that I originally developed for FreeBSD
    to manage swap space allocations.  It is implemented using a radix
    tree with hinting.  SUBR_ALIST is the same thing, but designed for
    power-of-2 allocations, and it guarantees alignment to the allocation
    size, e.g. a 4KB allocation would be aligned to a 4KB boundary.

    It turns out that SUBR_ALIST has other interesting characteristics that
    make it suitable for the cluster filesystem design.  In fact, the
    design characteristics also make the ALIST allocator suitable for
    managing physical memory, particularly in its ability to allow us to
    cut up memory into chunks which are properly aligned no matter what
    the page size used.  It would be possible to manage page segment
    mappings entirely dynamically if we wanted to go that direction (not
    on my list, though).

    --

    The current filesystem design revolves around the filesystem layer,
    HAMMER, and the storage layer, ANVIL.

    ANVIL will be responsible for very large block reservations (e.g.
    1MB to 1TB 'blocks' or even larger).  These blocks will be
    self-identifying and assigned to filesystems and/or logical block
    devices.  The key feature of ANVIL will be that since these blocks are
    self-identifying, they can be moved across to different media without
    upper layers knowing or caring.  For example, you could construct a
    filesystem based on ANVIL blocks you allocate from a single disk
    drive, and on a live system migrate some or all of those ANVIL blocks
    to another disk drive.  You could physically move a disk drive from
    one machine to another and the cluster mesh would recognize it, no
    copying or reconfiguring required.

    In order to properly detect ANVIL blocks on physical media, the blocks
    must be aligned in a way that a scan of the physical media is able to
    locate them without any other knowledge.  This will be accomplished
    by doing a power-of-2 scan.  For example, a 256GB hard drive would be
    scanned by checking the block at the 128GB mark, then all the 64GB marks,
    and so on to locate the ANVIL block headers on the media and build an
    allocation/management map in kernel memory.  In addition, it will be
    possible to allocate (and deallocate) ANVIL blocks on the fly (for
    example, to resize a filesystem).  Just like cutting up an address
    space into variable-sized subnets, ANVIL blocks can also be variable
    sized as long as they adhere to the alignment requirements.
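
    The probe order of the power-of-2 scan can be sketched as follows.
    This is only an illustration of the ordering described above, in
    Python; scan_offsets and the 32GB minimum block size are assumptions
    for the example, and the sketch does not probe offset 0, which the
    description above does not cover.

```python
GB = 1 << 30

def scan_offsets(media_size, min_block):
    """Yield byte offsets to probe for block headers, coarsest marks first.

    Hypothetical helper: media_size and min_block must be powers of 2.
    """
    for value in (media_size, min_block):
        if value <= 0 or value & (value - 1):
            raise ValueError("sizes must be powers of 2")
    step = media_size // 2
    while step >= min_block:
        # Odd multiples of step are exactly the marks at this alignment
        # that were not already visited at a coarser alignment.
        for off in range(step, media_size, 2 * step):
            yield off
        step //= 2

# 256GB drive, assumed 32GB minimum block size, offsets shown in GB.
marks = [off // GB for off in scan_offsets(256 * GB, 32 * GB)]
```

    For the 256GB example this visits the 128GB mark, then the 64GB and
    192GB marks, then the remaining 32GB marks.  A real scan would
    presumably stop descending below any header it finds, since the
    header says how large the block is.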

    The ANVIL layer will scan available physical media for ANVIL blocks
    and piece them together into higher-level entities such as filesystems.
    One of the really nice things about ANVIL is that it won't care where
    it gets the pieces from.

    It turns out that the ALIST allocator is the perfect solution to this
    problem because the ALIST alignment requirements are precisely the
    same as the ANVIL alignment requirements.  For example, let's say we
    had a 1TB hard drive and wanted to allow 1GB-1TB ANVIL blocks to be
    allocated out of it.  The ALIST allocation map would thus have to
    support up to 1024 1GB blocks.  Out of this 1TB drive we might allocate,
    say, a few dozen 1GB blocks, a few dozen 16GB blocks, maybe a large
    128GB block, and so forth.  The ALIST allocator would be used to do
    this and eat up only around 384 bytes of kernel memory to manage the
    1024 'block' space.
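
    To make the example concrete, here is a minimal sketch of an aligned
    power-of-2 allocator over a 1024-block space, in Python.  It is not
    SUBR_ALIST: it uses a flat list and a linear first-fit search rather
    than a radix tree with hinting, it makes no attempt to match the
    memory footprint quoted above, and the class name and the choice of
    24 for "a few dozen" are assumptions.

```python
class AlignedAllocator:
    """Toy allocator: power-of-2 sized, size-aligned runs of blocks."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.used = [False] * nblocks

    def alloc(self, count):
        """Return the first free start block aligned to count, or None."""
        if count <= 0 or count & (count - 1):
            raise ValueError("count must be a power of 2")
        # Only starts that are multiples of count are candidates, which
        # is what guarantees alignment to the allocation size.
        for start in range(0, self.nblocks - count + 1, count):
            if not any(self.used[start:start + count]):
                self.used[start:start + count] = [True] * count
                return start
        return None

    def free(self, start, count):
        """Release a run previously returned by alloc(count)."""
        if start % count:
            raise ValueError("start is not aligned to count")
        self.used[start:start + count] = [False] * count

# 1TB drive managed in 1GB units: 1024 blocks.
drive = AlignedAllocator(1024)
ones = [drive.alloc(1) for _ in range(24)]        # 1GB blocks
sixteens = [drive.alloc(16) for _ in range(24)]   # 16GB blocks
big = drive.alloc(128)                            # one 128GB block
```

    Every returned start is a multiple of the requested size, so a block
    header can always be found by the power-of-2 scan of the media.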

    --

    For HAMMER, the filesystem design I am working on, I am looking
    into being able to use ALISTs for extent management within the
    filesystem.  This research is very early, though, and I don't know
    how useful ALISTs will be in HAMMER due to the time domain snapshot
    mechanisms I want to implement.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




