Initial filesystem design synopsis.

Thomas E. Spanjaard tgen at
Wed Feb 21 15:55:20 PST 2007

Matthew Dillon wrote:

    The physical storage backing a filesystem is broken up into large
    1MB-4GB segments (64MB is a typical value).  Each segment is
    self-identifying and contains its own header, data table, and record
    table.  The operating system glues together filesystems and determines
    availability based on the segments it finds.
I think the more common term for this kind of thing is 'allocation group'.
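
As a rough illustration of what "self-identifying" could mean here, the sketch below packs and parses a per-segment header and groups the segments found by filesystem. The layout is an assumption for illustration only: the magic string, the field list, and the names `pack_header`, `unpack_header` and `group_segments` are hypothetical and do not come from the proposal.

```python
import struct
import uuid

# Hypothetical header layout (not from the proposal):
# magic, filesystem id, segment number, segment size,
# data table append index, record table prepend index.
HEADER = struct.Struct("<8s16sIQQQ")
MAGIC = b"SEGMENT0"  # made-up magic value


def pack_header(fs_id, seg_no, seg_size, data_append, rec_prepend):
    """Serialize one segment header to bytes."""
    return HEADER.pack(MAGIC, fs_id.bytes, seg_no, seg_size,
                       data_append, rec_prepend)


def unpack_header(raw):
    """Parse a header, rejecting anything that is not a segment."""
    magic, fs_bytes, seg_no, seg_size, data_append, rec_prepend = \
        HEADER.unpack(raw[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a segment header")
    return uuid.UUID(bytes=fs_bytes), seg_no, seg_size, data_append, rec_prepend


def group_segments(raw_headers):
    """Group discovered segments by filesystem id."""
    by_fs = {}
    for raw in raw_headers:
        fs_id, seg_no, *_ = unpack_header(raw)
        by_fs.setdefault(fs_id, []).append(seg_no)
    return {fs: sorted(nums) for fs, nums in by_fs.items()}
```

Because every header carries the filesystem id and its own segment number, the segments can be discovered in any order and still be assembled into a filesystem.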

    - The data table consists of pure data, laid out linearly in the forward
      direction within the segment.   Data blocks are variable-sized entities
      containing pure data, with no other identifying information, suitable
      for direct DMA.  The segment header has a simple append index for
      the data table.
And 'extent' for the variable-sized entities :).

    - The record table consists of fixed-sized records and a reference to
      data in the data table.  The record table is built backwards from
      the end of the segment.
Doesn't this prepending incur a significant performance penalty for
operations that walk the record table in chronological (or otherwise
FIFO) order?
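
To make the question concrete, here is a minimal in-memory sketch of one segment, with the data table growing forward and the record table growing backward from the end. It is a toy model under stated assumptions: the class `Segment`, the 64-byte record size, and the omission of the header's own space are all hypothetical choices, not part of the proposal.

```python
class Segment:
    """Toy model: data grows up from 0, records grow down from the end."""

    def __init__(self, size, record_size=64):
        self.size = size
        self.record_size = record_size   # assumed fixed record size
        self.data_append = 0             # next free byte in the data table
        self.rec_prepend = size          # lowest byte used by the record table
        self.records = {}                # record offset -> (data offset, length)

    def add(self, length):
        """Append a data extent and prepend the record that references it."""
        new_prepend = self.rec_prepend - self.record_size
        if self.data_append + length > new_prepend:
            raise MemoryError("segment full")
        data_off = self.data_append
        self.data_append += length
        self.rec_prepend = new_prepend
        self.records[new_prepend] = (data_off, length)
        return new_prepend

    def walk_chronological(self):
        """Yield records oldest first, which means descending offsets."""
        off = self.size - self.record_size
        while off >= self.rec_prepend:
            if off in self.records:
                yield off, self.records[off]
            off -= self.record_size
```

In this model the oldest record sits at the highest offset, so a chronological walk visits the record table in descending address order, which is the access pattern the question is about.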

    Record destruction creates holes in both the data table and the record
    table.  Any holes adjacent to the data table append point or the record
    table prepend point are immediately recovered by adjusting the 
    appropriate indices in the segment header.  The operating system may
    cache a record of non-adjacent holes (in memory) and reuse the space,
    and can also generate an in-memory index of available holes on the
    fly when space is very tight (which requires scanning the record table),
    but otherwise the recovery of any space not adjacent to the data table
    append point requires a performance reorganization of the segment.
I think these lists/trees should be kept sorted, at least on disk, for
performance reasons (random reads/writes on rotational media are a bummer
given current seek times).
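
A minimal sketch of the hole handling described in the quoted paragraph, for the data table only: holes are kept in a list sorted by offset, neighbouring holes are merged, and a hole that touches the append point is recovered immediately by moving the append index back. The class `HoleIndex` and its layout are hypothetical, and the sorted list stands in for whatever structure an implementation would actually use.

```python
import bisect


class HoleIndex:
    """In-memory index of data-table holes, kept sorted by offset."""

    def __init__(self, append_index):
        self.append_index = append_index  # data table append point
        self.holes = []                   # sorted list of (offset, length)

    def free(self, offset, length):
        """Record a freed extent, merging and recovering where possible."""
        bisect.insort(self.holes, (offset, length))

        # Merge holes that are directly adjacent to each other.
        merged = []
        for off, ln in self.holes:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + ln)
            else:
                merged.append((off, ln))
        self.holes = merged

        # A hole ending at the append point is recovered immediately.
        if self.holes:
            off, ln = self.holes[-1]
            if off + ln == self.append_index:
                self.append_index = off
                self.holes.pop()
```

With the list sorted by offset, only the last entry can ever touch the append point, so the recovery check is a single comparison. Non-adjacent holes stay in the index until they are reused or the segment is reorganized.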

Generally, I can't help but feel that the clustering/replication stuff 
needs to be separate from the 'actual on-disk' filesystem.

        Thomas E. Spanjaard
        tgen at