Initial filesystem design synopsis.

Matthew Dillon dillon at
Thu Feb 22 01:08:26 PST 2007

:>     - The record table consists of fixed-sized records and a reference to
:>       data in the data table.  The record table is built backwards from
:>       the end of the segment.
:Doesn't this prepending stuff incur a significant performance penalty 
:for operations that walk the record table in a chronological/otherwise 
:'fifo' ordered fashion?

    There is no reason why it would.  Records are small fixed-sized
    entities while disk I/O's tend to be much larger.  Disks cache
    whole tracks anyhow (and in fact most disks record sectors in
    reverse order on the track).  So, basically it doesn't matter
    with regards to accessing the record array.

:>     Record destruction creates holes in both the data table and the record
:>     table.  Any holes adjacent to the data table append point or the record
:>     table prepend point are immediately recovered by adjusting the 
:>     appropriate indices in the segment header.  The operating system may
:>     cache a record of non-adjacent holes (in memory) and reuse the space,
:>     and can also generate an in-memory index of available holes on the
:>     fly when space is very tight (which requires scanning the record table),
:>     but otherwise the recovery of any space not adjacent to the data table
:>     append point requires a performance reorganization of the segment.
:I think these lists/trees should be kept sorted, at least on-disk for 
:performance reasons (random reads/writes on rotational media is a bummer 
:given current seek times).

    You can't reorganize the record array without doing a performance
    reorganization of the segment (as defined by the document).  The reason
    is that the records in question represent information that we cannot
    afford to lose, and there is no easy way to order disk I/O to maintain
    an array in sorted order without potentially losing data if an
    operating system crash were to occur during the I/O.  (Note I'm not
    talking about a disk crash here, I'm talking about an OS crash).

    Also, the concept of a 'sorted order' is not entirely applicable to
    a historical data store where the historical (deleted) records will
    overlap the current records in many different ways.  Trying to keep
    track of it all in a single sort leads to severe inefficiencies when
    doing lookups on files with a lot of historical records.

    Instead one must maintain an index of the records separate from the
    record list itself (maybe even more then one index, in fact).  The 
    index can be in-memory or on-disk.  There's a nice trick one can do 
    with indexes... there is no need to insert new records into an index
    when they are created.  Instead you can also maintain a very short
    list of, say, 8 unindexed records which are always checked against 
    in any search and then insert the unindexed records into the actual
    index in-bulk, with cache locality of reference for both code and
    data (which is far more efficient then inserting them one at a time
    when they were originally prepended).

:Generally, I can't help but feel that the clustering/replication stuff 
:needs to be separate from the 'actual on-disk' filesystem.
:         Thomas E. Spanjaard
:         tgen at

    One thing I've learned from doing the Backplane database is that
    you can't separate the storage topology from the clustering algorithms
    without creating a huge mess.   They have to be basically compatible
    with each other for things to operate smoothly.


More information about the Kernel mailing list