Initial filesystem design synopsis.
dillon at apollo.backplane.com
Thu Feb 22 01:08:26 PST 2007
:> - The record table consists of fixed-sized records and a reference to
:> data in the data table. The record table is built backwards from
:> the end of the segment.
:Doesn't this prepending stuff incur a significant performance penalty
:for operations that walk the record table in a chronological/otherwise
:'fifo' ordered fashion?
There is no reason why it would. Records are small fixed-sized
entities while disk I/O's tend to be much larger. Disks cache
whole tracks anyhow (and in fact most disks record sectors in
reverse order on the track). So, basically it doesn't matter
with regards to accessing the record array.
:> Record destruction creates holes in both the data table and the record
:> table. Any holes adjacent to the data table append point or the record
:> table prepend point are immediately recovered by adjusting the
:> appropriate indices in the segment header. The operating system may
:> cache a record of non-adjacent holes (in memory) and reuse the space,
:> and can also generate an in-memory index of available holes on the
:> fly when space is very tight (which requires scanning the record table),
:> but otherwise the recovery of any space not adjacent to the data table
:> append point requires a performance reorganization of the segment.
:I think these lists/trees should be kept sorted, at least on-disk for
:performance reasons (random reads/writes on rotational media is a bummer
:given current seek times).
You can't reorganize the record array without doing a performance
reorganization of the segment (as defined by the document). The reason
is that the records in question represent information that we cannot
afford to lose, and there is no easy way to order disk I/O to maintain
an array in sorted order without potentially losing data if an
operating system crash were to occur during the I/O. (Note I'm not
talking about a disk crash here, I'm talking about an OS crash).
Also, the concept of a 'sorted order' is not entirely applicable to
a historical data store where the historical (deleted) records will
overlap the current records in many different ways. Trying to keep
track of it all in a single sort leads to severe inefficiencies when
doing lookups on files with a lot of historical records.
Instead one must maintain an index of the records separate from the
record list itself (maybe even more then one index, in fact). The
index can be in-memory or on-disk. There's a nice trick one can do
with indexes... there is no need to insert new records into an index
when they are created. Instead you can also maintain a very short
list of, say, 8 unindexed records which are always checked against
in any search and then insert the unindexed records into the actual
index in-bulk, with cache locality of reference for both code and
data (which is far more efficient then inserting them one at a time
when they were originally prepended).
:Generally, I can't help but feel that the clustering/replication stuff
:needs to be separate from the 'actual on-disk' filesystem.
: Thomas E. Spanjaard
: tgen at netphreax.net
One thing I've learned from doing the Backplane database is that
you can't separate the storage topology from the clustering algorithms
without creating a huge mess. They have to be basically compatible
with each other for things to operate smoothly.
More information about the Kernel