Initial filesystem design synopsis.
dillon at apollo.backplane.com
Wed Feb 21 14:38:11 PST 2007
Here is my initial outline of the filesystem design. It is open
for discussion. Please feel to ask questions for anything you do
not understand. I do not intend to start coding anything for at
least two weeks.
There are currently two rough spots in the design. First, how to
handle segment overflows in a multi-master environment. Such overflows
can occur when the individual masters or slaves have different historical
data retention policies. Second, where to store the regeneratable
Plus, I need a name for this baby. I can't use DFS, however much I
want to, because the term is already over-used.
<dillon at backplane.com>
- On-demand filesystem check and recovery. No need to scan the entire
filesystem before going live after a reboot.
- Infinite snapshots (e.g. on the 30 second filesystem sync), ability
to collapse snapshots for any time interval as a means of
- Multi-master operation, including the ability to self-heal a
corrupted filesystem by accessing replicated data. Multi-master
means that each replicated store can act as a master. New
filesystem ops can run independantly on any of the replicated stores
and will flow to the others. This is done by giving the controller
of each store a certain amount of guarenteed free space that can be
filled without having to notify other controllers.
- Infinite logless Replication. No requirement to keep a log of changes
for the purposes of replication, meaning that replication targets can
be offline for 'days' without effecting performance or operation.
('mobile' computing, but also replication over slow links for backup
purposes and other things). Certain information is still logged,
but only to streamline performance.
- 64 bit file space, 64 bit filesystem space. No space restrictions
- Reliably handles data storage for huge multi-hundred-terrabyte
filesystems without fear of unrecoverable corruption.
- Cluster operation - ability to commit data to locally replicated
store independantly of other replication nodes, with access governed
by cache coherency protocols.
- Independant index. Data is laid out in a highly recoverable fashion,
independant of index generation. Indexes can be regenerated from
scratch and thus indexes can be updated asynchronously.
The physical storage backing a filesystem is broken up into large
1MB-4GB segments (64MB is a typical value). Each segment is
self-identifying and contains its own header, data table, and record
table. The operating system glues together filesystems and determines
availability based on the segments it finds.
All segments on a block device are the same size (power-of-2 restricted)
and the OS can probe the disk without any other information simply by
locating a segment header, starting with the largest possible segment
size. Segment size is typically chosen such that a scan of all segment
headers takes no longer then a few seconds.
- The segment header contains all the information required to associate
the segment with a filesystem. Certain information in the segment
header is updated with an ordered write to 'commit' other work already
flushed out into the segment. The segment header also contains a
bit indicating whether the segment is 'open' or not, to aid in
The segment ID is stored in the segment header, allowing segments
to be easily relocated. That is, the segment's location in the
physical backing store is irrelevant.
- The data table consists of pure data, laid out linearly in the forward
direction within the segment. Data blocks are variable-sized entities
containing pure data, with no other identifying information, suitable
for direct DMA. The segment header has a simple append index for
the data table.
- The record table consists of fixed-sized records and a reference to
data in the data table. The record table is built backwards from
the end of the segment. The segment header has a simple pre-append
index for the record table. Each record consists of:
* an object identifier (e.g. inode number)
* creation transid
* deletion transid (updated in-place)
* indexing key (offset or filename key)
* data table reference (offset, bytes) (up to a 1GB contiguous ref)
* integrity check
All data elements are completely identified by the record table. All
modifications are historical in nature... that is, no actual data is
destroyed and one can access a 'snapshot' of the segment as of any
transaction id (i.e. timestamp) simply by ignoring those records
marked as deleted prior to the specified time or created after
the specified time.
Certain updates to the object table occur in-place. In particular,
the deletion transaction id is updated in-place. However, such
updates are not considered 'committed' until the segment header itself
is updated with the latest committed transaction id and a recovery
operation will undo any deletion transaction id greater then the
committed transaction id in the segment header, as well as free
any uncommitted objects and their related data.
The entire filesystem can be recovered from just the record and data
tables. Indexing and cross-segment spanning is described in later
Physical space recovery:
Physical space recovery requires actually destroying records marked
as deleted. Snapshots can be collapsed by destroying records whos
creation AND deletion id's fall within the collapsed space. The oldest
data is freed up by destroying records whos deletion id is less then
the terminal timestamp.
Record destruction creates holes in both the data table and the record
table. Any holes adjacent to the data table append point or the record
table prepend point are immediately recovered by adjusting the
appropriate indices in the segment header. The operating system may
cache a record of non-adjacent holes (in memory) and reuse the space,
and can also generate an in-memory index of available holes on the
fly when space is very tight (which requires scanning the record table),
but otherwise the recovery of any space not adjacent to the data table
append point requires a performance reorganization of the segment.
Locating Data objects, and the Master Record.
Data objects are basically the amalgamation of all records with
the same 64 bit object identifier. The top N bits of this identifier
indicate the segment the master record for the data object resides in.
All 64 bits are made available to userland.
The filesystem needs to be free to relocate the master record for
a data object on the fly. Relocation is a dynamic process whereby
the master record in a segment becomes a forwarding record to another
segment. Any references to a forwarding record is adjusted on the fly
in order to collapse the chain, and any intermediate forwarding records
can be physically destroyed once all references to them have been
adjusted. However, the ORIGINAL forwarding record must be retained
for all time in order to maintain the uniqueness of the originally
assigned user-visible inode number and to give us a way to locate
the master record given the inode number. We cannot change the inode
number. Overhead is minimal.
A forwarding record is created in two situations: (1) To move the
master record to improve performance and (2) If the current segment
does not have sufficient space to increase the size the master record
if it becomes necessary to do so.
A forwarding record is a degenerate case of a master record and the
forwarding information is embedded in the record itself, with no
data table reference at all. The space used is only the space required
to store a record table entry.
The master record for a data object is a special record in the record
table. It is special because it is NOT part of the historical data
store. The creation and deletion transaction ids represent the creation
or deletion of the entire data object, and the data block contains
information on where to find the other bits and pieces of the data
object (in particular, if the data object spans more then one segment).
A CORRUPTED MASTER RECORD CAN BE COMPLETELY RECONSTRUCTED BY SCANNING
The master record can be thought of as a persistent cache. It gives
the filesystem a way to quickly locate all the segments that might
contain records for the data object and reserves a limited amount of
space for the filesystem to store direct references to data residing
in the same segment as the master record.
Master record updates are fairly rare. For the most part a master
record must only be updated if a file or directory grows into a
Segments can be repacked in order to clean up fragmentation and
reorganize data and records for more optimal access. Repacking is
accomplished by copying the segment's data into a wholely new
segment on the physical disk, then destroying the old segment.
Since a segment is identified by its header the actual location of
the segment on the physical media is irrelevant.
For example, master records can wind up strewn about the record
table for a segment. Repacking would pack all of those master
records next to each other.
Similarly, a file or directory might have bits and pieces of data
strewn about a segment. Repacking would de-fragment those entities,
as well as pack together the data related to small files and
separate out the larger chunks related to larger files.
Remember that since the actual physical location of a segment is
independant of the segment identifier (typically an 8 or 16 bit
number), the allocation of segment numbers does not have to be
linear. The filesystem will typically be generous in its allocation
of segment numbers in order to allow for spill over growth of files
into logically adjacent segments, thus simplifying the segment
range stored in the master record that describes which segments
might contain data records for a data object. For example,
the first segment allocated by the filesystem when using a 16 bit
segment id would not be 0x0000 or 0xffff, but probably 0x8000. The
next segment allocated (when not doing a spill-over) might be 0x4000
or 0xc000, and so forth. The entire segment space is used even
if there are fewer actual (physical) segments.
Large cross-segment objects:
A Data object can wind up being far larger then a segment. For that
matter, even a small data object with a lot of history can wind up being
far larger then a segment.
The filesystem does its best to organize the data for such objects
to reduce the size of the master records required to describe them
and to separate historical data from current data, to maintain the
performance of the live filesystem.
An object's master record generally describes the segments containing
the data for the object, and may contain direct references to data
in the same segment as the master record (an optimization or performance
reorganization that is desireable for files smaller then a single
However, data objects can grow to be many records due to fragmentation,
simply being large files, or due to the retention of a great deal of
history. The records pertaining to these objects may have to be indexed
beyond the space the filesystem is willing to reserve in the master
record in order to improve access.
To make it clear, indexing is an optimization. The index is not
required to recover a segment or to recover a filesystem. The intent
is for the master record to always contain sufficient information
to reduce the number I/O's required to resolve a file access request
to a reasonable number, even if no index is available.
Indexing can occur in three places:
* First, it can occur in the segment itself to organize the records
in that segment. Such indexes are usually persistently cached in the
dead space between the data table and the record table, and the
filesystem heuristically determines how much space is needed for
that. If the data table or the record table grows into the index
the filesystem can choose to relocate, regenerate, or shift the index
data, or to disallow the growth (extend the data object into a new
segment). These heuristics are fairly simple.
* Second, it can occur in the master record to roughly organize the
data space... for example so the filesystem does not have to check
all segments if a particular file offset (or offset range) and all
of its history is known to reside in just one. The filesystem
typically is only willing to allow so much space in the master record
for a data object to store such an index. If this level of index
becomes too large it is basically simplified in-place and starts
requiring the use of the per-segment index to further resolve the
location of data records for the object.
* Third, it can be generated and cached in memory. When dealing with
very large multi-segment files it may be beneficial to scan
the relatively few records in the record table for the related segments
and simply index them in memory, rather then store an index on-disk.
For example if you are using 64MB segments and have a 20GB file,
literally only 320 disk accesses (with a few data records read in
each access) are required to construct an index of the entire 20GB
file and the index itself would require very little memory.
Indexes are asynchronous entities. The 'official' filesystem store is
the data table and the record table. The host can cache updates in
memory, and asynchronously update the performance-improving index.
Indexes residing in segments are regenerated if the segment is marked
open on initial access (due to an unclean shutdown). This allows
persistent per-segment indexes to be updated without requiring any
particular transactional disk synchronization or block ordering.
Indexes are generally implemented using a B+Tree structure.
Data and record elements are only directly referenced by their offset
within a segment when referenced from within that same segment. This
means that replication is possible on a segment-by-segment basis.
Segment headers contain asynchronously updated record update log areas
for deletion events (because these modify the deletion timestamp in
an existing record rather then append a new record). These log areas
are of limited size and can overflow under normal operation. They are
ONLY used to streamline replication. If a replication target is not
fast enough, it will have to resynchronize by scanning the records
on the source (which itself doesn't usually take very long to do so it
isn't considered a big deal). But otherwise it can check the log area
and simply pick up where it left off. The intention is to completely
support both very slow replication slaves and mostly off-line slaves.
The only thing that is actually replicated are the record table
entries and (with the exception of a master record) the data table
entries. The data table entry for a master record is never replicated,
but (re)generated by the replication target. The replication target
is responsible for maintaining its own indexes and managing its own
physical segment space. This gives any replication target a great
deal of robustness and flexibility.
Also note that the physical location of the logical segments on the
replication target is entirely up to the replication target. Replication
is done logically, not physically.
Segment Expansion - Virtual segments
A replication target may wish to retain historical data that the
replication source has not, or destroy historical data that the
replication source intends to retain. This can result in the replication
target running out of room in a logical segment, and can also complicate
recovery operations if the replication source needs to self-heal a
corrupted segment. This presents a problem because all filesystem
accesses and all replication and recovery activities are segment-centric.
To deal with this situation a logical segment can be turned into a
virtual segment. A virtual segment is made up of multiple logical
segments, all with the same identifier plus additional information
in the segment distinguishing the pieces from each other. Each logical
segment is maintained independantly but API accesses check both
(or all N such segments) when doing lookups, and write to whichever
one has free space.
Virtual segments are pretty standard fare on replication slaves which
are intended to archive a great deal more history then replication
masters, but are intended to be very rare features of replication
masters. If a virtual segment must be created on a replication master
the filesystem does all it can to shift data off the virtual segment
in order to be able to collapse it back into a logical segment.
Virtual segmentation is not space efficient.
Files and Directories
As has been described, a data object (which can be a file or directory
or other filesystem entity) is identified by a data object id, which
is effectively its inode, and a 64 bit key field. The key field is
typically a base offset for a file or a sortable key for a directory
entry. Negative numbers are used for meta-data. For example, -1 will
be used for the inode data & stat information. Other negative numbers
will be used to support other meta-data such as ACLs.
Since records support variable-length data, the data storage for a file
is effectively extent-based. Minimum granularity will be something like
Files are thus fairly straight forward.
From the point of view of the filesystem each directory entry will
be a data record. The data record will basically just contain the
inode number, type, and filename. It will be variable-length insofar
as the filename is variable length.
Most files will be representable in a single extent (or at least
can be optimized into a single extent), and so will not depend very
heavily on indexing.
Directory lookups WILL depend on the indexing of the directory records
for efficiency, otherwise every record would have to be scanned to
lookup a filename. In this regard a B+Tree is probably the best
Inode Number Uniqueness
Inode numbers are 64 bits where the top N bits represent the segment
in which the inode was originally allocated, identifying the master
record for the data object. Inode numbers do not change even if
the master record is relocated and the original master record replaced
with a forwarding record. The forwarding record must be retained
in order to guarentee the uniqueness of the inode number.
This also allows inode numbers to be allocated on a per-segment basis,
with 48 bits of uniqueness (one trillion file creates & deletes) within
The filesystem will always allocate new inode numbers using a sequence
number maintained in the segment header. When all 48 bits are used up,
however, the filesystem must check new inode numbers against existing
numbers (including any deleted historical objects).
Inode number uniqueness over all time can no longer be guarenteed, but
inode number uniqueness over a period of time CAN still be guarenteed
by retaining the master record for deleted files for a period of time
EVEN AFTER THEY HAVE BEEN DESTROYED. The system operator can specify
the retention time with the caveat that certain benchmarks might cause
the disk to fill up or become somewhat less efficient even when
historical data is purged.
It is recommended that any exported or clustered filesystems set the
inode number retention time to at least a week. Inode number uniqueness
is crucial to cluster operation.
More information about the Kernel