Plans for 1.8+ (2.0?)

Matthew Dillon dillon at apollo.backplane.com
Wed Jan 31 15:21:37 PST 2007


:>     One big advantage of a from-scratch design is that I would be
:>     able to address the requirements of a clustered operating system
:>     in addition to the requirements of multi-terabyte storage media.
:...
:is parallel file system in this scope, or it is the requirement you
:have for cluster OS? what would be the advantage of designing a new
:one compared to the current and existing one, considering the
:advantage of the dfly kernel for SSI os to other kernel? put in
:another way, is a new fs that will take advantage of the kernel
:advantage superior to others,  from what we can predict now.
:
:even too much questions, hope they are all relevant.
:
:Noah

    No, it's a lot more complex than that.  There are three basic issues:

    * Redundancy in a heavily distributed environment.

    * Transactional Consistency.

    * Cache Coherency and conflict management.

    The best way to think about what is needed is to consider the problem
    from two points of view:  First, from the point of view of a threaded
    program whose threads might be distributed across a number of machines.
    Second, from the point of view of disparate programs which may access
    the same data store and may generate unsynchronized conflicts (for
    example, if you type 'cp xxx blah' in one shell and 'cp yyy blah' in
    another shell at the same time).

    In the first case there is a definite order.  If the threaded program
    writes "xyz" to a file then it fully expects to see "xyz" in that
    file if it reads it back in a different thread.

    In the second case there is an implied conflict, but no failure, and
    an indefinite order... the file 'blah' might be a copy of the file
    'xxx' or a copy of the file 'yyy', or some mess in between, because
    you ran the operations at the same time and there is no implied order.
    You HAVE TO HAVE THIS capability or the filesystem will not scale
    to the cluster.

	cp xxx blah &
	cp yyy blah &

    --

    If this sounds like the Backplane Database that some people may know
    about (that I wrote a few years ago), then you would be right.  The
    problems that have to be solved are exactly the same problems that
    that database solved, but in a filesystem environment instead of a
    database environment.

    For a distributed filesystem to handle these cases is a very complex
    problem.  The easiest solution is the one I used for the Backplane
    Database... you make the filesystem infinitely snapshottable based on
    a transaction ID.  So you can say 'show me what the filesystem looked
    like at transaction ID XYZ'.  Different redundant data entities 
    within the filesystem will be synchronized to different transaction
    IDs.
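
    As a rough illustration of 'show me what the filesystem looked like
    at transaction ID XYZ', here is a minimal sketch of a store that
    keeps every version of a key tagged with the transaction ID that
    wrote it.  The class and method names (SnapshotStore, write, read,
    as_of) are made up for this example and are not part of any existing
    design; a real filesystem would index blocks or records rather than
    whole values.

```python
import bisect


class SnapshotStore:
    """Toy key/value store where every write gets a new transaction ID."""

    def __init__(self):
        self.next_tid = 1
        # key -> (list of transaction IDs, list of values), IDs ascending
        self.history = {}

    def write(self, key, value):
        """Record a new version of key and return its transaction ID."""
        tid = self.next_tid
        self.next_tid += 1
        tids, values = self.history.setdefault(key, ([], []))
        tids.append(tid)
        values.append(value)
        return tid

    def read(self, key, as_of=None):
        """Return the value of key as of transaction ID as_of (default: latest)."""
        tids, values = self.history.get(key, ([], []))
        if as_of is None:
            idx = len(tids)
        else:
            # Number of versions written at or before as_of.
            idx = bisect.bisect_right(tids, as_of)
        return values[idx - 1] if idx else None


store = SnapshotStore()
t1 = store.write("blah", "xxx")
t2 = store.write("blah", "yyy")
print(store.read("blah", as_of=t1))  # the file as it looked at t1
print(store.read("blah"))            # the most recent version
```

    Nothing is ever overwritten in this sketch, which is what makes any
    past transaction ID a valid snapshot to read from.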

    Accesses to the filesystem which must be transactionally consistent in 
    series (write "xyz" and when you read it back you expect "xyz" back)
    simply get a new transaction ID representing the new synchronization
    point every time the program makes a modification to the filesystem.
    The sequence of transaction IDs represents the serialization within
    that domain.  The only thing that would need to be fully coherent
    between distributed threads would thus be the transaction ID, and you
    let the filesystem sort out where to actually get the data, and thus
    the access to the filesystem would also wind up being fully coherent.
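
    A small sketch of that idea, under the assumption that each
    redundant copy knows the highest transaction ID it has applied: a
    reader carries only the transaction ID it needs, and the lookup
    skips any copy that has not caught up that far.  Replica,
    pick_replica and required_tid are hypothetical names used only for
    this example.

```python
class Replica:
    """One redundant copy of the data, synchronized to some transaction ID."""

    def __init__(self, name):
        self.name = name
        self.synced_tid = 0   # highest transaction ID applied to this copy
        self.data = {}

    def apply(self, tid, key, value):
        self.data[key] = value
        self.synced_tid = tid


def pick_replica(replicas, required_tid):
    """Return the first replica synchronized at least to required_tid, or None."""
    for replica in replicas:
        if replica.synced_tid >= required_tid:
            return replica
    return None


local, remote = Replica("local"), Replica("remote")
remote.apply(1, "file", "xyz")   # the write has reached only the remote copy

# A thread in the same domain carries transaction ID 1, so the stale
# local copy is skipped and the read is served from the remote copy.
chosen = pick_replica([local, remote], required_tid=1)
print(chosen.name, chosen.data["file"])

# A reader with no ordering requirement can be served by the local copy.
print(pick_replica([local, remote], required_tid=0).name)
```

    When no copy is recent enough the function returns None; a real
    implementation would presumably block or go over the network at
    that point.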

    In a live filesystem, which is a very dynamic environment, some portions
    of the dataset will always be more synchronized than other portions.
    Synchronization would be a continuous background operation so ultimately
    all redundant data entities would become fully synchronized, or 
    (in a dynamically changing environment) fully synchronized 'as of'
    a particular transaction ID (some timestamp in the past), which is
    continuously playing catch-up.
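
    The catch-up behaviour can be sketched as a copy that replays an
    ordered log of changes a little at a time.  This assumes the
    changes are available as a list of (transaction ID, key, value)
    records, which is an assumption of the example rather than a
    description of the on-disk format; Mirror and catch_up are
    hypothetical names.

```python
class Mirror:
    """A redundant copy that is brought up to date in the background."""

    def __init__(self):
        self.synced_tid = 0   # the copy is consistent as of this transaction ID
        self.data = {}

    def catch_up(self, log, max_records):
        """Apply up to max_records log entries newer than synced_tid, in order."""
        applied = 0
        for tid, key, value in log:
            if tid <= self.synced_tid:
                continue          # already applied
            if applied == max_records:
                break             # out of budget for this pass
            self.data[key] = value
            self.synced_tid = tid
            applied += 1
        return applied


log = [(1, "a", "v1"), (2, "b", "v1"), (3, "a", "v2"), (4, "c", "v1")]
mirror = Mirror()

mirror.catch_up(log, max_records=2)
print(mirror.synced_tid, mirror.data)   # consistent as of an older transaction

log.append((5, "b", "v2"))              # the live side keeps changing
mirror.catch_up(log, max_records=2)
print(mirror.synced_tid, mirror.data)   # closer, but still playing catch-up
```

    After every pass the copy is a complete picture of the data as of
    synced_tid, even though it is not the latest picture.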

    Unsynchronized elements that generate conflicts would have to be
    resolved asynchronously.  For example the copy case above.

	cp xxx blah &	(unsynchronized conflicts generated)
	cp yyy blah &	(unsynchronized conflicts generated)
	wait
	cat blah	(conflicts must be resolved here)
	cat blah	(so you get the same results if you repeat the command)
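
    One way to picture the requirement that repeated reads agree is
    the sketch below.  The resolution rule (the highest writer ID
    wins) is purely a placeholder chosen for this example; the only
    property it is meant to show is that the conflict is settled once,
    at the first read, and the outcome is then remembered.
    ConflictedFile and writer_id are hypothetical names.

```python
class ConflictedFile:
    """Toy file that accepts unordered writes and resolves them on read."""

    def __init__(self):
        self.pending = []     # unordered candidates: (writer_id, contents)
        self.resolved = None

    def write(self, writer_id, contents):
        self.pending.append((writer_id, contents))
        self.resolved = None  # a new write reopens the conflict

    def read(self):
        if self.resolved is None and self.pending:
            # Placeholder rule: the candidate with the highest writer_id wins.
            winner = max(self.pending)
            self.resolved = winner[1]
            self.pending = [winner]
        return self.resolved


blah = ConflictedFile()
blah.write(2, "copy of yyy")   # cp yyy blah &
blah.write(1, "copy of xxx")   # cp xxx blah &

first = blah.read()            # conflict is resolved here
second = blah.read()           # and the same answer comes back again
print(first == second, first)
```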

    The advantage of this type of setup is that not all data accesses 
    need to be transactionally consistent.  Two programs operating
    independently do not have to be synchronized with each other.  For
    example, if you were to do this:

	% cp xxx blah &
	...
	% tail -f blah

    Then the 'cp' program and the 'tail' program would not have to be
    synchronized with each other via the shell (at least not until the
    'cp' program finishes and the shell reports that it has finished).
    The tail program can be operating with older transaction IDs and be
    playing catch-up.  Things like flock and lockf would force 
    synchronization between disparate programs.

    Since you get a natural separation of the data consistency requirements,
    the actual version of the data accessed can be more localized (that is,
    use a slightly older and possibly more localized snapshot of the
    filesystem).  When the data must be synchronized the filesystem might
    have to go over the network to get the most up-to-date data, or might
    have to block waiting for synchronization to occur.

    --

    In any case, it is a very complex issue.  The gist of it, however, is
    that you need to be able to store data redundantly without actually
    having to wait for all redundant stores to be updated before being
    able to access the data.

    Consider the case where you have a 4TB filesystem representing your
    life's work on your local computer.  This is going to be the case in a
    few years (especially with photos, video, music, etc...).  We are becoming
    a computer-centric society.  If all the data is in just one place, it
    is VERY fragile.  You lose the hard drive, you lose the data.  RAID
    systems don't really address the problem... now you lose the data if
    you have a fire.  Distributed RAID-like systems address the problem
    but currently require fast links to really be usable, and we still
    have a recovery problem (as in time-wise) when things go wrong or the
    filesystem gets corrupted (meaning that RAID alone does not prevent 
    filesystem corruption), so the data is still fragile.

    In a snapshot system you always have access to some version of the
    data when things blow up, it is just a matter of how far back in
    time you have to go, and whether you are willing to wait for the
    filesystem to repair itself to a more recent transaction date.  In
    other words, you have a lot of choices and a guarantee that the
    worst that happens is you lose a few (minutes, hours, days) of
    work rather than the entire filesystem.

    More importantly, you can have redundancy implemented over slow
    links... a continuous backup scheme that operates entirely 
    asynchronously and provides an entirely consistent snapshot of
    the filesystem on the backup as-of some date in the past (constantly
    playing catch-up), and unlike the journaling that we have now,
    this feature would not affect filesystem performance.  That's really
    the holy grail right there.

    ZFS addresses some of these problems, in particular it modernizes the
    idea of having RAID-like redundancy, but it doesn't address the
    need not only for a distributed filesystem, but for a distributed
    filesystem that is capable of operating robustly over slow links
    (aka 'the internet'), nor does it address the issue of operating
    coherently in a distributed manner, or the issue of continuous
    backups which can be accessed at any time in a fully consistent manner
    and which are integrated with, but also independent (access wise) of
    the filesystem.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




