Plans for 1.8+ (2.0?)
Matthew Dillon
dillon at apollo.backplane.com
Wed Jan 31 15:21:37 PST 2007
:> One big advantage of a from-scratch design is that I would be
:> able to address the requirements of a clustered operating system
:> in addition to the requirements of multi-terabyte storage media.
:...
:is parallel file system in this scope, or it is the requirement you
:have for cluster OS? what would be the advantage of designing a new
:one compared to the current and existing one, considering the
:advantage of the dfly kernel for SSI os to other kernel? put in
:another way, is a new fs that will take advantage of the kernel
:advantage superior to others, from what we can predict now.
:
:even too much questions, hope they are all relevant.
:
:Noah
No, it's a lot more complex than that. There are three basic issues:
* Redundancy in a heavily distributed environment
* Transactional consistency
* Cache coherency and conflict management
The best way to think about what is needed is to consider the problem
from two points of view: First, from the point of view of a threaded
program whose threads might be distributed across a number of machines.
Second, from the point of view of disparate programs which may access
the same data store and may generate unsynchronized conflicts (for
example, if you type 'cp xxx blah' in one shell and 'cp yyy blah' in
another shell at the same time).
In the first case there is a definite order. If the threaded program
writes "xyz" to a file then it fully expects to see "xyz" in that
file if it reads it back in a different thread.
In the second case there is an implied conflict, but no failure, and
an indefinite order... the file 'blah' might be a copy of the file
'xxx' or a copy of the file 'yyy', or some mess in between, because
you ran the operations at the same time and there is no implied order.
You HAVE TO HAVE THIS capability or the filesystem will not scale
to the cluster.
cp xxx blah &
cp yyy blah &
--
If this sounds like the Backplane Database that some people may know
about (that I wrote a few years ago), then you would be right. The
problems that have to be solved are exactly the same problems that
that database solved, but in a filesystem environment instead of a
database environment.
For a distributed filesystem to handle these cases is a very complex
problem. The easiest solution is the one I used for the Backplane
Database... you make the filesystem infinitely snapshottable based on
a transaction ID. So you can say 'show me what the filesystem looked
like at transaction ID XYZ'. Different redundant data entities
within the filesystem will be synchronized to different transaction
IDs.
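
A minimal sketch of the 'show me what the filesystem looked like at
transaction ID XYZ' idea, under stated assumptions: the SnapshotStore
class and its write/read_as_of methods are hypothetical names invented
for illustration, everything lives in memory, and a single counter
hands out transaction IDs. It is not how the Backplane Database or any
real filesystem is implemented.

```python
import bisect


class SnapshotStore:
    """Toy versioned store (hypothetical): each write gets a new transaction ID."""

    def __init__(self):
        self.next_tid = 1
        self.tids = {}  # path -> ascending list of transaction IDs
        self.data = {}  # path -> contents, parallel to self.tids[path]

    def write(self, path, content):
        """Record a new version of path and return its transaction ID."""
        tid = self.next_tid
        self.next_tid += 1
        self.tids.setdefault(path, []).append(tid)
        self.data.setdefault(path, []).append(content)
        return tid

    def read_as_of(self, path, tid):
        """Return path as it looked at transaction ID tid, or None if absent."""
        tids = self.tids.get(path, [])
        # Index of the newest version whose transaction ID is <= tid.
        i = bisect.bisect_right(tids, tid)
        return self.data[path][i - 1] if i else None


store = SnapshotStore()
t1 = store.write("blah", "xyz")
t2 = store.write("blah", "abc")
print(store.read_as_of("blah", t1))
print(store.read_as_of("blah", t2))
```

Old versions are never overwritten, so any earlier transaction ID
remains a valid, self-consistent view of the store.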
Accesses to the filesystem which must be transactionally consistent in
series (write "xyz" and when you read it back you expect "xyz" back)
simply get a new transaction ID representing the new synchronization
point every time the program makes a modification to the filesystem.
The sequence of transaction IDs represents the serialization within
that domain. The only thing that would need to be fully coherent
between distributed threads would thus be the transaction ID, and you
let the filesystem sort out where to actually get the data, and thus
the access to the filesystem would also wind up being fully coherent.
In a live filesystem, which is a very dynamic environment, some portions
of the dataset will always be more synchronized than other portions.
Synchronization would be a continuous background operation so ultimately
all redundant data entities would become fully synchronized, or
(in a dynamically changing environment) fully synchronized 'as of'
a particular transaction ID (some timestamp in the past), which is
continuously playing catch-up.
Unsynchronized elements that generate conflicts would have to be
resolved asynchronously. For example the copy case above.
cp xxx blah & (unsynchronized conflicts generated)
cp yyy blah & (unsynchronized conflicts generated)
wait
cat blah (conflicts must be resolved here)
cat blah (so you get the same results if you repeat the command)
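
One way to picture the 'resolve at read time, then stay stable'
behavior is sketched below. The resolution rule is purely an
assumption made for illustration (the highest (transaction ID, origin)
pair wins); the text above does not say which rule would be used. The
ConflictedFile class and its method names are likewise hypothetical.

```python
class ConflictedFile:
    """Toy model (hypothetical) of a file with unsynchronized writers."""

    def __init__(self):
        self.pending = []     # unresolved (tid, origin, content) candidates
        self.resolved = None  # winning candidate after the last resolution

    def write_unsynchronized(self, tid, origin, content):
        """Accept a write without coordinating with other writers."""
        self.pending.append((tid, origin, content))

    def read(self):
        """Resolve any outstanding conflicts, then return stable content."""
        candidates = list(self.pending)
        if self.resolved is not None:
            candidates.append(self.resolved)
        if candidates:
            # ASSUMED rule: highest (tid, origin) wins. It is deterministic,
            # so the order in which the writes arrived does not matter.
            self.resolved = max(candidates, key=lambda c: (c[0], c[1]))
            self.pending = []
        return self.resolved[2] if self.resolved else None


blah = ConflictedFile()
blah.write_unsynchronized(7, "nodeA", "contents of xxx")
blah.write_unsynchronized(7, "nodeB", "contents of yyy")
first = blah.read()
second = blah.read()
print(first == second)
```

The writers never wait on each other; the cost of picking a winner is
paid once, by the first reader that needs a definite answer.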
The advantage of this type of setup is that not all data accesses
need to be transactionally consistent. Two programs operating
independently do not have to be synchronized with each other. For
example, if you were to do this:
% cp xxx blah &
...
% tail -f blah
Then the 'cp' program and the 'tail' program would not have to be
synchronized with each other via the shell (at least not until the
'cp' program finishes and the shell reports that it has finished).
The tail program can be operating with older transaction IDs and be
playing catch-up. Things like flock and lockf would force
synchronization between disparate programs.
Since you get a natural separation of the data consistency requirements,
the actual version of the data accessed can be more localized (that is,
use a slightly older and possibly more localized snapshot of the
filesystem). When the data must be synchronized the filesystem might
have to go over the network to get the most up-to-date data, or might
have to block waiting for synchronization to occur.
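
The trade-off between a nearby, slightly stale copy and the most
up-to-date one can be sketched as follows. This is a toy model under
stated assumptions: the Replica class, the read function and its
required_tid parameter are hypothetical, and the replica list is
assumed to be ordered nearest first.

```python
class Replica:
    """Toy replica (hypothetical) synchronized up to some transaction ID."""

    def __init__(self, name, synced_tid, files):
        self.name = name
        self.synced_tid = synced_tid  # highest transaction ID applied here
        self.files = files            # path -> content as of synced_tid


def read(replicas, path, required_tid=0):
    """Read path from the nearest replica that has reached required_tid.

    replicas is ASSUMED to be ordered nearest first. Returns a
    (replica name, content) pair, or None when no replica has caught up
    yet, in which case a real system would block until one has.
    """
    for replica in replicas:
        if replica.synced_tid >= required_tid:
            return replica.name, replica.files.get(path)
    return None


local = Replica("local", 5, {"blah": "old"})
remote = Replica("remote", 9, {"blah": "new"})

# A relaxed reader (like 'tail -f') accepts whatever the local copy has.
print(read([local, remote], "blah"))
# A reader that must see transaction 9 has to go over the network.
print(read([local, remote], "blah", required_tid=9))
```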
--
In any case, it is a very complex issue. The gist of it, however, is
that you need to be able to store data redundantly without actually
having to wait for all redundant stores to be updated before being
able to access the data.
Consider the case where you have a 4TB filesystem representing your
life's work on your local computer. This is going to be the case in a
few years (especially with photos, video, music, etc...). We are becoming
a computer-centric society. If all the data is in just one place, it
is VERY fragile. You lose the hard drive, you lose the data. RAID
systems don't really address the problem... now you lose the data if
you have a fire. Distributed RAID-like systems address the problem
but currently require fast links to really be usable, and we still
have a recovery problem (as in time-wise) when things go wrong or the
filesystem gets corrupted (meaning that RAID alone does not prevent
filesystem corruption), so the data is still fragile.
In a snapshot system you always have access to some version of the
data when things blow up, it is just a matter of how far back in
time you have to go, and whether you are willing to wait for the
filesystem to repair itself to a more recent transaction date. In
other words, you have a lot of choices and a guarantee that the
worst that happens is you lose a few (minutes, hours, days) of
work rather than the entire filesystem.
More importantly, you can have redundancy implemented over slow
links... a continuous backup scheme that operates entirely
asynchronously and provides an entirely consistent snapshot of
the filesystem on the backup as-of some date in the past (constantly
playing catch-up), and unlike the journaling that we have now,
this feature would not affect filesystem performance. That's really
the holy grail right there.
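
A rough sketch of a backup that plays catch-up over a slow link, under
stated assumptions: Primary, Backup and catch_up are hypothetical
names, and the in-memory list of transactions is only a modelling
device, not a claim about how the on-disk format would work. The point
it illustrates is that writes on the primary never wait for the
backup, while the backup always holds a consistent image as of some
earlier transaction ID.

```python
class Primary:
    """Toy primary (hypothetical): writes return without waiting on backups."""

    def __init__(self):
        self.log = []    # (tid, path, content) in ascending tid order
        self.files = {}

    def write(self, path, content):
        tid = len(self.log) + 1
        self.log.append((tid, path, content))
        self.files[path] = content
        return tid


class Backup:
    """Toy backup (hypothetical) that applies transactions in order."""

    def __init__(self):
        self.synced_tid = 0
        self.files = {}

    def catch_up(self, primary, max_records):
        """Apply at most max_records transactions (models a slow link)."""
        start = self.synced_tid  # tids start at 1, so log[start] is the next one
        for tid, path, content in primary.log[start:start + max_records]:
            self.files[path] = content
            self.synced_tid = tid


primary = Primary()
primary.write("a", "1")
primary.write("b", "2")
primary.write("a", "3")

backup = Backup()
backup.catch_up(primary, max_records=2)
print(backup.synced_tid, backup.files)  # consistent as of transaction 2
backup.catch_up(primary, max_records=2)
print(backup.synced_tid, backup.files)  # now caught up to transaction 3
```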
ZFS addresses some of these problems, in particular it modernizes the
idea of having RAID-like redundancy, but it doesn't address the
need not only for a distributed filesystem, but for a distributed
filesystem that is capable of operating robustly over slow links
(aka 'the internet'), nor does it address the issue of operating
coherently in a distributed manner, or the issue of continuous
backups which can be accessed at any time in a fully consistent manner
and which are integrated with, but also independent (access-wise) of
the filesystem.
-Matt
Matthew Dillon
<dillon at backplane.com>