Mon Feb 12 16:27:08 PST 2007
From: Matthew Dillon <dillon at apollo.backplane.com>
Subject: Re: Plans for 1.8+ (2.0?)
Date: Mon, 12 Feb 2007 16:05:53 -0800 (PST)
:Is moving VFS to userland still part of your clustering master plan? :)
:if it is, is it planned for 2.0?
SYSLINK certainly - the communications protocol that will be used for
filesystem access, thus allowing filesystems in userspace in addition
to filesystems across the cluster.
Even with the virtual kernel support I really want to develop the
filesystem in userland.
I haven't decided on the filesystem yet, but I am leaning towards
doing a from-scratch design that will be suitable for our clustering,
size, and robustness requirements.
I am still working out the design and will not really know how doable
it will be in the 2.0 time-frame. It may be 2.1 before we have a new
I've been working on a design spec and will post more information in a
week or two. Basically, though, we have to be able to cut up physical
storage into very large chunks (which can be indexed in kernel memory),
e.g. 8GB chunks, and then be able to associate the chunks with
various filesystems and in various ways. Chunks would simply represent
physical storage, either local or remote, and not necessarily be
It would also be possible to assign redundancy, whereby two (or more)
chunks are considered to be mirrors of each other. However, for
robustness we would not mirror them in actual fact but would instead
assign dynamic block numbers (i.e. non-linear addressing) every time a
bit of data is flushed to physical storage, allowing the data chunks
to hold a complete historical record, which in turn not only allows
virtually infinite snapshots but also allows just one of the redundant
chunks to be written and for the other ones to be updated asynchronously.
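
As a rough illustration of the idea above (not the actual design, which
is still being worked out), here is a minimal Python sketch. Every name
in it (Chunk, MirroredStore, sync_mirror, snapshot) is hypothetical. It
assumes a chunk can be modeled as an append-only list in which the list
index plays the role of the dynamically assigned block number, so old
data is never overwritten, a snapshot is just a remembered position in
the history, and the mirror can be brought up to date later.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Append-only chunk (hypothetical model, not DragonFly code)."""
    # The list index stands in for the dynamically assigned block number.
    records: list = field(default_factory=list)

    def append(self, key, data):
        self.records.append((key, data))
        return len(self.records) - 1


class MirroredStore:
    """One synchronously written chunk plus one lazily updated mirror."""

    def __init__(self):
        self.primary = Chunk()
        self.mirror = Chunk()
        self.synced = 0  # count of primary records already on the mirror

    def write(self, key, data):
        # Only the primary is written now; each flush gets a new block number.
        return self.primary.append(key, data)

    def sync_mirror(self, max_records=None):
        # Asynchronous catch-up: replay history the mirror has not seen yet.
        end = len(self.primary.records)
        if max_records is not None:
            end = min(end, self.synced + max_records)
        for key, data in self.primary.records[self.synced:end]:
            self.mirror.append(key, data)
        self.synced = end

    def snapshot(self):
        # Nothing is overwritten, so a snapshot is just a history position.
        return len(self.primary.records)

    def read(self, key, as_of=None):
        limit = len(self.primary.records) if as_of is None else as_of
        for k, data in reversed(self.primary.records[:limit]):
            if k == key:
                return data
        return None
```

Writing the same key twice leaves both versions in the history, so a
read at an earlier snapshot still returns the older data, and the mirror
stays empty until sync_mirror() is called.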
So, for example, if you had a 200GB local disk and you purchased a 200GB
chunk of storage off the internet, a 200GB filesystem would have to
be able to run at full speed to local disk and then copy the updated
data asynchronously over the potentially very slow internet link to the
200GB of redundant storage. You would want such a filesystem to
operate at full speed, as if it were just on the local disk.
Another example: if you had a cluster of two machines, each with a 200GB
hard disk, and you wanted a single 200GB filesystem whose storage was
fully redundant on both machines, then any filesystem update made
by a particular machine would first update its local disk, then
asynchronously copy the new information to the other machine over the
network (without having to hold the original data in kernel memory).
AND at the same time you want filesystem operations issued on the
other machine to do the same thing... immediately write to ITS local
disk and then copy the data to the other disk asynchronously, giving
us a multi-master environment.
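
To make the two-machine case concrete, here is a small Python sketch
under some loud assumptions: the Node class, its outbox, and the
(origin, sequence) record identifiers are all hypothetical and are not
part of any posted design. Each node commits to its own history first
and ships the new records to its peer later. How conflicting updates to
the same data would be resolved is not covered here.

```python
class Node:
    """One master in a two-node cluster (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.log = []       # local history: (origin, seq, key, data)
        self.seen = set()   # (origin, seq) pairs already in the log
        self.outbox = []    # records not yet shipped to the peer
        self.next_seq = 0

    def write(self, key, data):
        # Commit to the local disk (log) immediately, then queue for the peer.
        rec = (self.name, self.next_seq, key, data)
        self.next_seq += 1
        self._apply(rec)
        self.outbox.append(rec)
        return rec

    def _apply(self, rec):
        # Idempotent: a record is appended at most once per node.
        ident = rec[:2]
        if ident not in self.seen:
            self.seen.add(ident)
            self.log.append(rec)

    def ship_to(self, peer):
        # Asynchronous step: copy queued records over the (slow) link.
        pending, self.outbox = self.outbox, []
        for rec in pending:
            peer._apply(rec)
```

Both nodes accept writes without waiting for each other, and after each
has shipped its outbox the two histories contain the same set of
records, though not necessarily in the same order.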
Having a multi-master environment is absolutely critical.
In any case, it may seem complex but I think it is possible to build such
ZFS does some of the things we want. Much of what I described is
ZFS-like. The problem though is that ZFS does not handle the cluster
aspects of the filesystem that we absolutely have to handle, and the
more I look at ZFS the more I think it would take longer to port it
than it would to write one from scratch.