Mon Feb 12 16:27:08 PST 2007

From: Matthew Dillon <dillon at apollo.backplane.com>
Subject: Re: Plans for 1.8+ (2.0?)
Date: Mon, 12 Feb 2007 16:05:53 -0800 (PST)

:Hi Matt,
:Is moving VFS to userland still part of your clustering master plan? :) 
:if it is, is it planned for 2.0?
:
:Petr

    SYSLINK certainly - the communications protocol that will be used for
    filesystem access, thus allowing filesystems in userspace in addition
    to filesystems across the cluster.

    Even with the virtual kernel support I really want to develop the
    filesystem in userland.

    I haven't decided on the filesystem yet, but I am leaning towards
    doing a from-scratch design that will be suitable for our clustering,
    size, and robustness requirements.

    I am still working out the design and will not really know how doable
    it will be in the 2.0 time-frame.  It may be 2.1 before we have a new
    filesystem.

    --

    I've been working on a design spec and will post more information in a
    week or two.  Basically, though, we have to be able to cut up physical
    storage into very large chunks (which can be indexed in kernel memory),
    e.g. 8GB chunks, and then be able to associate the chunks with
    various filesystems and in various ways.  Chunks would simply represent
    physical storage, either local or remote, and not necessarily be
    linearly indexed.
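
    As a rough illustration of the chunk idea above (a minimal sketch, not
    the actual design): a filesystem holds an ordered list of chunks, and
    that logical order need not match any physical order or location.  The
    names Chunk, Filesystem, locate and chunks_needed, and the location
    strings, are hypothetical; only the 8GB example size comes from the post.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8 * 2**30  # 8GB, the example chunk size mentioned in the post


@dataclass(frozen=True)
class Chunk:
    chunk_id: int   # hypothetical identifier for a piece of physical storage
    location: str   # hypothetical label, e.g. "local" or "remote"


@dataclass
class Filesystem:
    name: str
    # Logical order of chunks; chunk ids need not be sequential or co-located.
    chunks: list = field(default_factory=list)

    def locate(self, offset):
        """Map a logical byte offset to (chunk, offset within that chunk)."""
        index, within = divmod(offset, CHUNK_SIZE)
        return self.chunks[index], within


def chunks_needed(nbytes):
    """Number of chunks required to cover nbytes (rounded up)."""
    return -(-nbytes // CHUNK_SIZE)
```

    With 8GB chunks a 200GB disk is only 25 entries, which is why such a
    table is small enough to index in memory.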

    It would also be possible to assign redundancy, whereby two (or more)
    chunks are considered to be mirrors of each other.  However, for
    robustness we would not mirror them in actual fact but would instead
    assign dynamic block numbers (i.e. non-linear addressing) every time a
    bit of data is flushed to physical storage, allowing the data chunks
    to hold a complete historical record, which in turn not only allows
    virtually infinite snapshots but also allows just one of the redundant
    chunks to be written and for the other ones to be updated asynchronously.
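
    The relationship between dynamic block numbers, snapshots and
    asynchronous mirrors can be sketched as follows.  This is a toy model
    under stated assumptions, not the planned on-disk format: a chunk is
    modelled as an in-memory append-only list, the list index stands in for
    the dynamic block number, and ChunkLog, flush, read and catch_up are
    hypothetical names.

```python
class ChunkLog:
    """Append-only store: every flush is assigned a fresh block number."""

    def __init__(self):
        self.records = []  # records[block_no] = (key, data); never overwritten

    def flush(self, key, data):
        """Append the data and return the block number it was assigned."""
        self.records.append((key, data))
        return len(self.records) - 1

    def read(self, key, as_of=None):
        """Return the newest data for key at or before block number as_of.

        as_of=None reads the current state; any earlier block number acts
        as a snapshot, because old records are retained.
        """
        end = len(self.records) if as_of is None else as_of + 1
        for k, data in reversed(self.records[:end]):
            if k == key:
                return data
        return None


def catch_up(primary, mirror):
    """Copy the records the mirror has not seen yet; return how many."""
    missing = primary.records[len(mirror.records):]
    mirror.records.extend(missing)
    return len(missing)
```

    Because nothing is overwritten in place, a mirror that falls behind only
    needs the tail of the log, so writes to the primary never wait on it.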

    So, for example, if you had a 200GB local disk and you purchased a 200GB
    chunk of storage off the internet, a 200GB filesystem would have to
    be able to run at full speed to local disk and then copy the updated
    data asynchronously over the potentially very slow internet link to the
    200GB of redundant storage.  You would want such a filesystem to 
    operate at full speed, as if it were just on the local disk.

    As another example, if you had a cluster of two machines, each with a 200GB
    hard disk, and you wanted a single 200GB filesystem whose storage was
    fully redundant on both machines, then any filesystem update made
    by a particular machine would first update its local disk, then 
    asynchronously copy the new information to the other machine over the
    network (without having to hold the original data in kernel memory).

    AND at the same time you want filesystem operations issued on the
    other machine to do the same thing... immediately write to ITS local
    disk and then copy the data to the other disk asynchronously, giving
    us a multi-master environment.
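
    The two-machine case can be sketched like this (a minimal sketch with
    hypothetical names Node, write, replicate and pair; a dict stands in for
    each local disk).  The outbox records only which blocks are dirty and
    re-reads them from the local disk at replication time, so pending data
    is not held in memory.  How conflicting writes to the same block are
    resolved is not covered in the post and is not attempted here.

```python
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.disk = {}         # stands in for this machine's local disk
        self.outbox = deque()  # block ids written locally, not yet replicated
        self.peer = None

    def write(self, block, data):
        """Write to the local disk immediately; defer the copy to the peer."""
        self.disk[block] = data
        self.outbox.append(block)  # remember the block id only, not the data

    def replicate(self):
        """Asynchronous step: push pending blocks to the peer's disk."""
        sent = 0
        while self.outbox:
            block = self.outbox.popleft()
            self.peer.disk[block] = self.disk[block]  # re-read from local disk
            sent += 1
        return sent


def pair(a, b):
    """Make two nodes each other's replication target (both are masters)."""
    a.peer, b.peer = b, a
```

    Both nodes accept writes at local-disk speed, and the two disks converge
    once each side has run its replication step.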

    Having a multi-master environment is absolutely critical.

    --

    In any case, it may seem complex but I think it is possible to build such
    a filesystem.

    ZFS does some of the things we want.  Much of what I described is
    ZFS-like.  The problem though is that ZFS does not handle the cluster
    aspects of the filesystem that we absolutely have to handle, and the
    more I look at ZFS the more I think it would take longer to port it
    than it would to write one from scratch.

							-Matt