Plans for 1.8+ (2.0?)

B. Estrade estrabd at gmail.com
Tue Feb 20 07:02:49 PST 2007


On 2/19/07, Matthew Dillon <dillon at apollo.backplane.com> wrote:
    I've been letting the conversation run to see what people have to say.
    I am going to select this posting by Brett to answer, as well as
    provide some more information.

:I am not sure I understand the potential aim of the new file system -
:is it to allow all nodes on the SSI (I purposefully avoid terms like
:"grid") to have all "local" data actually on their hard drives, or is
:it more like each node is aware of all data on the SSI, but the data
:may be scattered about all of the nodes on the SSI?

    I have many requirements that need to be fulfilled by the new
    filesystem.  I have just completed the basic design work and I feel
    quite confident that I can have the basics working by our Summer
    release.

    - On-demand filesystem check and recovery.  No need to scan the entire
      filesystem before going live after a reboot.

    - Infinite snapshots (e.g. one per 30-second sync), with the ability
      to collapse snapshots for any time interval as a means of
      recovering space (a rough sketch follows this list).

    - Multi-master replication at the logical layer (not the physical
      layer), including the ability to self-heal a corrupted filesystem
      by accessing replicated data.  Multi-master means that each
      replicated store can act as a master: new filesystem ops can run
      independently on any of the replicated stores and will flow to the
      others.

    - Infinite-log replication.  No requirement to keep a separate log of
      changes for the purposes of replication, meaning that replication
      targets can be offline for days without affecting performance or
      operation ('mobile' computing, but also replication over slow links
      for backup purposes and other things).

    - 64-bit file space, 64-bit filesystem space.  No space restrictions
      whatsoever.

    - Reliably handle data storage for huge multi-hundred-terabyte
      filesystems without fear of unrecoverable corruption.

    - Cluster operation - the ability to commit data to a locally
      replicated store independently of other nodes, with access governed
      by cache coherency protocols.
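
    To give a feel for the snapshot item above, here is a rough sketch,
    assuming nothing more than records tagged with the transaction ids
    that created and deleted them (illustrative only, not the actual
    on-disk format):

        /*
         * Illustrative only -- not the actual on-disk format.  Each
         * record remembers the transaction id that created it and, once
         * superseded or removed, the transaction id that deleted it.  A
         * snapshot "as of" tid T is simply the set of records visible at
         * T, so history accumulates naturally and space is recovered by
         * pruning records whose whole lifetime falls inside the interval
         * being collapsed.
         */
        #include <stdint.h>
        #include <stdio.h>

        struct fs_record {
            uint64_t obj_id;        /* which file/inode this belongs to */
            uint64_t create_tid;    /* transaction id that created it */
            uint64_t delete_tid;    /* 0 while live, else deleting tid */
            /* ... key, data reference, etc ... */
        };

        /* Is the record visible in a snapshot taken at transaction 'tid'? */
        static int
        record_visible(const struct fs_record *rec, uint64_t tid)
        {
            if (rec->create_tid > tid)
                return 0;               /* created after the snapshot */
            if (rec->delete_tid != 0 && rec->delete_tid <= tid)
                return 0;               /* already deleted by then */
            return 1;
        }

        int
        main(void)
        {
            /* a record that lived from tid 1000 until tid 2000 */
            struct fs_record rec = { 42, 1000, 2000 };

            /* visible at tid 1500 (prints 1), gone by tid 2500 (prints 0) */
            printf("%d %d\n", record_visible(&rec, 1500),
                              record_visible(&rec, 2500));
            return 0;
        }

    In a model like that, a 30-second sync interval just means a new tid
    becomes visible every 30 seconds; nothing special has to be stored
    per snapshot.
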
:So, in effect, is it similar in concept to the notion of storing bits
:of files across many places using some unified knowledge of where the
:bits are? This of course implies redundancy and creates synchronization
:problems to handle (assuming no global clock), but I certainly think
:it is a good goal.  In reality, how redundant will the data be?  In a
:practical sense, I think the principle of "locality" applies here -
:the pieces that make up large files will all be located very close to
:one another (i.e., clustered around some single location).

    Generally speaking, the topology is up to the end-user.  The main
    issue for me is that the type of replication being done here is
    logical-layer replication, not physical replication.  You can
    think of it as running a totally independent filesystem for each
    replication target, but the filesystems cooperate with each other
    and cooperate in a clustered environment to provide a unified,
    coherent result.
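
    To make the logical-vs-physical distinction concrete, here is a
    minimal sketch (hypothetical structures, not an actual wire
    protocol):  what flows between replication targets is a description
    of the operation, and each target applies it through its own
    filesystem code rather than receiving a copy of another target's
    disk blocks.

        /*
         * Hypothetical sketch of logical-layer replication.  Each target
         * applies the same logical op through its own, fully independent
         * filesystem, so the on-disk layouts may differ completely and a
         * corrupted copy can be rebuilt from a healthy peer.
         */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        enum fs_op_type { FS_OP_CREATE, FS_OP_WRITE, FS_OP_REMOVE };

        struct fs_logical_op {
            enum fs_op_type type;
            uint64_t        tid;        /* transaction id, for ordering */
            char            path[128];  /* object being operated on */
        };

        static void
        apply_logical_op(const char *target, const struct fs_logical_op *op)
        {
            /* A real target would run this through its own VFS code. */
            printf("%s: tid %llu op %d on %s\n", target,
                   (unsigned long long)op->tid, (int)op->type, op->path);
        }

        int
        main(void)
        {
            struct fs_logical_op op;

            memset(&op, 0, sizeof(op));
            op.type = FS_OP_CREATE;
            op.tid  = 1001;
            snprintf(op.path, sizeof(op.path), "/home/brett/results.dat");

            /* The same logical op flows to every replication target. */
            apply_logical_op("master-A", &op);
            apply_logical_op("master-B", &op);
            return 0;
        }

    Physical replication, by contrast, would ship block images, which
    ties every copy to a single layout and propagates corruption right
    along with the data.
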
:From my experience, one of the largest issues related to large-scale
:computing is the movement of large files, but with the trend moving
:towards huge local disks and many-core architectures (which I agree
:with), I see the "grid" concept of geographically diverse machines
:connected as a single system being put to rest in favor of local
:clusters of many-core machines.
:...
:With that, the approach that DfBSD is taking is vital with respect to
:distributed computing, but any hard requirement of moving huge files
:long distances (even if done so in parallel) might not be so great.
:What is required is native parallel I/O that is able to handle locally
:distributed situations - because it is in close proximity that many
:processes would be writing to a "single" file.  Reducing the scale of
:the problem may provide some clues into how it may be used and how it
:should handle the various situations effectively.
:...
:Additionally, the concept of large files somewhat disappears when you
:are talking about shipping off virtual processes to execute on some
:other processor or core, because they are not shipped off with a whole
:lot of data to work on.  I know this is not necessarily an SSI concept,
:but it is one that DfBSD will have people wanting to do.
:
:Cheers,
:Brett

    There are two major issues here:  (1) where the large files reside,
    and (2) how much of that data running programs need to access.

    For example, let's say you had a 16 gigabyte database.  There is a
    big difference between scanning the entire 16 gigabytes of data
    and doing a query that only has to access a few hundred kilobytes
    of the data.

    No matter what, you can't avoid reading data that a program insists
    on reading.  If the data is not cacheable or the amount of data
    being read is huge, the cluster has the choice of moving the
    program's running context closer to the storage, or transferring
    the data over the network.
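
    As a back-of-the-envelope sketch (made-up numbers and a hypothetical
    heuristic, not an actual scheduler), the choice reduces to comparing
    the cost of shipping the data against the cost of shipping the
    program's execution context:

        /*
         * Hypothetical heuristic, illustration only.  Compare the time
         * to stream the data a program will read over the network with
         * the time to migrate the program's execution context to a node
         * near the storage, and pick whichever is cheaper.
         */
        #include <stdint.h>
        #include <stdio.h>

        #define GiB (1024ULL * 1024ULL * 1024ULL)
        #define KiB (1024ULL)

        enum placement { MOVE_DATA_TO_PROGRAM, MOVE_PROGRAM_TO_DATA };

        static enum placement
        choose_placement(uint64_t bytes_to_read, uint64_t context_bytes,
                         uint64_t link_bytes_per_sec, double *data_secs)
        {
            *data_secs = (double)bytes_to_read / (double)link_bytes_per_sec;
            double migrate_secs =
                (double)context_bytes / (double)link_bytes_per_sec;

            return (migrate_secs < *data_secs) ? MOVE_PROGRAM_TO_DATA
                                               : MOVE_DATA_TO_PROGRAM;
        }

        int
        main(void)
        {
            uint64_t link = 110ULL * 1024ULL * 1024ULL; /* ~1Gbit/s usable */
            uint64_t ctx  = 64ULL * 1024ULL * 1024ULL;  /* 64MB of context */
            double secs;

            /* Full 16GB scan: cheaper to move the program to the data. */
            enum placement p1 = choose_placement(16 * GiB, ctx, link, &secs);
            printf("16GB scan:   %s (%.0f sec to ship the data)\n",
                   p1 == MOVE_PROGRAM_TO_DATA ? "move program" : "move data",
                   secs);

            /* Query touching a few hundred KB: just ship the data. */
            enum placement p2 = choose_placement(300 * KiB, ctx, link, &secs);
            printf("300KB query: %s (%.3f sec to ship the data)\n",
                   p2 == MOVE_PROGRAM_TO_DATA ? "move program" : "move data",
                   secs);
            return 0;
        }

    With the 16 gigabyte example above, a full scan over a gigabit LAN
    is a couple of minutes of transfer, so migrating the context wins
    easily; for a query touching a few hundred kilobytes it is not even
    close the other way.
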
    A cluster can also potentially use large local disks as a large
    cache in order to be able to distribute the execution contexts, even
    when the data being accessed is not local.  It might be reasonable to
    have a 100TB data store with 1TB local 'caches' in each LAN-based locale.
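
    A minimal sketch of that caching idea (the backends here are
    stand-ins, and a real setup would also need the cache coherency
    protocols mentioned above to know when a cached copy has gone
    stale):

        /*
         * Hypothetical sketch of a LAN-local disk cache sitting in front
         * of a much larger remote store.  The backends are stand-ins;
         * cached_read() is the point of the example.
         */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define BLKSIZE 16384

        static int
        local_cache_read(uint64_t blockno, void *buf)
        {
            (void)blockno; (void)buf;
            return -1;                  /* stand-in: always a miss */
        }

        static void
        local_cache_insert(uint64_t blockno, const void *buf)
        {
            (void)blockno; (void)buf;   /* stand-in: stored on local disk */
        }

        static int
        remote_store_read(uint64_t blockno, void *buf)
        {
            (void)blockno;
            memset(buf, 0, BLKSIZE);    /* stand-in: fetched over the WAN */
            return 0;
        }

        /*
         * Read-through cache: satisfy reads from the 1TB local cache when
         * possible, go to the 100TB store on a miss, and keep a copy so
         * the next access from this locale stays on the LAN.
         */
        static int
        cached_read(uint64_t blockno, void *buf)
        {
            if (local_cache_read(blockno, buf) == 0)
                return 0;
            if (remote_store_read(blockno, buf) != 0)
                return -1;
            local_cache_insert(blockno, buf);
            return 0;
        }

        int
        main(void)
        {
            char buf[BLKSIZE];

            if (cached_read(12345, buf) == 0)
                printf("block 12345 read (and now cached locally)\n");
            return 0;
        }
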
    Either way, I think it is pretty clear that individual nodes are
    significantly more powerful when coupled with the availability of
    local (LAN) hard disk storage, especially in the types of
    environments that I want DragonFly to cater to.

    The DragonFly clustering is meant to be general purpose... I really want
    to be able to migrate execution contexts, data, and everything else
    related to a running program *without* having to actually freeze the
    program while the migration occurs.  That gives system operators the
    ultimate ability to fine-tune a cluster.
                                        -Matt
                                        Matthew Dillon
                                        <dillon at backplane.com>

Thanks for that, Matt.

Cheers,
Brett
--
AIM: bz743
Desk @ LONI/HPC: 225.578.1920



