Description of the Journaling topology

Wed Dec 29 22:45:17 PST 2004

On Wed, Dec 29, 2004 at 08:49:34PM -0800, Matthew Dillon wrote:
>:
>:Are you talking cluster fs here? Or would (will) that be by a separate
>:mechanism?
>:
>:// George
>
>    A cache-coherent clustered filesystem is really several things integrated
>    together.  The journaling will be used as the data transport mechanism
>    to keep caches on multiple machines synchronized.  There will have to be
>    a cache coherency mechanism/protocol in addition to that, of course,
>    since journaling alone is not cache-coherent.  You can also look at the
>    problem from the point of view of the cache-coherency protocol.  A
>    cache-coherency protocol just wants to deal with meta-information, it
>    doesn't want to deal with the actual data transfer mechanism.  So the
>    journal is a good fit.

Maybe I don't know enough about the kernel environment just prior to an
fopen for write, but this is sounding overly complex. At the protocol
level aren't we concerned about one thing? Atomic transactions. However
many "hot" physical devices there are across whatever network, shouldn't
they all finish before the exit 0?

Minimizing the data sent across the slowest segment to a physical
device will lower transfer times, unless the cost of that procedure
(e.g. compression) outweighs the time saved. (I wonder if it is possible
to send less data by only transmitting the _changes_ to a block device...)
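
To make the "only send the changes" idea concrete, here is a rough
sketch, not anything from the actual journaling code. The names
block_delta and apply_delta and the 512-byte chunk size are my own
assumptions: compare the old and new contents of a block chunk by chunk
and ship only the chunks that differ.

```python
def block_delta(old: bytes, new: bytes, chunk: int = 512):
    """Return [(offset, data)] for each chunk of `new` that differs from `old`.

    Hypothetical helper; assumes both sides already hold `old`.
    """
    if len(old) != len(new):
        raise ValueError("blocks must be the same size")
    return [
        (off, new[off:off + chunk])
        for off in range(0, len(new), chunk)
        if old[off:off + chunk] != new[off:off + chunk]
    ]


def apply_delta(old: bytes, delta):
    """Rebuild the new block on the receiving side from `old` plus the delta."""
    buf = bytearray(old)
    for off, data in delta:
        buf[off:off + len(data)] = data
    return bytes(buf)
```

A one-byte change to a 4096-byte block then costs one 512-byte chunk on
the wire instead of the whole block, at the price of comparing the two
copies on the sending side.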

Now that I've laid out my words, what you're saying makes more sense,
but I think "journal" is not a good word for it. You want to block a
CPU's fopen to a warm physical device mirror, from the moment a hot
device gets an fopen write to the same file, until the hot device write
is committed to all the warm mirrors. Yeah, maybe journal (of propagating
writes) is a good name for it.
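
As a toy model of that blocking behaviour (userland Python, nothing like
real kernel code, and the WriteFence name and its methods are purely my
invention): a write on the hot device marks the file pending, each warm
mirror acknowledges the commit, and an open on a warm mirror waits until
no acknowledgements are outstanding.

```python
import threading


class WriteFence:
    """Hypothetical per-file fence between a hot device and its warm mirrors."""

    def __init__(self, mirrors):
        self._mirrors = set(mirrors)
        self._pending = {}  # path -> mirrors that have not yet committed
        self._cond = threading.Condition()

    def begin_write(self, path):
        """Hot device received a write: the file is now stale on every mirror."""
        with self._cond:
            if self._mirrors:
                self._pending[path] = set(self._mirrors)

    def ack(self, path, mirror):
        """A warm mirror reports that it has committed the write."""
        with self._cond:
            waiting = self._pending.get(path)
            if waiting is None:
                return
            waiting.discard(mirror)
            if not waiting:
                del self._pending[path]
                self._cond.notify_all()

    def wait_open(self, path, timeout=None):
        """Block an open on a warm mirror until the file is clean again.

        Returns True if the open may proceed, False if the timeout expired.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: path not in self._pending, timeout
            )
```

A real implementation would also have to cope with mirrors that die
before acknowledging, which this sketch ignores.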

But here are a few things to ponder: will 1Gb NFS or 10Gb fiber to a
GFS on a fast RAID server just be better and cheaper than a bunch of
warm mirrors? How much of a performance hit will the journaling code
be, especially on local partitions with kernels that only use it for a
"shared" mount point? djbdns logging is a good example: even if you log
to /dev/null, generating the logged info is a significant performance
hit for the app. I guess all I'm saying is, if the journaling is not
being used, bypass it!
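
What I mean by "bypass it", as a minimal sketch (the Journal class and
its attach/log_write methods are hypothetical, not the real interface):
check whether anything is listening before building the record at all,
so an unjournaled mount pays for one test and nothing else.

```python
class Journal:
    """Hypothetical journal that skips record generation when unused."""

    def __init__(self):
        self._sinks = []
        self.records_built = 0  # kept only to make the cost visible

    def attach(self, sink):
        """Register a callable that receives each journal record."""
        self._sinks.append(sink)

    def log_write(self, path, offset, data):
        if not self._sinks:
            return  # fast path: nobody is listening, build nothing
        record = {"path": path, "offset": offset, "data": bytes(data)}
        self.records_built += 1
        for sink in self._sinks:
            sink(record)
```

The point is that the test comes before the record is generated, unlike
the log-to-/dev/null case where the work is done and then thrown away.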

As for a coherent VFS cache protocol, I'm reminded of wise words from
Josh, a db programmer: "the key to performance is in minimizing the
quantity of data," i.e. use bit tokens instead of keywords in the db.
And it was Ike who put the Spread toolkit on my "try someday" list:

        http://www.spread.org/

        Spread is a toolkit that provides a high performance messaging
        service that is resilient to faults across external or
        internal networks. Spread functions as a unified message
        bus for distributed applications, and provides highly tuned
        application-level multicast and group communication support.

>    For example, let's say you have a cluster of 30 machines and for
>    robustness you want to 100% replicate your main filesystems on 3 of
>    those machines.  So now you have a situation where 3 of the machines
>    need to stay completely up-to-date with each other, and 27 of the
>    machines need to be able to cache data on a more temporary basis.
>    Both situations can be made nothing more than aspects of the *SAME*
>    cache-coherency and journaling protocols.  The only difference is
>    that some of the machines require a large journaling and cache
>    coherency data volume (the ones doing the mirroring), while other
>    machines require far smaller volumes of data to be transferred.  It
>    sounds like a complex problem but it is actually no more complex than
>    what the cache coherency protocol must already accomplish within the
>    cluster.

That's an excellent example!
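
If I follow it, the 3-mirror / 27-cache split could look something like
the sketch below. This is my own reading, not Matt's design, and Node,
publish_write and the message tuples are all made up: every node gets
the same stream of events, but mirrors receive the data while caching
nodes receive only a small invalidation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member; `mirror` marks a full replica."""
    name: str
    mirror: bool
    inbox: list = field(default_factory=list)


def publish_write(nodes, path, offset, data):
    """Deliver one write to every node; return the payload bytes sent."""
    sent = 0
    for node in nodes:
        if node.mirror:
            # Mirrors need the data itself to stay fully up to date.
            node.inbox.append(("data", path, offset, data))
            sent += len(data)
        else:
            # Caches only need to know that their copy is stale.
            node.inbox.append(("invalidate", path, offset, len(data)))
    return sent
```

With 3 mirrors and 27 caches, a 4096-byte write moves 3 * 4096 bytes of
data and 27 small invalidation messages, all through the same code path.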

// George

-- 
George Georgalis, systems architect, administrator Linux BSD IXOYE
http://galis.org/george/ cell:646-331-2027 mailto:george at xxxxxxxxx