Description of the Journaling topology

Wed Dec 29 23:53:40 PST 2004

:...
:level aren't we concerned about one thing? Atomic transactions. However
:many "hot" physical devices there are across whatever network, shouldn't
:they all finish before the exit 0?
:
:Minimizing the data to transfer across the slowest segment to a physical
:device will lower transfer times, unless that procedure (eg compression)
:overweighs the delay. (I wonder if it is possible to send less data by
:only transmitting the _changes_ to a block device...)

    Your definition of what constitutes an 'atomic transaction' is not
    quite right, and that is where the confusion is stemming from.

    An atomic transaction (that is, a cache-coherent transaction) does not
    necessarily need to push anything out to other machines in order to
    complete the operation.  All it needs is mastership of the data or
    meta-data involved.  For example, if you are trying to create a
    file O_CREAT|O_EXCL, all you need is mastership of the namespace 
    representing that file name.

    Note that I am NOT talking about a database 'transaction' in the
    traditional hard-storage sense, because that is NOT what machines need
    to do most of the time.

    This 'mastership' requires communication with the other machines in the
    cluster, but the communication may have ALREADY occurred sometime in the
    past.  That is, your machine might ALREADY have mastership of the
    necessary resources, which means that your machine can conclude the
    operation without any further communication.

    In other words, your machine would be able to execute the create operation
    and return from the open() without having to synchronously communicate
    with anyone else, and still maintain a fully cache coherent topology
    across the entire cluster.
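
    As a rough illustration of the point above, here is a toy sketch in
    Python.  It is not taken from any real cluster code: the Node class,
    its method names, and the 'messages' counter are all invented for the
    example, and mastership is tracked per path name purely for simplicity.
    It only shows that an exclusive create needs no new round trip once
    mastership of the name is already held.

```python
class Node:
    """Toy cluster node. Every name here is hypothetical."""

    def __init__(self, name):
        self.name = name
        self.mastered = set()    # namespace entries this node holds mastership of
        self.namespace = set()   # file names that exist, as seen by this node
        self.messages = 0        # synchronous round trips to other nodes

    def acquire_mastership(self, path):
        # Stand-in for the cache coherency layer talking to the cluster.
        # It costs a round trip only if mastership is not already held.
        if path not in self.mastered:
            self.messages += 1
            self.mastered.add(path)

    def create_excl(self, path):
        # Analogue of open(path, O_CREAT|O_EXCL).
        self.acquire_mastership(path)
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace.add(path)


node = Node("a")
node.acquire_mastership("/tmp/lock")   # happened "sometime in the past"
before = node.messages
node.create_excl("/tmp/lock")          # completes with no further communication
assert node.messages == before
```

    A node that does not yet hold mastership pays exactly one round trip
    in this model, and the second exclusive create of the same name fails
    locally without asking anyone.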

    The management of 'mastership' of resources is the responsibility of the
    cache coherency layer in the system.  It is not the responsibility of the
    journal.  The journal's only responsibility is to buffer the operation 
    and shove it out to the other machines, but that can be done 
    ASYNCHRONOUSLY, long after your machine's open() returned.  It can do
    this because the other machines will not be able to touch the resource
    anyway, whether the journal has written it out or not, because they
    do not have mastership of the resource... they would have to talk to
    your machine to gain mastership of the resource before they could mess
    with the namespace which means that your machine then has the opportunity
    to ensure that the related data has been synchronized to the requesting
    machine (via the journal) before handing over mastership of the data to
    that machine.
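
    The division of labor described above can be sketched as follows.
    This is a minimal model under stated assumptions, not an actual
    journaling implementation: Journal, Owner, and hand_over_mastership()
    are hypothetical names, and flush() merely stands in for pushing
    buffered records over the wire.

```python
from collections import deque


class Journal:
    """Buffers operations locally; flush() stands in for sending them out."""

    def __init__(self):
        self.pending = deque()   # recorded but not yet sent
        self.shipped = []        # what the remote side has received

    def record(self, op):
        self.pending.append(op)  # returns immediately, nothing is sent

    def flush(self):
        while self.pending:
            self.shipped.append(self.pending.popleft())


class Owner:
    """The machine that currently holds mastership of a resource."""

    def __init__(self):
        self.journal = Journal()
        self.is_master = True

    def create(self, path):
        if not self.is_master:
            raise PermissionError("mastership required")
        self.journal.record(("create", path))   # caller does not wait on the network

    def hand_over_mastership(self):
        # Another machine asked for mastership: synchronize first, then release.
        self.journal.flush()
        self.is_master = False
        return list(self.journal.shipped)


owner = Owner()
owner.create("/a")
owner.create("/b")
assert owner.journal.shipped == []      # nothing has left the machine yet
seen = owner.hand_over_mastership()     # the requester sees both creates
```

    The only synchronous point in this sketch is the handoff itself, which
    is where the owning machine gets its chance to bring the requester up
    to date.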

    There is no simple way to do this.  Cache coherency protocols are complex
    beasts.  Very complex beasts.  I've written such protocols in the past
    and it takes me several man-months to do one (and this is me we are
    talking about, lightning-programmer-matt-dillon).  Fortunately having
    done it I pretty much know the algorithms by heart now :-)

:But here are a few things to ponder, will a 1Gb nfs or 10Gb fiber to a
:GFS on a fast raid server just be better and cheaper than a bunch of
:warm mirrors? How much of a performance hit will the journaling code
:be, especially on local partitions with kernels that only use it for a
:"shared" mount point? djbdns logging is good example, even if you log to
:/dev/null, generation of the logged info is a significant performance
:hit for the app. I guess all I'm saying is, if the journaling is not
:being used, bypass it!

    Well, what is the purpose of the journaling in this context?  If you
    are trying to have an independent near-realtime backup of your
    filesystem then obviously you can't consolidate it into the same 
    physical hardware you are running from normally, that would kinda kill
    the whole point.

    As for performance... well, if you are journaling over a socket then
    the overhead from the point of view of the originating machine is
    basically the overhead of writing the journal to a TCP socket (read:
    not very high relative to other mechanisms).  Bandwidth is always
    an issue, but only if there is actual interference with other activities.

    If you are trying to mirror data in a clustered system, ignoring
    robustness issues, then the question is whether the clustered system
    is going to be more efficient with three mirrored sources for the 
    filesystem data or with just one source.  If the cost of getting the
    data to the other mirrors is small then there is an obvious performance
    benefit to being able to pull it from several places rather than from
    just one place.  For one thing, you might not need 10G links to a
    single consolidated server... 1G links to multiple machines might be
    sufficient.  Cost is no small issue here, either.

    There are lots of variables in the equation, and no simple answer.

:As far as a coherent VFS cache protocol, I'm reminded of wise words from
:Josh, a db programmer, "the key to performance is in minimizing the
:quantity of data," ie use bit tokens instead of keywords in the db. And,
:it was Ike that put the Spread toolkit in my "try someday" list,
:...
:// George
:-- 
:George Georgalis, systems architect, administrator Linux BSD IXOYE

    The key to performance is multifold.  It isn't just minimizing the
    amount of data transferred... it's minimizing latency, it's being able to
    asynchronize data transfers so programs do not have to stall waiting
    for things, and it's being able to already have the data present locally
    when it is requested rather than have to go over the wire to get it
    every time.  What is faster, gzip'ing a 1G file and transporting it over
    a GigE network or transporting the uncompressed 1G file over a GigE
    network?  The answer is:  it depends how fast your cpu is and has little
    to do with how fast your network is (beyond a certain point).
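
    A back-of-the-envelope model of that gzip question, in Python.  All of
    the numbers are assumptions picked for illustration (a 2:1 compression
    ratio, 50 MB/s and 500 MB/s compression throughput, a 1 Gbit/s link
    with no protocol overhead), and the model compresses the whole file
    before sending and ignores decompression on the far side.

```python
def transfer_seconds(size_bytes, link_bytes_per_s):
    """Time to push the raw file over the link."""
    return size_bytes / link_bytes_per_s


def compressed_transfer_seconds(size_bytes, link_bytes_per_s,
                                gzip_bytes_per_s, ratio):
    """Compress fully, then send. ratio = compressed size / original size."""
    compress = size_bytes / gzip_bytes_per_s
    send = (size_bytes * ratio) / link_bytes_per_s
    return compress + send


SIZE = 1_000_000_000    # 1 GB file
LINK = 125_000_000      # 1 Gbit/s expressed in bytes per second
RATIO = 0.5             # assumed compression ratio

raw = transfer_seconds(SIZE, LINK)                                      # 8.0 s
slow_cpu = compressed_transfer_seconds(SIZE, LINK, 50_000_000, RATIO)   # 24.0 s
fast_cpu = compressed_transfer_seconds(SIZE, LINK, 500_000_000, RATIO)  # 6.0 s
```

    With these made-up figures the same network gives opposite answers:
    compression loses badly on the slow cpu and wins on the fast one.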

    Now, of course, minimizing the data *IS* the dominant factor if you are
    comparing a brute-force algorithm with one that only transports needed
    data.  e.g. if a program is trying to read 4KB out of a 10GB file it is
    obviously faster to just ship the 4KB over rather than ship the entire
    10GB file over.

    One of the things a good cache coherency protocol does is reduce the
    amount of duplicate data being transferred between boxes.  Duplicate
    information is a real killer.  So in that sense a good cache
    coherency algorithm can help a great deal.

    You can think of the cache coherency problem somewhat like the way
    cpu caches work in SMP systems.  Obviously any given cpu does not have
    to synchronously talk to all the other cpus every time it does a memory
    access.  The reason: the cache coherency protocol gives that cpu certain
    guarantees in various situations that allow the cpu to access a great
    deal of data from cache instantly, without communicating with the other
    cpus, yet still maintain an illusion of atomicity.  For example, I'm sure
    you've heard the comments about the overhead of getting and releasing a
    mutex in FreeBSD:  "It's fast if the cpu owns the cache line".  
    "It's slow if several cpus are competing for the same mutex but fast if
    the same cpu is getting and releasing the mutex over and over again".
    There's a reason why atomic bus operations are sometimes 'fast' and 
    sometimes 'slow'.
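
    The mutex example can be mimicked with a very small simulation.  This
    is a deliberately crude model, not how any real cpu or FreeBSD mutex
    is implemented: CacheLine, atomic_op(), and the bus_transfers counter
    are invented names, and the only rule is that a cpu must pull the line
    over the bus whenever it is not the current exclusive owner.

```python
class CacheLine:
    """Toy model of one cache line holding a mutex, owned by one cpu at a time."""

    def __init__(self):
        self.owner = None
        self.bus_transfers = 0

    def atomic_op(self, cpu):
        if self.owner != cpu:
            self.bus_transfers += 1   # slow path: pull the line from its owner
            self.owner = cpu
        # fast path: the line is already owned, nothing goes over the bus


def run(cpu_sequence):
    """Each entry in cpu_sequence gets and then releases the mutex once."""
    line = CacheLine()
    for cpu in cpu_sequence:
        line.atomic_op(cpu)   # get
        line.atomic_op(cpu)   # release
    return line.bus_transfers


uncontended = run([0] * 1000)     # same cpu over and over: 1 transfer
contended = run([0, 1] * 500)     # two cpus alternating: 1000 transfers
```

    Both runs perform the same 1000 get/release pairs; the only difference
    is how often ownership of the line has to move.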

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>