Description of the Journaling topology

Fri Dec 31 01:13:24 PST 2004

:I'm using a simplistic, though, I think, accurate perspective. When I
:used the term atomic, I simply meant when a file is (being) modified it
:is not available until that change is complete, ie no chance of reading
:a half written file. This entire process is what I meant by transaction.
:I'm not sure of a better way to describe that than "atomic transaction"
:even though those words often have other meaning. I'd also call that
:"transaction" a "commit" to the filesystem for similar reasons. Also,
:I mean a context above the VFS. Simply, it's imperative that in user
:space, a file read cannot complete if that file is open for write,
:similarly a file write should not alter the the outcome of a read that
:has already started. I could be wrong about this, but it seems a good
:requirement to have.

    These are the types of guarentees that a cache coherency protocol is
    able to make.

:the fundamental requirements. Is that process above something we can
:expect from the kernel? Or, will the need to write to a temporary file
:on the device and mv it to the actual name when the write is complete,
:continue in user space, for the "atomic" effect? Which is really what I
:was getting at.
:...
:If this read blocking while file write process is handled by the
:kernel (VFS?), wouldn't it greatly simplify a cache-coherent messaging

    The atomic effect is accomplished by having a cache coherency protocol
    which locks out conflicting operations until the conflict is resolved.
    So, for example, if you were to do a 2GB write() and it took 30 seconds
    to accomplish this, then the cache coherency mechanism might prevent
    anyone else from reading any portion of the seek range involved in the
    write until the write completes. 

:>    This 'mastership' requires communication with the other machine in the
:>    cluster, but the communication may have ALREADY occurred sometime in the
:>    past.  That is, your machine might ALREADY have mastership of the
:>    necessary resources, which means that your machine can conclude the
:>    operation without any further communication.
:
:Indeed it's the unpredictable order by which the warm mirrors get in
:sync that gives value to the term coherent!
:
:So, what about conflicts? What happens when a node tries to write
:a file, but another node also tries to write the same file, a tick
:earlier?
:...
:Do you mean a warm mirror can commit to local disk and trigger the 3 hot
:mirrors to sync up and propagate to the other 20? Slick!
:...
:<other questions about cache coherency mechanisms>

    Ok, you are asking a lot of good questions here and most of them relate
    to the same mechanism so I will tackle all of them at once.

    First, lets consider the case where you have 3 machines which in a 
    multi-master configuration.  That is, you do NOT have a designated
    'master' machine, and now several of these machines wish to issue
    conflicting I/O's to a common set of data that all three machines 
    have independant copies of.  Lets say that all three copies are
    synchronized to start with.

    From any given machine's point of view, lets say from machine A's point
    of view, the first order of business is to gain mastership of the zone
    in which they intend to do the I/O.  None of the machines currently
    have mastership of the zone in question so this requires some
    communication.

    machine A sends a mastership request to itself, to machine B and to
    machine C.  Machine A then awaits responses.

    machine A (responding to itself):  Ok, you can have mastership.

    machine B (responding to A):  Ok, A, you can have mastership.

    machine C (responding to A):  I'd really rather have mastership, so I
				  would like to deny you (A) mastership.
				  (say machine C started its own conflicting
				  I/O at near the same time and wants 
				  mastership too).

    machine C does the same thing... it gives itself an OK and it asks A and
    B whether it can have the OK, but in this case A and B say NO, you can't.

    So you have this situation:

    machine A:		A, B say A can have mastership.  C says no.
    machine C:		C says C can have mastership, A and B say no.

    So who wins ?  The answer is actually simple... you use what is called
    a QUORUM VOTING algorithm.  Whichever machines gets a quorum of 
    affirmative responses is the machine that gains mastership of the zone.

    In this case, the quorum for 3 machines is 2 (it's basically half the
    machines negotiating the request plus one).  Since A has an affirmative
    response from itself and B, it has the quorum and thus mastership, and
    thus A can proceed with the operation.  In fact, A is able to proceed
    the moment it receives affirmation of mastership from EITHER B or C.
    It does not have to wait for BOTH B and C to affirm mastership.  It
    only needs a quorum.

    Once mastership has been assigned, in this case to A, you then have to 
    deal with the fact that A will now have the only up-to-date copy of the
    data and that B and C's copy will be stale and unusable.  If B now wants
    mastership and performs the same quorum negotiation, C, with its stale
    copy *CANNOT* be allowed to return an affirmative response to B's
    request because that would then make another quorum (B and C) and B
    would wind up operating with stale data (because A's journal update
    may not have reached B and/or C yet).

    So another part of the cache coherency protocol is to ensure that
    operations only occur with the correct, non-stale data.

    The way that is done is also quite simple in concept, but a very complex
    implementation in reality.

    What happens is that during A's initial negotiation, A supplies a 
    TRANSACTION ID to represent the zone being modified.  A includes this
    transaction ID in its journal updates, but keep in mind that journal
    updates are performed asynchronously.  But since a quorum of machines
    must have responded affirmatively to A's original request, that also
    means that a quorum of machines now *KNOW* what A's transaction ID is.
    This effects the quorum negotiation in two ways:

    * First, a machine which knows it has out of date data (i.e. it knows
      what A's transaction ID was but it has not yet received the journal 
      update from A with that transaction id)... machines in this state 
      MAY NOT PARTICIPATE in a quorum negotiation.

      e.g. when B now asks for mastership, C will NOT respond in the
      affirmative to B, so B will not be able to get a quorum yet.  In
      fact B will not even respond in affirmative to its own request
      because it also knows that its local copy of the data is stale.

    * As A's modifications are asynchronously propogated to other machines
      in the cluster, these other machines will be able to begin to
      participate in quorum negotiations for the conflicting zone.  Once
      a quorum of machines has the update, B will get sufficient affirmative
      responses to be able to start its own transaction.

    * Now, once B has enough responses, it is still possible for B's own local
      copy of the data to be stale.  But B will know this, because in the
      affirmative responses the other machines have told B what the
      transaction ID of the latest data is.  So B can opt to either wait
      for the asynchronous journal update from one of the other machines,
      or B can proactively request the latest date from one of the other
      machines.

    The cache coherency protocol also has to deal with the case where NO
    machine gets a quorum of affirmative responses.  This can occur if
    several machines try to request mastership of the same zone at the
    same time. 

    e.g. if A, B, and C all simultaniously want mastership, they each
    respond affirmatively to themselves and negatively to the others,
    so none of the machines wind up with a quorum.  There are many ways
    to 'solve' this deadlock.  For example, they could abort, wait a 
    random period of time, and retry, or they could further negotiate
    based on a fixed priority, or negotiate based on a moving token, or
    any number of possible ways.

    Keep in mind here that what I am talking about is the CACHE COHERENCY
    protocol *NOT* the JOURNALING protocol.  In fact, in all my descriptions
    of the cache coherency protocol above you will note that the actual act
    of sending the journal out to other machines in the cluster does not
    get more sophisticated... the journal is still written out asynchronously
    and it can still be buffered by any amount, and with any latency.  That
    is, the journaling mechanism is virtually independant of the cache
    coherency protocol.

    Also note that the cache coherency mechanism I describe above is only
    one of many possible ways of doing cache coherency, and it is also very
    complex.   High performance cache coherency mechanisms are by
    definition very complex beasts.  There is no simple solution or simple
    implementation.  Even my description above is cutting about half a dozen
    other cases that have to be dealt with.  For example, I have not 
    described crash recovery issues with quorum systems at all, and those
    can get pretty complex.  I have not discussed how nodes are added and
    deleted from the cluster from the point of view of the protocol, either,
    and I have not discussed how temporary (memory) caches are maintained.
    It gets very, very, VERY complex, very quickly.

:If mastership can freely (with restrictions!) move to any box in the
:system, doesn't that dictate the cache coherent system must be organized
:_outside_ of all the systems? Since we can't expect something from
:nothing, the steady state should be just that, nothing. All the disks
:...

    No, multi-master cache coherency protocols (with no single master)
    are definitely possible.  I described one above. 

:...
:signals: warning, a journal entry in the pipe (coming, to be followed
:by an IO). Each box listens for external coherency signals (inode by
:UDP, ICMP?) and prepares for a journal and IO. Maybe better than silent

    Not necessarily.  Your thinking processes are tending towards the
    assumption that systems have to be synchronized in order to be able
    to begin executing an operation.  In fact, systems do not have to
    be synchronized.  The negotiation of mastership can be independant of
    the data synchronization mechanism.  A machine can negotiate mastership
    without having the most up-to-date data.  The trick is that it has to
    KNOW that it doesn't have the most up-to-date data and either wait for
    it or proactively fetch it.

:steady state, each node listens for a 'synchronized and ready' heart
:beat from all the other nodes, waits a short interval and broadcasts its
:own 'synchronized and ready' heart beat. Normally (during no operation),
:the beats would align them selves on the delay interval and function as
:indicator the coherency system is up and ready. If a signal is missed
:the data propagation path (mesh) is worked out to originate from the box
:with the missing heartbeat, before the data comes. (kinda like AMD NUMA
:ram bus discovery at power-up)

    Again, your thinking process here is that of a system maintaining a
    fully synchronized state.  This isn't a bad way to think about it,
    but fully synchronous systems do not equate to high performance systems
    so you really have to think 'semi-synchronous' or 'asynchronous' rather
    then 'synchronous' when you think about these protocols.

:>    There is no simple way to do this.  Cache coherency protocols are complex
:>    beasts.  Very complex beasts.  I've written such protocols in the past
:>    and it takes me several man-months to do one (and this is me we are
:>    talking about, lightning-programmer-matt-dillon).  Fortunately having
:>    done it I pretty much know the algorithms by heart now :-)
:
:How's my algorithm? ...I never said it was easy!

    Not a bad start.  Better then most people would have put together having
    never written one of these beasts before.  I am impressed, in fact! :-)

:>    If you are trying to mirror data in a clustered system, ignoring
:>    robustness issues,
:
:What sort of robustness issues do you think I'm ignoring? Diskless
:clients aren't susceptible to disk failure, nor do they need expensive
:local raid.  The idea behind a raid NFS/GFS for diskless clients is all
:the availability robustness is applied, cost effectively, in one place,
:more bang-for-buck as the number of nodes increase. But, maybe you are
:considering another type of robustness?

    Which is more robust, three systems talking to a single RAID box
    or three systems each with their own independant copy of the data
    which use a quorum cache coherency algorithm to stay golden with each
    other?

    RAID method:  Two of the three systems can go down and you are still
		  golden, but if the RAID hardware goes down you are screwed.

    QUORUM method: One of the three systems can go down and you are still
		   golden.

		   But what if you had 5 systems?  7?  31?  Is a quorum of
		   31 (i.e. 16 machines have to be up) more robust then a
		   single piece of RAID hardware?

    But you know, there is no single 'correct' answer to that question.
    For example, the RAID hardware might be so reliable that it would
    normally be more reliable then a 2-out-of-3-systems-are-up configuration,
    but on the other hand the RAID hardware exists in one physical locale
    and it is going to be toast of the building burns down, whereas with
    the QUORUM method, with three machines in three separate buildings,
    you actually could have any one of those three buildings burn down and
    still have a working system.  The network also effects the answer.
    A LAN is far more reliable then a WAN, and that can be a major factor.

:>    One of the things a good cache coherency protocol does is reduce the
:>    amount of duplicate data being transfered between boxes.  Duplicate
:>    information is a real killer.  So in that sense a good cache
:>    coherency algorithm can help a great deal.
:
:sorta like auto-discovery mesh networks? or do you mean rsync like IO?
:Regards,
:// George

    Typically this is a combination of strategic mirroring of data and
    also the caching of data in memory (similar to how a normal disk cache
    works).

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>