Description of the Journaling topology
dillon at apollo.backplane.com
Fri Dec 31 01:13:24 PST 2004
:I'm using a simplistic, though, I think, accurate perspective. When I
:used the term atomic, I simply meant when a file is (being) modified it
:is not available until that change is complete, ie no chance of reading
:a half written file. This entire process is what I meant by transaction.
:I'm not sure of a better way to describe that than "atomic transaction"
:even though those words often have other meaning. I'd also call that
:"transaction" a "commit" to the filesystem for similar reasons. Also,
:I mean a context above the VFS. Simply, it's imperative that in user
:space, a file read cannot complete if that file is open for write,
:similarly a file write should not alter the the outcome of a read that
:has already started. I could be wrong about this, but it seems a good
:requirement to have.
These are the types of guarentees that a cache coherency protocol is
able to make.
:the fundamental requirements. Is that process above something we can
:expect from the kernel? Or, will the need to write to a temporary file
:on the device and mv it to the actual name when the write is complete,
:continue in user space, for the "atomic" effect? Which is really what I
:was getting at.
:If this read blocking while file write process is handled by the
:kernel (VFS?), wouldn't it greatly simplify a cache-coherent messaging
The atomic effect is accomplished by having a cache coherency protocol
which locks out conflicting operations until the conflict is resolved.
So, for example, if you were to do a 2GB write() and it took 30 seconds
to accomplish this, then the cache coherency mechanism might prevent
anyone else from reading any portion of the seek range involved in the
write until the write completes.
:> This 'mastership' requires communication with the other machine in the
:> cluster, but the communication may have ALREADY occurred sometime in the
:> past. That is, your machine might ALREADY have mastership of the
:> necessary resources, which means that your machine can conclude the
:> operation without any further communication.
:Indeed it's the unpredictable order by which the warm mirrors get in
:sync that gives value to the term coherent!
:So, what about conflicts? What happens when a node tries to write
:a file, but another node also tries to write the same file, a tick
:Do you mean a warm mirror can commit to local disk and trigger the 3 hot
:mirrors to sync up and propagate to the other 20? Slick!
:<other questions about cache coherency mechanisms>
Ok, you are asking a lot of good questions here and most of them relate
to the same mechanism so I will tackle all of them at once.
First, lets consider the case where you have 3 machines which in a
multi-master configuration. That is, you do NOT have a designated
'master' machine, and now several of these machines wish to issue
conflicting I/O's to a common set of data that all three machines
have independant copies of. Lets say that all three copies are
synchronized to start with.
From any given machine's point of view, lets say from machine A's point
of view, the first order of business is to gain mastership of the zone
in which they intend to do the I/O. None of the machines currently
have mastership of the zone in question so this requires some
machine A sends a mastership request to itself, to machine B and to
machine C. Machine A then awaits responses.
machine A (responding to itself): Ok, you can have mastership.
machine B (responding to A): Ok, A, you can have mastership.
machine C (responding to A): I'd really rather have mastership, so I
would like to deny you (A) mastership.
(say machine C started its own conflicting
I/O at near the same time and wants
machine C does the same thing... it gives itself an OK and it asks A and
B whether it can have the OK, but in this case A and B say NO, you can't.
So you have this situation:
machine A: A, B say A can have mastership. C says no.
machine C: C says C can have mastership, A and B say no.
So who wins ? The answer is actually simple... you use what is called
a QUORUM VOTING algorithm. Whichever machines gets a quorum of
affirmative responses is the machine that gains mastership of the zone.
In this case, the quorum for 3 machines is 2 (it's basically half the
machines negotiating the request plus one). Since A has an affirmative
response from itself and B, it has the quorum and thus mastership, and
thus A can proceed with the operation. In fact, A is able to proceed
the moment it receives affirmation of mastership from EITHER B or C.
It does not have to wait for BOTH B and C to affirm mastership. It
only needs a quorum.
Once mastership has been assigned, in this case to A, you then have to
deal with the fact that A will now have the only up-to-date copy of the
data and that B and C's copy will be stale and unusable. If B now wants
mastership and performs the same quorum negotiation, C, with its stale
copy *CANNOT* be allowed to return an affirmative response to B's
request because that would then make another quorum (B and C) and B
would wind up operating with stale data (because A's journal update
may not have reached B and/or C yet).
So another part of the cache coherency protocol is to ensure that
operations only occur with the correct, non-stale data.
The way that is done is also quite simple in concept, but a very complex
implementation in reality.
What happens is that during A's initial negotiation, A supplies a
TRANSACTION ID to represent the zone being modified. A includes this
transaction ID in its journal updates, but keep in mind that journal
updates are performed asynchronously. But since a quorum of machines
must have responded affirmatively to A's original request, that also
means that a quorum of machines now *KNOW* what A's transaction ID is.
This effects the quorum negotiation in two ways:
* First, a machine which knows it has out of date data (i.e. it knows
what A's transaction ID was but it has not yet received the journal
update from A with that transaction id)... machines in this state
MAY NOT PARTICIPATE in a quorum negotiation.
e.g. when B now asks for mastership, C will NOT respond in the
affirmative to B, so B will not be able to get a quorum yet. In
fact B will not even respond in affirmative to its own request
because it also knows that its local copy of the data is stale.
* As A's modifications are asynchronously propogated to other machines
in the cluster, these other machines will be able to begin to
participate in quorum negotiations for the conflicting zone. Once
a quorum of machines has the update, B will get sufficient affirmative
responses to be able to start its own transaction.
* Now, once B has enough responses, it is still possible for B's own local
copy of the data to be stale. But B will know this, because in the
affirmative responses the other machines have told B what the
transaction ID of the latest data is. So B can opt to either wait
for the asynchronous journal update from one of the other machines,
or B can proactively request the latest date from one of the other
The cache coherency protocol also has to deal with the case where NO
machine gets a quorum of affirmative responses. This can occur if
several machines try to request mastership of the same zone at the
e.g. if A, B, and C all simultaniously want mastership, they each
respond affirmatively to themselves and negatively to the others,
so none of the machines wind up with a quorum. There are many ways
to 'solve' this deadlock. For example, they could abort, wait a
random period of time, and retry, or they could further negotiate
based on a fixed priority, or negotiate based on a moving token, or
any number of possible ways.
Keep in mind here that what I am talking about is the CACHE COHERENCY
protocol *NOT* the JOURNALING protocol. In fact, in all my descriptions
of the cache coherency protocol above you will note that the actual act
of sending the journal out to other machines in the cluster does not
get more sophisticated... the journal is still written out asynchronously
and it can still be buffered by any amount, and with any latency. That
is, the journaling mechanism is virtually independant of the cache
Also note that the cache coherency mechanism I describe above is only
one of many possible ways of doing cache coherency, and it is also very
complex. High performance cache coherency mechanisms are by
definition very complex beasts. There is no simple solution or simple
implementation. Even my description above is cutting about half a dozen
other cases that have to be dealt with. For example, I have not
described crash recovery issues with quorum systems at all, and those
can get pretty complex. I have not discussed how nodes are added and
deleted from the cluster from the point of view of the protocol, either,
and I have not discussed how temporary (memory) caches are maintained.
It gets very, very, VERY complex, very quickly.
:If mastership can freely (with restrictions!) move to any box in the
:system, doesn't that dictate the cache coherent system must be organized
:_outside_ of all the systems? Since we can't expect something from
:nothing, the steady state should be just that, nothing. All the disks
No, multi-master cache coherency protocols (with no single master)
are definitely possible. I described one above.
:signals: warning, a journal entry in the pipe (coming, to be followed
:by an IO). Each box listens for external coherency signals (inode by
:UDP, ICMP?) and prepares for a journal and IO. Maybe better than silent
Not necessarily. Your thinking processes are tending towards the
assumption that systems have to be synchronized in order to be able
to begin executing an operation. In fact, systems do not have to
be synchronized. The negotiation of mastership can be independant of
the data synchronization mechanism. A machine can negotiate mastership
without having the most up-to-date data. The trick is that it has to
KNOW that it doesn't have the most up-to-date data and either wait for
it or proactively fetch it.
:steady state, each node listens for a 'synchronized and ready' heart
:beat from all the other nodes, waits a short interval and broadcasts its
:own 'synchronized and ready' heart beat. Normally (during no operation),
:the beats would align them selves on the delay interval and function as
:indicator the coherency system is up and ready. If a signal is missed
:the data propagation path (mesh) is worked out to originate from the box
:with the missing heartbeat, before the data comes. (kinda like AMD NUMA
:ram bus discovery at power-up)
Again, your thinking process here is that of a system maintaining a
fully synchronized state. This isn't a bad way to think about it,
but fully synchronous systems do not equate to high performance systems
so you really have to think 'semi-synchronous' or 'asynchronous' rather
then 'synchronous' when you think about these protocols.
:> There is no simple way to do this. Cache coherency protocols are complex
:> beasts. Very complex beasts. I've written such protocols in the past
:> and it takes me several man-months to do one (and this is me we are
:> talking about, lightning-programmer-matt-dillon). Fortunately having
:> done it I pretty much know the algorithms by heart now :-)
:How's my algorithm? ...I never said it was easy!
Not a bad start. Better then most people would have put together having
never written one of these beasts before. I am impressed, in fact! :-)
:> If you are trying to mirror data in a clustered system, ignoring
:> robustness issues,
:What sort of robustness issues do you think I'm ignoring? Diskless
:clients aren't susceptible to disk failure, nor do they need expensive
:local raid. The idea behind a raid NFS/GFS for diskless clients is all
:the availability robustness is applied, cost effectively, in one place,
:more bang-for-buck as the number of nodes increase. But, maybe you are
:considering another type of robustness?
Which is more robust, three systems talking to a single RAID box
or three systems each with their own independant copy of the data
which use a quorum cache coherency algorithm to stay golden with each
RAID method: Two of the three systems can go down and you are still
golden, but if the RAID hardware goes down you are screwed.
QUORUM method: One of the three systems can go down and you are still
But what if you had 5 systems? 7? 31? Is a quorum of
31 (i.e. 16 machines have to be up) more robust then a
single piece of RAID hardware?
But you know, there is no single 'correct' answer to that question.
For example, the RAID hardware might be so reliable that it would
normally be more reliable then a 2-out-of-3-systems-are-up configuration,
but on the other hand the RAID hardware exists in one physical locale
and it is going to be toast of the building burns down, whereas with
the QUORUM method, with three machines in three separate buildings,
you actually could have any one of those three buildings burn down and
still have a working system. The network also effects the answer.
A LAN is far more reliable then a WAN, and that can be a major factor.
:> One of the things a good cache coherency protocol does is reduce the
:> amount of duplicate data being transfered between boxes. Duplicate
:> information is a real killer. So in that sense a good cache
:> coherency algorithm can help a great deal.
:sorta like auto-discovery mesh networks? or do you mean rsync like IO?
Typically this is a combination of strategic mirroring of data and
also the caching of data in memory (similar to how a normal disk cache
<dillon at xxxxxxxxxxxxx>
More information about the Kernel