Description of the Journaling topology
dillon at apollo.backplane.com
Wed Dec 29 23:53:40 PST 2004
:level aren't we concerned about one thing? Atomic transactions. However
:many "hot" physical devices there are across whatever network, shouldn't
:they all finish before the exit 0?
:Minimizing the data to transfer across the slowest segment to a physical
device will lower transfer times, unless that procedure (eg compression)
outweighs the delay. (I wonder if it is possible to send less data by
:only transmitting the _changes_ to a block device...)
Your definition of what constitutes an 'atomic transaction' is not
quite right, and that is where the confusion is stemming from.
An atomic transaction (that is, a cache-coherent transaction) does not
necessarily need to push anything out to other machines in order to
complete the operation. All it needs is mastership of the data or
meta-data involved. For example, if you are trying to create a
file O_CREAT|O_EXCL, all you need is mastership of the namespace
representing that file name.
Note that I am NOT talking about a database 'transaction' in the
traditional hard-storage sense, because that is NOT what machines need
to do most of the time.
This 'mastership' requires communication with the other machine in the
cluster, but the communication may have ALREADY occurred sometime in the
past. That is, your machine might ALREADY have mastership of the
necessary resources, which means that your machine can conclude the
operation without any further communication.
In other words, your machine would be able to execute the create operation
and return from the open() without having to synchronously communicate
with anyone else, and still maintain a fully cache coherent topology
across the entire cluster.
The management of 'mastership' of resources is the responsibility of the
cache coherency layer in the system. It is not the responsibility of the
journal. The journal's only responsibility is to buffer the operation
and shove it out to the other machines, but that can be done
ASYNCHRONOUSLY, long after your machine's open() returned. It can do
this because the other machines will not be able to touch the resource
anyway, whether the journal has written it out or not, because they
do not have mastership of the resource... they would have to talk to
your machine to gain mastership of the resource before they could mess
with the namespace, which means that your machine then has the opportunity
to ensure that the related data has been synchronized to the requesting
machine (via the journal) before handing over mastership of the data to
the other machine.
There is no simple way to do this. Cache coherency protocols are complex
beasts. Very complex beasts. I've written such protocols in the past
and it takes me several man-months to do one (and this is me we are
talking about, lightning-programmer-matt-dillon). Fortunately having
done it I pretty much know the algorithms by heart now :-)
:But here are a few things to ponder, will a 1Gb nfs or 10Gb fiber to a
:GFS on a fast raid server just be better and cheaper than a bunch of
:warm mirrors? How much of a performance hit will the journaling code
:be, especially on local partitions with kernels that only use it for a
:"shared" mount point? djbdns logging is good example, even if you log to
:/dev/null, generation of the logged info is a significant performance
:hit for the app. I guess all I'm saying is, if the journaling is not
:being used, bypass it!
Well, what is the purpose of the journaling in this context? If you
are trying to have an independent near-realtime backup of your
filesystem then obviously you can't consolidate it into the same
physical hardware you are running from normally, that would kinda kill
the whole point.
As for performance... well, if you are journaling over a socket then
the overhead from the point of view of the originating machine is
basically the overhead of writing the journal to a TCP socket (read:
not very high relative to other mechanisms). Bandwidth is always
an issue, but only if there is actual interference with other activities.
If you are trying to mirror data in a clustered system, ignoring
robustness issues, then the question is whether the clustered system
is going to be more efficient with three mirrored sources for the
filesystem data or with just one source. If the cost of getting the
data to the other mirrors is small then there is an obvious performance
benefit to being able to pull it from several places rather than from
just one place. For one thing, you might not need 10G links to a
single consolidated server... 1G links to multiple machines might be
sufficient. Cost is no small issue here, either.
There are lots of variables in the equation, and no simple answer.
:As far as a coherent VFS cache protocol, I'm reminded of wise words from
:Josh, a db programmer, "the key to performance is in minimizing the
:quantity of data," ie use bit tokens instead of keywords in the db. And,
:it was Ike that put the Spread toolkit in my "try someday" list,
:George Georgalis, systems architect, administrator Linux BSD IXOYE
The key to performance is multifold. It isn't just minimizing the
amount of data transferred... it's minimizing latency, it's being able to
asynchronize data transfers so programs do not have to stall waiting
for things, and it's being able to already have the data present locally
when it is requested rather than having to go over the wire to get it
every time. What is faster, gzip'ing a 1G file and transporting it over
a GigE network, or transporting the uncompressed 1G file over a GigE
network? The answer is: it depends on how fast your cpu is and has little
to do with how fast your network is (beyond a certain point).
Now, of course, minimizing data *IS* the win if you are comparing a
brute-force algorithm with one that only transports needed data. E.g. if
a program is trying to read 4KB out of a 10GB file it is obviously faster
to just ship the 4KB over rather than ship the entire 10GB file.
One of the things a good cache coherency protocol does is reduce the
amount of duplicate data being transferred between boxes. Duplicate
information is a real killer. So in that sense a good cache
coherency algorithm can help a great deal.
You can think of the cache coherency problem somewhat like the way
cpu caches work in SMP systems. Obviously any given cpu does not have
to synchronously talk to all the other cpus every time it does a memory
access. The reason: the cache coherency protocol gives that cpu certain
guarantees in various situations that allow the cpu to access a great
deal of data from cache instantly, without communicating with the other
cpus, yet still maintain an illusion of atomicity. For example, I'm sure
you've heard the comments about the overhead of getting and releasing a
mutex in FreeBSD: "It's fast if the cpu owns the cache line".
"It's slow if several cpus are competing for the same mutex but fast if
the same cpu is getting and releasing the mutex over and over again".
There's a reason why atomic bus operations are sometimes 'fast' and
sometimes 'slow'.