Description of the Journaling topology

George Georgalis george at galis.org
Thu Dec 30 23:19:57 PST 2004


On Wed, Dec 29, 2004 at 11:51:48PM -0800, Matthew Dillon wrote:
>
>:...
>:level aren't we concerned about one thing? Atomic transactions. However
>:many "hot" physical devices there are across whatever network, shouldn't
>:they all finish before the exit 0?
>:
>:Minimizing the data to transfer across the slowest segment to a physical
>:device will lower transfer times, unless that procedure (eg compression)
>:overweighs the delay. (I wonder if it is possible to send less data by
>:only transmitting the _changes_ to a block device...)
>
>    Your definition of what constitutes an 'atomic transaction' is not
>    quite right, and that is where the confusion is stemming from.
>
>    An atomic transaction (that is, a cache-coherent transaction) does not
>    necessarily need to push anything out to other machines in order to
>    complete the operation.  All it needs a mastership of the data or 
>    meta-data involved.  For example, if you are trying to create a
>    file O_CREAT|O_EXCL, all you need is mastership of the namespace 
>    representing that file name.
>
>    Note that I am NOT talking about a database 'transaction' in the
>    traditional hard-storage sense, because that is NOT what machines need
>    to do most of the time.

I'm using a simplistic, though I think accurate, perspective. When I
used the term atomic, I simply meant that while a file is being
modified it is not available until that change is complete, i.e. there
is no chance of reading a half-written file. That entire process is
what I meant by a transaction. I'm not sure of a better way to
describe it than "atomic transaction", even though those words often
have other meanings. I'd also call that "transaction" a "commit" to
the filesystem, for similar reasons. Also, I mean a context above the
VFS. Simply put, it's imperative that in user space a file read
cannot complete while that file is open for write, and similarly that
a file write should not alter the outcome of a read that has already
started. I could be wrong about this, but it seems a good requirement
to have.

As for the magic of the VFS, that's all golden and I hope it works! I
really wanted to clarify the above paragraph, or at least get on the
same page as far as the fundamental requirements go. Is the process
above something we can expect from the kernel? Or will the need to
write to a temporary file on the device and mv it to the actual name
once the write is complete continue in user space, for the "atomic"
effect? That is really what I was getting at.
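
For concreteness, the userspace workaround I mean looks roughly like
this temp-file-and-rename pattern (just a sketch; the function name
and error handling are my own invention):

    /*
     * Readers never see a half-written file because the data only
     * shows up under the real name via rename(2), which replaces the
     * name atomically: old contents or new contents, never a mix.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    atomic_replace(const char *path, const char *tmppath,
                   const void *buf, size_t len)
    {
            int fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0644);

            if (fd < 0)
                    return (-1);
            if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                    close(fd);
                    unlink(tmppath);
                    return (-1);
            }
            close(fd);
            /* the "commit" to the filesystem */
            return (rename(tmppath, path));
    }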


If this blocking of reads during a file write is handled by the
kernel (VFS?), wouldn't it greatly simplify a cache-coherent messaging
system? Per your earlier example, the 3 hot mirrors broadcast their
intent to write a file (on a first-come, first-write basis) via the
cache coherency protocol, and the journal socket then follows with the
specific details. The remaining 2 (and indeed the first) hot devices
block file reads until the IO of the associated writes is complete
across the 3 machines. Since this blocking comes from the kernel
(VFS?), there is no race condition with multiple partially written
files. The remaining 20 hosts get notice from the VFS per the cache
coherency protocol, followed by the journal and IO, from the 3 hot
mirrors.
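
In rough C, the read blocking I have in mind amounts to something like
this (purely my own illustration of the idea, not anything from the
actual VFS or journaling code; the structure and function names are
invented):

    #include <pthread.h>

    /* per-file state; writers counts coherent writes in flight
     * anywhere in the hot set */
    struct coherent_file {
            pthread_mutex_t lock;
            pthread_cond_t  idle;   /* signalled when writers drops to 0 */
            int             writers;
    };

    void
    coherent_write_begin(struct coherent_file *cf)
    {
            pthread_mutex_lock(&cf->lock);
            cf->writers++;          /* intent broadcast would happen here */
            pthread_mutex_unlock(&cf->lock);
    }

    void
    coherent_write_end(struct coherent_file *cf)
    {
            pthread_mutex_lock(&cf->lock);
            if (--cf->writers == 0)
                    pthread_cond_broadcast(&cf->idle);
            pthread_mutex_unlock(&cf->lock);
    }

    void
    coherent_read_enter(struct coherent_file *cf)
    {
            pthread_mutex_lock(&cf->lock);
            while (cf->writers > 0) /* block until the hot set is done */
                    pthread_cond_wait(&cf->idle, &cf->lock);
            pthread_mutex_unlock(&cf->lock);
    }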

>
>    This 'mastership' requires communication with the other machine in the
>    cluster, but the communication may have ALREADY occurred sometime in the
>    past.  That is, your machine might ALREADY have mastership of the
>    necessary resources, which means that your machine can conclude the
>    operation without any further communication.

Indeed, it's the unpredictable order in which the warm mirrors get in
sync that gives value to the term coherent!

So, what about conflicts? What happens when a node tries to write a
file, but another node tried to write the same file a tick earlier?
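
To make the question concrete, here's the kind of first-come,
first-write arbitration I'm picturing (entirely my own sketch; the
structure and node ids are made up):

    #include <stdint.h>

    #define NODE_NONE (-1)

    struct resource {
            int32_t master;         /* node currently holding mastership */
    };

    /*
     * Whichever node's request reaches the current arbiter first wins;
     * the loser learns who the master is and must ask that node for
     * mastership (or queue its write behind it).
     */
    int32_t
    request_mastership(struct resource *r, int32_t requester)
    {
            if (r->master == NODE_NONE || r->master == requester)
                    r->master = requester;
            return (r->master);
    }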


>    In otherwords, your machine would be able to execute the create operation
>    and return from the open() without having to synchronously communicate
>    with anyone else, and still maintain a fully cache coherent topology
>    across the entire cluster.

Do you mean a warm mirror can commit to local disk and trigger the 3 hot
mirrors to sync up and propagate to the other 20? Slick!

>    The management of 'mastership' of resources is the responsibility of the
>    cache coherency layer in the system.  It is not the responsibility of the
>    journal.  The journal's only responsibility is to buffer the operation 
>    and shove it out to the other machines, but that can be done 
>    ASYNCHRONOUSLY, long after your machine's open() returned.  It can do
>    this because the other machines will not be able to touch the resource
>    anyway, whether the journal has written it out or not, because they
>    do not have mastership of the resource... they would have to talk to
>    your machine to gain mastership of the resource before they could mess
>    with the namespace which means that your machine then has the opportunity
>    to ensure that the related data has been synchronized to the requesting
>    machine (via the journal) before handing over mastership of the data to
>    that machine.

If mastership can freely (with restrictions!) move to any box in the
system, doesn't that dictate that the cache coherency system must be
organized _outside_ of all the systems? Since we can't expect
something from nothing, the steady state should be just that: nothing.
All the disks are in sync, and they know it because they have
processed all prior coherency signals: a warning that a journal entry
is in the pipe, to be followed by an IO. Each box listens for external
coherency signals (inode by UDP, ICMP?) and prepares for a journal and
IO. Maybe better than a silent steady state, each node listens for a
'synchronized and ready' heartbeat from all the other nodes, waits a
short interval, and broadcasts its own 'synchronized and ready'
heartbeat. Normally (during no operation), the beats would align
themselves on the delay interval and function as an indicator that the
coherency system is up and ready. If a signal is missed, the data
propagation path (mesh) is worked out to originate from the box with
the missing heartbeat, before the data comes. (Kind of like AMD NUMA
RAM bus discovery at power-up.)
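
As a sketch of the heartbeat half of that idea (my own invention; the
port, payload, and function name are hypothetical, and a real
coherency layer would obviously carry far more state than a node id):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define HB_PORT 9999            /* hypothetical heartbeat port */

    /*
     * Broadcast one "synchronized and ready" beat.  The caller is
     * assumed to have created a UDP socket with SO_BROADCAST set and
     * to call this on the agreed interval.
     */
    ssize_t
    send_heartbeat(int sock, int32_t node_id)
    {
            struct sockaddr_in dst;
            uint32_t payload = htonl((uint32_t)node_id);

            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(HB_PORT);
            dst.sin_addr.s_addr = htonl(INADDR_BROADCAST);

            return (sendto(sock, &payload, sizeof(payload), 0,
                (struct sockaddr *)&dst, sizeof(dst)));
    }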


>    There is no simple way to do this.  Cache coherency protocols are complex
>    beasts.  Very complex beasts.  I've written such protocols in the past
>    and it takes me several man-months to do one (and this is me we are
>    talking about, lightning-programmer-matt-dillon).  Fortunately having
>    done it I pretty much know the algorithms by heart now :-)

How's my algorithm? ...I never said it was easy!


>:But here are a few things to ponder, will a 1Gb nfs or 10Gb fiber to a
>:GFS on a fast raid server just be better and cheaper than a bunch of
>:warm mirrors? How much of a performance hit will the journaling code
>:be, especially on local partitions with kernels that only use it for a
>:"shared" mount point? djbdns logging is good example, even if you log to
>:/dev/null, generation of the logged info is a significant performance
>:hit for the app. I guess all I'm saying is, if the journaling is not
>:being used, bypass it!
>
>    Well, what is the purpose of the journaling in this context?  If you
>    are trying to have an independant near-realtime backup of your 
>    filesystem then obviously you can't consolidate it into the same 
>    physical hardware you are running from normally, that would kinda kill
>    the whole point.

I'm not assigning a purpose to journaling here, just considering the
options, again from a simplistic perspective. If the idea behind a
cache-coherent cluster filesystem is performance (it seems there are
other advantages too), how does the bang-for-buck benchmark compare
between coherent nodes with local disks and diskless clients using a
single striped, mirrored (realtime backup) RAID NFS server? Or, if you
have a lot more than 20 hosts, fiber GFS?

Or maybe you are looking for a solution other than brute-force
performance?


>    If you are trying to mirror data in a clustered system, ignoring
>    robustness issues,

What sort of robustness issues do you think I'm ignoring? Diskless
clients aren't susceptible to disk failure, nor do they need expensive
local RAID. The idea behind a RAID NFS/GFS server for diskless clients
is that all the availability robustness is applied, cost-effectively,
in one place, with more bang-for-buck as the number of nodes
increases. But maybe you are considering another type of robustness?


>    The key to performance is multifold.  It isn't just minimizing the
>    amount of data transfered... it's minimizing latency, its being able to
>    asynchronize data transfers so programs do not have to stall waiting

I didn't mean the whole cluster needs to be synchronized, just _file_
reads/writes... advantages/disadvantages

>    One of the things a good cache coherency protocol does is reduce the
>    amount of duplicate data being transfered between boxes.  Duplicate
>    information is a real killer.  So in that sense a good cache
>    coherency algorithm can help a great deal.

Sorta like auto-discovery mesh networks? Or do you mean rsync-like IO?

>    You can think of the cache coherency problem somewhat like the way
>    cpu caches work in SMP systems.  Obviously any given cpu does not have
>    to synchronously talk to all the other cpus every time it does a memory
>    access.  The reason: the cache coherency protocol gives that cpu certain
>    guarentees in various situations that allow the cpu to access a great
>    deal of data from cache instantly, without communicating with the other
>    cpus, yet still maintain an illusion of atomicy.  For example, I'm sure
>    you've heard the comments about the overhead of getting and releasing a
>    mutex in FreeBSD:  "It's fast if the cpu owns the cache line".  
>    "It's slow if several cpus are competing for the same mutex but fast if
>    the same cpu is getting and releasing the mutex over and over again".
>    There's a reason why atomic bus operations are sometimes 'fast' and 
>    sometimes 'slow'.

Recently I heard AMD engineer David O'Brien present the advantages and
nuances of working with the AMD64. SMP NUMA caching scenarios were
discussed at length. What's not in the CPUs (or motherboards) is a
means to do a controlled flash copy (a la DMA) of one memory region to
another, the idea being that if one CPU in a 4-way system becomes free
it can offload a process from another CPU's queue, and get the
associated memory too, without tying up a CPU on the regular memory
bus. That's a lot like how your "multiple paths to sync" coherent
cache filesystem seems to want to work.

Regards,
// George


-- 
George Georgalis, systems architect, administrator Linux BSD IXOYE
http://galis.org/george/ cell:646-331-2027 mailto:george at xxxxxxxxx




