Description of the Journaling topology

Nikita Danilov nikita at clusterfs.com
Sat Jan 1 07:03:06 PST 2005


Matthew Dillon <dillon at xxxxxxxxxxxxxxxxxxxx> writes:

[...]

>
>     So why hasn't it been done or, at least, why isn't it universal after all
>     these years?
>
>     It's a good question.  I think it comes down to how most programmers
>     have been educated over the years.  Its funny, but whenever I build
>     something new the first question I usually get is "what paper is your
>     work based on?".  I get it every time, without fail.  And every time,
>     without fail, I find myself trying to explain to the questioner that
>     I generally do not bother to *READ* research papers...  that I build 
>     systems from scratch based on one or two sentence's worth of concept.

I cannot speak for "most programmers", but as far as the few (mostly
Linux) journalled file systems that I have experience with are
concerned, I am sure the situation is slightly different.

The traditional journalling design that you are describing writes
journalled information twice: once in the log and once "in-place". As
modern file systems tend to journal both data and meta-data, this
doubles the amount of storage traffic. Besides, even with all the
transaction batching, writes to the log require additional seeks. The
latter may not be important for the situation you seem to have in mind,
viz. keeping an off-site log accessible over the network, but it _is_
important for a normal local file system that, in the simplest case,
keeps the log on the same device as the rest of the file system.
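
To make the cost concrete, here is a minimal userspace sketch of that
"write twice" pattern. All names, the record layout, and the absence of
any real recovery logic are my own simplifications for illustration;
this is not the code of any actual file system:

/*
 * Redo-only write-ahead logging, reduced to its skeleton: every
 * journalled block is first appended to the log and forced to disk,
 * and only then written to its home location.  After a crash between
 * the two steps, recovery replays the log record.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

struct log_record {
    uint64_t blocknr;           /* where the block lives "in-place" */
    char     data[BLOCK_SIZE];  /* full block image (redo record)   */
};

/* First write: append the redo record to the log and force it out. */
static int log_append(int log_fd, uint64_t blocknr, const char *block)
{
    struct log_record rec;

    rec.blocknr = blocknr;
    memcpy(rec.data, block, BLOCK_SIZE);
    if (write(log_fd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec))
        return -1;
    return fsync(log_fd);       /* the extra write and the extra seek */
}

/* Second write: the same data goes to its home location. */
static int write_in_place(int dev_fd, uint64_t blocknr, const char *block)
{
    off_t off = (off_t)blocknr * BLOCK_SIZE;

    if (pwrite(dev_fd, block, BLOCK_SIZE, off) != BLOCK_SIZE)
        return -1;
    return fsync(dev_fd);
}

int journalled_write(int log_fd, int dev_fd, uint64_t blocknr,
                     const char *block)
{
    if (log_append(log_fd, blocknr, block) != 0)
        return -1;
    return write_in_place(dev_fd, blocknr, block);
}

When both file descriptors point at the same device, the fsync() on the
log is exactly where the additional seeks come from.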

Attempts to improve the performance of journalling file systems in this
regard mainly revolve around a cluster of (arguably very old) ideas
called variously "shadows" (in the data-base world), "phase trees"
(Tux2), and "wandering logs" (in reiser4). From the little technical
information available on Solaris ZFS, it seems it also uses something
similar.
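
The common trick behind all of these is copy-on-write: the new version
of a block is written once, to a fresh location, and the update is
committed by atomically switching a root pointer, so nothing is written
twice. A toy illustration, with invented names, an in-memory "disk" and
no free-space management worth the name:

/*
 * Copy-on-write in miniature: the new version of a block goes to a
 * fresh location (one write, no log record), and the update is
 * committed by switching a single pointer.
 */
#include <stdint.h>
#include <string.h>

#define NBLOCKS    1024
#define BLOCK_SIZE 4096

struct fs_image {
    char     blocks[NBLOCKS][BLOCK_SIZE]; /* the "disk"              */
    uint64_t map[NBLOCKS];                /* logical -> physical map */
    uint64_t next_free;                   /* next unused block       */
};

/*
 * Update logical block 'lblock'.  The old physical block is never
 * overwritten, so a crash in the middle leaves the previous,
 * consistent version reachable through the old map.
 */
void cow_write(struct fs_image *fs, uint64_t lblock, const char *data)
{
    uint64_t new_phys = fs->next_free++;            /* pick a fresh block */

    memcpy(fs->blocks[new_phys], data, BLOCK_SIZE); /* the only write     */

    /*
     * The "commit": in a real file system this is one atomic write of
     * the tree root or superblock, after which the old block becomes
     * free space again.
     */
    fs->map[lblock] = new_phys;
}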

These solutions, by their very nature, require tight cooperation
between journalling and the intimate innards of the file system, which
mostly rules out any kind of "universal journalling layer".

Add to this other optimizations, like delayed block allocation (which
is almost a must for any modern file system), that also interfere with
journalling in non-trivial ways, and you will see why it is hard to
devise a common API for efficient journalling.
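
For instance, with delayed allocation the block numbers for dirty data
are not even known at write() time, so the journal has nothing to log
until the file system's own writeback path runs. A rough sketch of the
timing problem, with all types and names invented for illustration:

/*
 * write(2) time: the file system only remembers the dirty data; disk
 * blocks are chosen later, at flush time, so until then there is
 * nothing the journal could say about where the data will live.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct delayed_page {
    uint64_t file_offset;       /* position in the file          */
    char     data[BLOCK_SIZE];  /* dirty data, no disk block yet */
};

/* write() path: buffer the data; no allocation, no journal activity. */
struct delayed_page *buffer_write(uint64_t file_offset, const char *data)
{
    struct delayed_page *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    p->file_offset = file_offset;
    memcpy(p->data, data, BLOCK_SIZE);
    return p;                   /* kept on an in-memory dirty list */
}

/* flush path: only now can blocks be allocated and a transaction built. */
void flush_page(struct delayed_page *p,
                uint64_t (*allocate_block)(void),
                void (*journal_and_write)(uint64_t blocknr, const char *data))
{
    uint64_t blocknr = allocate_block();     /* late binding of location */

    journal_and_write(blocknr, p->data);     /* file-system-specific     */
    free(p);
}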

Basically, ext3 is the only Linux file system that uses classical
redo-only WAL logging straight from Gray & Reuter, and lo, there even
is a standalone journalling module (fs/jbd) for it, or for any other
file system that uses plain WAL.
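
From the client side such a layer only needs to know which blocks a
transaction modifies, nothing about the file system's structure, which
is exactly why it can be factored out for plain WAL but not for
wandering-log designs. The handle-based pattern below is a sketch with
invented names, not the actual fs/jbd interface, although jbd's
interface follows the same begin / declare-modified-block / commit
shape:

/*
 * What using a standalone plain-WAL layer looks like from the
 * file-system side.  wal_begin, wal_log_block and wal_commit are
 * invented for this sketch; the bodies are stubs so it compiles.
 */
struct wal_handle { int nblocks; };           /* opaque in a real layer */

static struct wal_handle *wal_begin(int nblocks)
{
    static struct wal_handle h;               /* reserve log space here */
    h.nblocks = nblocks;
    return &h;
}

static void wal_log_block(struct wal_handle *h, unsigned long blocknr)
{
    (void)h;
    (void)blocknr;                            /* copy the block into the log */
}

static void wal_commit(struct wal_handle *h)
{
    (void)h;                                  /* flush log records, then
                                                 allow in-place writeback */
}

/*
 * The layer needs only the set of modified blocks per transaction,
 * nothing about directories, inodes or allocation policy, so any file
 * system doing plain WAL can share it.
 */
void fs_create_file(void)
{
    struct wal_handle *h = wal_begin(2);

    wal_log_block(h, 100);                    /* e.g. an inode table block */
    wal_log_block(h, 200);                    /* e.g. a directory block    */
    wal_commit(h);
}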

So I see the situation as in some sense the opposite of what you are
saying: there is no common journalling API (in Linux) precisely because
modern file system developers are _not_ following the classical papers
on how to do journalling. Instead they explore new mechanisms to
guarantee data consistency, and in this situation the straight-jacket
of a predefined universal API would be an obstacle.

Nikita.




