Description of the Journaling topology
dillon at apollo.backplane.com
Wed Dec 29 21:04:51 PST 2004
:Rahul Siddharthan wrote:
:> I'm no expert but I thought the traditional case was fast recovery to
:> a consistent filesystem state (avoiding a long fsck), not recovery of
:> buffered data or fast writing of buffered data to disk. I'm pretty
:> sure ext3, for example, with its default async mount, is very
:> susceptible to losing data. ufs+softupdates most certainly can lose a
:> lot of buffered data.
:A buffer is not a journal, it's a buffer. Journaling file systems put the
:journal ON DISK--if power is lost you replay the journal FROM DISK to
:recover a consistent file system. This scheme will not allow that because
:the journal is kept in memory. You can use it for transparent backup,
:but how useful is it for recovery from crashes/power loss? It seems like
:transaction based VFS mirroring, but you cannot replay the journal if
:the system crashes or otherwise reboots unexpectedly.
I think you are a little confused, Gary. The journal we are talking
about is buffered, yes, but only for a short period of time (e.g. less
than a second). This is NO DIFFERENT from what a journaling filesystem
does. When you type 'mkdir blah' in a journaling filesystem it does
*NOT* instantly write the operation out to the journal. Disk performance
would go completely to pot if it did that.
All high performance filesystems buffer to some degree. It is not
possible to build a high performance filesystem that does not buffer
(that is, it would no longer be 'high performance' if it didn't).
The key issue here is not that buffering is occurring, but how long the
data remains in the buffer before it gets shipped off to hard storage
somewhere (locally or over the net). That's the issue. And consider
something like, oh, a RAID system's battery-backed RAM... that would
be considered hard storage, but it does not and cannot replace the
buffering that the kernel does.
So what you gain during crash recovery is the ability to restore the
filesystem to its state up to N seconds before the crash, where N
depends on the filesystem. With a softupdates filesystem N could be
upwards of 30 seconds. With ReiserFS I would expect N to be in the
< 10 second range. But N will never be 0. The journaling I am
implementing would allow N to be programmed. It could be as little
as a millisecond or as much as the memory buffer can hold depending on
the system operator's preference.
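To make the "programmable N" idea concrete, here is a minimal user-space
sketch in Python. It is not the kernel implementation; the class name
BufferedJournal, the sink callable, and the max_delay/max_bytes parameters
are all assumptions made up for illustration. Records sit in memory until
the oldest one has waited max_delay seconds (the N above) or the buffer is
full, and only then get shipped to hard storage.

```python
import time


class BufferedJournal:
    """In-memory journal buffer, flushed to a sink after at most max_delay seconds.

    Hypothetical illustration only: 'sink' stands in for hard storage
    (a local disk, or a connection to another machine).
    """

    def __init__(self, sink, max_delay, max_bytes, clock=time.monotonic):
        self.sink = sink            # callable that receives a list of records
        self.max_delay = max_delay  # the programmable N, in seconds
        self.max_bytes = max_bytes  # how much the memory buffer can hold
        self.clock = clock          # injectable so the delay can be tested
        self.pending = []
        self.pending_bytes = 0
        self.oldest = None          # arrival time of the oldest unflushed record

    def append(self, record):
        now = self.clock()
        if not self.pending:
            self.oldest = now
        self.pending.append(record)
        self.pending_bytes += len(record)
        self.poll(now)

    def poll(self, now=None):
        """Flush if the oldest record has waited max_delay or the buffer is full."""
        if not self.pending:
            return
        now = self.clock() if now is None else now
        too_old = now - self.oldest >= self.max_delay
        too_big = self.pending_bytes >= self.max_bytes
        if too_old or too_big:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink(self.pending)
            self.pending = []
            self.pending_bytes = 0
            self.oldest = None
```

A caller would invoke poll() from a timer. After a crash, whatever is still
in 'pending' is lost, and in this sketch that is never more than max_delay
seconds old as long as poll() runs often enough.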
Even more key is the off-site capability. If the journal is a TCP
connection to another machine the buffering delay could be as little
as a millisecond before the data gets to the target machine, and the
local disks would not be impacted at all. The originating machine
could immediately crash without really messing anything up, even if
the data has not yet been committed to hard storage on the target
machine. The target machine could be configured to buffer the data
again before committing to hard storage, or it could commit it
immediately. A key performance issue is that a target machine could
be dedicated to journaled backups of other machines in a cluster and
basically only have to issue linear writes, yielding very high
So there are some very practical and desirable traits being discussed
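
The off-site case can be sketched the same way. The Python below is an
illustration under stated assumptions, not the actual journal stream
format: the 4-byte length prefix and the names journal_target, send_record,
read_log and demo are all invented here. The originating side pushes
records down a TCP connection; the target side does nothing but append
them to a log file, so its disk only ever sees linear writes.

```python
import os
import socket
import struct
import tempfile
import threading


def recv_exact(conn, n):
    """Read exactly n bytes, or return None if the peer closed early."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def journal_target(server, log_path):
    """Accept one connection and append each complete record to log_path."""
    conn, _ = server.accept()
    with conn, open(log_path, "ab") as log:
        while True:
            header = recv_exact(conn, 4)
            if header is None:
                break
            (length,) = struct.unpack("!I", header)
            payload = recv_exact(conn, length)
            if payload is None:
                break  # sender died mid-record: drop the partial record
            log.write(header + payload)  # append-only, hence linear writes


def send_record(sock, payload):
    """Originating side: ship one length-prefixed record to the target."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def read_log(log_path):
    """Read the records back from the target's log, in order."""
    records = []
    with open(log_path, "rb") as log:
        while True:
            header = log.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("!I", header)
            records.append(log.read(length))
    return records


def demo():
    fd, log_path = tempfile.mkstemp()
    os.close(fd)
    server = socket.create_server(("127.0.0.1", 0))  # any free local port
    port = server.getsockname()[1]
    target = threading.Thread(target=journal_target, args=(server, log_path))
    target.start()
    with socket.create_connection(("127.0.0.1", port)) as sock:
        send_record(sock, b"mkdir blah")
        send_record(sock, b"write blah/file")
    target.join()
    server.close()
    records = read_log(log_path)
    os.remove(log_path)
    return records
```

Calling demo() runs both ends on one machine over the loopback interface.
The sketch leaves the commit policy to the file object's own buffering; a
target that wants to commit immediately would call log.flush() and
os.fsync(log.fileno()) after each write.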
<dillon at xxxxxxxxxxxxx>