Description of the Journaling topology

Thu Dec 30 16:27:27 PST 2004

I think that you miss the main idea of a journaling fs, which is that 
the filesystem ensures that the journal entry for an operation is always 
created *before* the relevant operation physically takes place. This 
isn't guaranteed in your design.

Yes, some buffering may apply, and is applied in existing 
implementations, but the filesystem should *never* commit an actual 
update before the corresponding journal entry. In your case, however, 
it is possible that the filesystem will commit some changes to physical 
storage and, due to buffering, lose the corresponding journal entry in 
the case of a crash. In that case, an attempt to replay the journal may 
have disastrous consequences, since you will not know what was changed 
by that "lost" operation.

Therefore, IMO, it is impossible to do journaling without co-operation 
from the filesystem code and without acks from whoever actually records 
the journal entries to persistent storage.
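
The ordering rule above can be sketched in a few lines. This is only a 
userspace illustration, not kernel or filesystem code: the function 
names, the JSON-lines record format, and the use of os.fsync() as the 
"ack" from stable storage are all assumptions made for the example.

```python
import json
import os
import tempfile


def journaled_write(journal_path, data_path, offset, payload):
    """Record the intended update durably, then apply it to the data file."""
    record = {"offset": offset, "data": payload.hex()}
    with open(journal_path, "ab") as journal:
        journal.write(json.dumps(record).encode() + b"\n")
        journal.flush()
        os.fsync(journal.fileno())  # stand-in for the ack: entry is on disk

    # Only after the journal entry is acknowledged may the real update happen.
    mode = "r+b" if os.path.exists(data_path) else "w+b"
    with open(data_path, mode) as data:
        data.seek(offset)
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())


def replay(journal_path, data_path):
    """Reapply every journaled update; repeating a write is harmless."""
    mode = "r+b" if os.path.exists(data_path) else "w+b"
    with open(journal_path, "rb") as journal, open(data_path, mode) as data:
        for line in journal:
            record = json.loads(line)
            data.seek(record["offset"])
            data.write(bytes.fromhex(record["data"]))
```

If the two steps in journaled_write() were swapped, a crash between 
them would leave a change in the data file that replay() knows nothing 
about, which is exactly the "lost" operation described above.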

-Maxim

Matthew Dillon wrote:
:
:Rahul Siddharthan wrote:
:> I'm no expert but I thought the traditional case was fast recovery to
:> a consistent filesystem state (avoiding a long fsck), not recovery of
:> buffered data or fast writing of buffered data to disk.  I'm pretty
:> sure ext3, for example, with its default async mount, is very
:> susceptible to losing data.  ufs+softupdates most certainly can lose a
:> lot of buffered data.
:> 
:> Rahul
:
:A buffer is not a journal, it's a buffer. Journaling file systems put the 
:journal ON DISK--if power is lost you replay the journal FROM DISK to 
:recover a consistent file system. This scheme will not allow that because 
:the journal is kept in memory. You can use it for transparent backup, 
:but how useful is it for recovery from crashes/power loss? It seems like 
:transaction-based VFS mirroring, but you cannot replay the journal if 
:the system crashes or otherwise reboots unexpectedly.

    I think you are a little confused, Gary.  The journal we are talking
    about is buffered, yes, but only for a short period of time (e.g. less
    than a second).  This is NO DIFFERENT from what a journaling filesystem
    does.  When you type 'mkdir blah' in a journaling filesystem it does 
    *NOT* instantly write the operation out to the journal.  Disk performance
    would go completely to pot if it did that.

    All high performance filesystems buffer to some degree.  It is not 
    possible to build a high performance filesystem that does not buffer
    (that is, it would no longer be 'high performance' if it didn't).

    The key issue here is not that buffering is occurring, but how long the
    data remains in the buffer before it gets shipped off to hard storage
    somewhere (locally or over the net).  That's the issue.  Consider
    something like, oh, a RAID system's battery-backed RAM...
    that would be considered hard storage, but it does not and cannot 
    replace the buffering that the kernel does.

    So what you gain during crash recovery is the ability to restore the
    filesystem to its state up to N seconds before the crash, where N 
    depends on the filesystem.  With a softupdates filesystem N could be
    upwards of 30 seconds.  With ReiserFS I would expect N to be in the
    < 10 second range.  But N will never be 0.  The journaling I am
    implementing would allow N to be programmed.  It could be as little
    as a millisecond or as much as the memory buffer can hold depending on
    the system operator's preference.  
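
    A programmable N can be sketched as a buffer that flushes once its
    oldest record has waited max_delay seconds or the buffer has reached
    max_bytes.  This is a toy model, not the actual implementation: the
    class, its parameters, and the sink callable are hypothetical names
    chosen for the example.

```python
import time


class BufferedJournal:
    """In-memory journal buffer; records wait at most max_delay seconds."""

    def __init__(self, sink, max_delay, max_bytes, clock=time.monotonic):
        self.sink = sink            # hypothetical callable: gets a list of records
        self.max_delay = max_delay  # the programmable N, in seconds
        self.max_bytes = max_bytes  # upper bound on buffered data
        self.clock = clock          # injectable so the policy can be tested
        self.pending = []
        self.pending_bytes = 0
        self.oldest = None

    def append(self, record):
        if not self.pending:
            self.oldest = self.clock()  # start timing the oldest record
        self.pending.append(record)
        self.pending_bytes += len(record)
        self.poll()

    def poll(self):
        """Flush if the oldest record is too old or the buffer is full."""
        if not self.pending:
            return
        too_old = self.clock() - self.oldest >= self.max_delay
        too_big = self.pending_bytes >= self.max_bytes
        if too_old or too_big:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink(self.pending)
            self.pending = []
            self.pending_bytes = 0
            self.oldest = None
```

    In a real system something like a periodic timer would call poll();
    after a crash, at most max_delay seconds of operations are missing
    from the journal.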

    Even more key is the off-site capability.  If the journal is a TCP
    connection to another machine the buffering delay could be as little
    as a millisecond before the data gets to the target machine, and the
    local disks would not be impacted at all.  The originating machine 
    could immediately crash without really messing anything up, even if
    the data has not yet been committed to hard storage on the target
    machine.  The target machine could be configured to buffer the data
    again before committing to hard storage, or it could commit it 
    immediately.  A key performance issue is that a target machine could 
    be dedicated to journaled backups of other machines in a cluster and
    basically only have to issue linear writes, yielding very high 
    performance. 
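
    The off-site case can be sketched as a stream of length-prefixed
    records that the target appends to a single log file.  The 4-byte
    length header and the function names below are assumptions made for
    the example, not a description of the real wire format.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length


def send_record(sock, payload):
    """Ship one journal record over the stream."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _read_exact(sock, count):
    """Read exactly count bytes, or return None if the peer closed first."""
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            return None
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def receive_into_log(sock, log_path):
    """Append every complete record to log_path; return how many were stored."""
    stored = 0
    with open(log_path, "ab") as log:
        while True:
            header = _read_exact(sock, HEADER.size)
            if header is None:
                break
            (length,) = HEADER.unpack(header)
            payload = _read_exact(sock, length)
            if payload is None:
                break  # sender died mid-record: drop the partial record
            log.write(header + payload)  # append-only, so writes stay linear
            stored += 1
    return stored
```

    Because the target only ever appends, its disk sees linear writes,
    and a sender that crashes mid-record costs nothing more than that
    one incomplete record.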

    So there are some very practical and desirable traits being discussed
    here.
					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>