Journaling layer update - any really good programmer want to start working on the userland journal scanning utility? You need to have a LOT of time available!

Fri Mar 4 14:43:54 PST 2005

Matthew Dillon wrote:

:I have a question about this. How far can the UNDO system go? Will it
:be some kind of redo-log (think VxFS here) or will it allow arbitrary
:undo levels? Maybe it sounds crazy, but you could have 'unlimited' undo
:where unlimited amounts to available free disk space. It complicates
:the free block management a bit, but it's certainly doable. Basically
:the journal code should have an undo bitmap. It can allocate any free
:data block to store undo information. As soon as a datablock is needed
:by the system the oldest undo datablock gets used to store data. The
:filesystem doesn't need to know about it, just the journaling code,
:although balloc now needs to go through the journal code to make sure
:the undo blocks bitmap gets updated as well. This bitmap could live
:anywhere in the disk.
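
The block-reclaim idea in the question above can be sketched roughly as follows. This is only an illustration under stated assumptions, not anything from the actual journaling code: the class `UndoBlockPool` and its methods `alloc_undo` and `balloc` are hypothetical names, and a deque stands in for the proposed on-disk undo bitmap plus the age ordering it would need.

```python
from collections import deque


class UndoBlockPool:
    """Toy model (hypothetical): free blocks may be borrowed for undo data
    and are reclaimed oldest-first when a real data block is needed."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))  # blocks holding nothing at all
        self.undo = deque()              # blocks holding undo data, oldest first
        self.data = set()                # blocks holding live file data

    def _take_block(self):
        # Prefer a truly free block; otherwise sacrifice the oldest undo block.
        if self.free:
            blk = min(self.free)
            self.free.remove(blk)
            return blk
        if self.undo:
            return self.undo.popleft()
        return None

    def alloc_undo(self):
        """Store undo information in any available block (None if disk is full)."""
        blk = self._take_block()
        if blk is not None:
            self.undo.append(blk)
        return blk

    def balloc(self):
        """Allocate a data block; goes through the undo bookkeeping first."""
        blk = self._take_block()
        if blk is None:
            raise OSError("no free or reclaimable blocks")
        self.data.add(blk)
        return blk
```

In this sketch the amount of undo history shrinks automatically as live data grows, which is the 'unlimited up to free disk space' behaviour the question describes.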

    The way the UNDO works is that the journal is made 'reversible', meaning
    that you can run the journal forwards and execute the related filesystem
    operations in order to 'track' a filesystem, or you can run the journal
    backwards in order to reverse-time-index a filesystem.  So what you get
    is an infinitely fine-grained undo capability.

    So, for example, let's say someone does a file truncation.  A forward-only
    journal would store this:

    truncate {
	file path / vnode id
	at offset BLAH
    }

    A reversible journal would store this:

    truncate {
	file path / vnode id
	at offset BLAH
	UNDO {
	    extend to offset original_blah
	    write {		(the data that was destroyed)
		at offset
		data
	    }
	    write {
		at offset
		data
	    }
	}
    }

    If someone were to do a write() which overwrites prior data, the UNDO
    section would have to contain the original data.
    If you were to remove a 4GB file the journal would have to store 4GB 
    worth of data to UNDO that operation.
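
A minimal sketch of the reversible-record idea described above, under loud assumptions: the filesystem is modelled as a plain dict mapping a path to a `bytearray`, truncation only ever shrinks a file, and every name here (`journaled`, `apply`, `make_undo`, `revert`) is hypothetical and not taken from the actual journaling layer. Each record carries an UNDO part holding the original length and the bytes the operation destroyed, so the same journal can be run forwards or backwards.

```python
def make_undo(fs, rec):
    """Capture whatever the operation is about to destroy."""
    buf = fs[rec["path"]]
    if rec["op"] == "write":
        start, end = rec["offset"], rec["offset"] + len(rec["data"])
    elif rec["op"] == "truncate":
        start, end = rec["offset"], len(buf)
    else:  # "remove": the whole file has to be saved
        start, end = 0, len(buf)
    return {"old_len": len(buf), "offset": start, "data": bytes(buf[start:end])}


def apply(fs, rec):
    """Run one journal record forwards."""
    if rec["op"] == "write":
        buf = fs[rec["path"]]
        end = rec["offset"] + len(rec["data"])
        if len(buf) < end:
            buf.extend(bytes(end - len(buf)))  # zero-fill when extending
        buf[rec["offset"]:end] = rec["data"]
    elif rec["op"] == "truncate":
        del fs[rec["path"]][rec["offset"]:]
    elif rec["op"] == "remove":
        del fs[rec["path"]]


def revert(fs, rec):
    """Run one journal record backwards using its UNDO section."""
    undo = rec["undo"]
    buf = fs.setdefault(rec["path"], bytearray())  # recreates a removed file
    if len(buf) < undo["old_len"]:
        buf.extend(bytes(undo["old_len"] - len(buf)))  # "extend to original"
    else:
        del buf[undo["old_len"]:]
    buf[undo["offset"]:undo["offset"] + len(undo["data"])] = undo["data"]


def journaled(fs, journal, op, path, offset=0, data=b""):
    """Perform an operation and append a reversible record to the journal."""
    rec = {"op": op, "path": path, "offset": offset, "data": bytes(data)}
    rec["undo"] = make_undo(fs, rec)
    apply(fs, rec)
    journal.append(rec)
```

Replaying `apply` over the records in order would track a second copy of the filesystem; calling `revert` over them in reverse order walks the original back in time, one operation at a time. Note that the UNDO section of a `remove` holds the entire file contents, which is the 4GB cost mentioned above.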

    But what you get is a complete audit trail, the ability to run the 
    journal forwards or backwards when reconstructing a filesystem or
    database, the ability to not only undelete a file but also to restore
    it (or an entire subhierarchy) as-of a particular date, the ability to
    figure out exactly how a security event might have occurred, and a host of
    other capabilities.

    The UNDO capability will be optional, of course, but I see a lot of
    people, including myself, using it.  All of that disk space has to be 
    good for something, eh?

    --

    I just finished running a buildworld test with /usr/obj journaled 
    (without UNDO since it isn't done yet).  The buildworld only took 10
    seconds longer to run!  The journal generated from /usr/obj was about
    785 MB uncompressed, and 231 MB compressed.  Of course, nobody would
    want to journal /usr/obj in real life, but it's a good test!

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>

:Sounds very well thought out, Matt. Definitely worth a look. I'll try
:to be in IRC during the weekend so maybe we can come up with something.
:
:Cheers,
:-- 
:Miguel Mendez <flynn at xxxxxxxxxxxxxxxxxx>

This sounds more like 'playback' than journaling..

Starting to sound similar to a VAX editing playback buffer or
an hpfs-386 fs 'feature', even.

hpfs is NOT a journaled fs, but successive runs of the deeper levels of
chkdsk would progressively recover deeper and deeper archeological strata.

The problem with hpfs was that it took 'too d**n long' to do even a
light once-over on a mount that was marked dirty - even in a 1 GB
partition.... (fast SCSI too)

Chkdsk of hpfs on anything over 4GB, at level '2', was a ticket to an
old-age home.  By the time the data had been recovered, you had forgotten
why it mattered.

Hence JFS, beloved of Warpenvolk und AIXen.

JFS never takes first place in run-offs against other 'optimized for...' fs,
but it is well-balanced enough to always be in the top few.

It doesn't really preserve much at all of what was 'in-process'.
It just lets you know that the *other* 99.999% of the disk is still OK.
Beats Coroner's Tool Kit all hollow....

It seems to me that audit-trail / undo pitches the journaling-advantage baby
right out with the bathwater.

Performance may be perfectly reasonable when a sane undo depth is set.

But some fool will set it to an outrageous number, run tests, and publish
in {other OS}-Journal/World how poorly it compares with {some other fs},
and the damage will be enduring.

Can the number of undos above some reasonable number (1?, 2?) be made to
require an intentional compile-time flag setting?

IMNSHO, good-quality hardware RAID, set write-through, not write-back, and
mounted sync, not softupdates, is needed *anyway* if the data is important.
And you then rsync that off-site...

No amount of same-media, same-spindle fallback is worth much without an
electron microscope, a cleanroom, US$100,000 to cover the labor costs...
and a guarantee from on-high that the whole facility will neither be
burgled nor go up in smoke.

Disks are magically reliable per unit cost these days.  They do still die,
get stolen, zapped, dropped, and - more often - get newfs'ed, rm -o or
srm'ed by accident....

Better to distribute the fs over the wire - inherent disaster recovery -
than make it too complex locally.  Look at SSA and FC-AL.  Hardware
solutions of long standing.

Firewire-2, USB-2 and GigE with IP NAS can bring that to us less-affluent
folks too.....

Bill