HAMMER update 19-June-2008 (56C) (HEADS UP - MEDIA CHANGED)

Matthew Dillon dillon at apollo.backplane.com
Fri Jun 20 10:04:46 PDT 2008


:
:That's harmless for atime, but you really want mtime to be properly  
:synchronised with the last data update (and to stay that way across an  
:undo). Ideally, timestamp data records and hold mtime as a reference  
:to the last one updated (or something like that).
:
:--
:Bob Bishop          +44 (0)118 940 1243
    
    Yah, I agree.  Here's a quick summary of the issues:

	* UNDO records are used to compartmentalize atomic changes which
	  cover multiple disk blocks.  For example, if you 'rm' a file
	  and a crash occurs, you want the state of the filesystem to
	  either show the file and its directory entry both removed, or show
	  the file and its directory entry both still present.

	* Updates to the inode_data, which holds the stat/chmod info for
	  a file object, typically require rolling a new inode_data record
	  with the old one still available via the filesystem history.  For
	  example, if you append some stuff to an existing file, an old
	  version of the inode_data must be present in order to 'see' the
	  previous state of the file (in particular, the previous st_size
	  of the file).

	* BUT, having to do any of the above when updating atime and mtime
	  would be really expensive.

	  - atime gets updated all the time. We definitely do not want to
	    roll UNDO records *or* new inode_data records.

	  - mtime gets updated all the time in certain situations, such as
	    when overwriting a file (e.g. in ways that do not modify the
	    file's size).

	  - mtime is often used to uniquely determine whether a file has
	    been modified.

	* And, finally, we want mirroring to work properly even if the
	  filesystem is mounted 'nohistory' (told not to roll new
	  inode_data records).  Or, for that matter, if individual files
	  are chflagged 'nohistory'.
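
    The first bullet above (UNDO records making a multi-block change
    atomic) can be illustrated with a minimal sketch.  This is Python
    rather than HAMMER's C, and the Media class, its block names and its
    recover() step are all hypothetical; it only shows the idea of saving
    old contents before an in-place write and rolling them back after a
    crash.

```python
class Media:
    """Toy block store with an UNDO log (hypothetical, not HAMMER's layout)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.undo = []          # (block_id, old_contents) pairs

    def write(self, block_id, data):
        # Record the old contents before modifying the block in place.
        self.undo.append((block_id, self.blocks.get(block_id)))
        self.blocks[block_id] = data

    def commit(self):
        # Every block of the operation has been written; drop the UNDO data.
        self.undo.clear()

    def recover(self):
        # After a crash, roll back uncommitted writes in reverse order.
        while self.undo:
            block_id, old = self.undo.pop()
            if old is None:
                self.blocks.pop(block_id, None)
            else:
                self.blocks[block_id] = old


# 'rm foo' touches two blocks: the directory and the file's inode.
media = Media({"dir": ["foo"], "inode:foo": "data"})
media.write("dir", [])      # directory entry removed...
# ...crash here, before the inode is removed and before commit()
media.recover()             # both blocks are back to their old state
```

    After recover() the directory entry and the inode are both still
    present, which is one of the two acceptable outcomes.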

    The bane of HAMMER's design is that we absolutely do not want to roll
    new inode_data records unless we have to, so here is what I am going to
    do:

	* ATime will be updated asynchronously and will not be CRCd, so
	  the B-Tree element's CRC field does not have to be updated.
	  (thus no UNDO records need to be generated either).

	* MTime will be updated semi-synchronously and will be CRCd.
	  (It will be fully synchronous from the point of view of
	  anyone using the filesystem, of course).  UNDO records will 
	  be generated but new inode_data records will not have to be
	  created.  The mtime will be updated in-place.
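
    A rough sketch of the difference between the two update paths, in
    Python for brevity.  The InodeData class, its field list, the struct
    layout and the use of zlib.crc32 are assumptions made for
    illustration, not HAMMER's on-media format.  The point is only that
    a field left out of the CRC can be changed in place without touching
    the CRC (and so without UNDO), while a CRC'd field cannot.

```python
import struct
import zlib


class InodeData:
    """Hypothetical record: atime is stored but not covered by the CRC."""

    def __init__(self, size, mode, mtime, atime):
        self.size, self.mode, self.mtime, self.atime = size, mode, mtime, atime
        self.crc = self.compute_crc()

    def compute_crc(self):
        # atime is deliberately left out of the CRC'd bytes.
        return zlib.crc32(struct.pack("<QIQ", self.size, self.mode, self.mtime))

    def set_atime(self, atime):
        # In-place update; the CRC is unaffected, so there is nothing to UNDO.
        self.atime = atime

    def set_mtime(self, mtime, undo_log):
        # In-place update, but the CRC changes, so save the old values first.
        undo_log.append((self.mtime, self.crc))
        self.mtime = mtime
        self.crc = self.compute_crc()


undo_log = []
ino = InodeData(size=4096, mode=0o644, mtime=1000, atime=1000)
crc_before = ino.crc

ino.set_atime(2000)
atime_changed_crc = ino.crc != crc_before    # atime is outside the CRC

ino.set_mtime(2000, undo_log)
mtime_changed_crc = ino.crc != crc_before    # mtime is inside the CRC
```

    Neither path creates a new record; only the mtime path has to log
    the old state.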

    That solves the contemporary-use situations.  And I think I have a
    solution for mirroring too.  Mirroring will depend on a serial number
    field stored along with the B-Tree elements, with the highest serial
    number in the node propagated upwards towards the B-Tree root.
    Ultimately the B-Tree root node will wind up with a serial number
    representing the most recent change made to the filesystem.

    As I think about it, the serial number itself can be updated atomically
    using UNDO records, and the update can occur even if a new inode_data
    record is not rolled (so the serial number would be updated in-place
    in the B-Tree element and propagated upwards towards the B-Tree root).

    That makes it work with 'nohistory' mounts or files and also means
    serial number generation will be compatible with the MTime update
    mechanic, allowing us to roll new serial numbers for MTime updates
    without having to insert new B-Tree elements.

    We would not roll new serial numbers for ATime updates though (can you
    imagine the load that would create?).  I think ATime will have to
    operate independently on the mirrors, at least for now.
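
    The propagation described above might look roughly like the
    following.  It is a Python sketch with hypothetical Node, Leaf and
    Tree classes and a simple global counter for serial numbers; the
    real B-Tree node layout and serial allocation are not specified
    here.

```python
class Node:
    """B-Tree node carrying the highest serial number found beneath it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.serial = 0
        if parent is not None:
            parent.children.append(self)


class Leaf(Node):
    """Leaf holding {key: serial} elements (hypothetical layout)."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self.elements = {}


class Tree:
    def __init__(self):
        self.root = Node()
        self.next_serial = 1

    def touch(self, leaf, key):
        # Stamp the element in place with a fresh serial number...
        serial = self.next_serial
        self.next_serial += 1
        leaf.elements[key] = serial
        # ...then push it up until an ancestor already has it or higher.
        node = leaf
        while node is not None and node.serial < serial:
            node.serial = serial
            node = node.parent
        return serial


tree = Tree()
leaf1 = Leaf(tree.root)
leaf2 = Leaf(tree.root)
tree.touch(leaf1, "a")      # serial 1
tree.touch(leaf2, "b")      # serial 2
# tree.root.serial is now 2, the most recent change in the filesystem
```

    An MTime update would call something like touch() on an existing
    element; an ATime update would not.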

    This will give the mirroring code the ability to store just one thing...
    the serial number of where it 'left off' the last time, and it can use
    that number to then scan the B-Tree from the root node downward and
    only go down the branches with serial numbers >= the mirror's saved
    serial number.  The result will be that the mirroring code can very
    quickly locate records modified relative to the last time it ran,
    without needing record queues.  It will be possible to do it in batch
    or semi-real-time.  Plus the mirroring will be completely disconnected
    from the flow of modifications made to the filesystem and thus will
    not affect write performance at all.
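
    A sketch of that scan, again in Python and with made-up structures:
    nodes are plain dicts, and the scan() function and its stats counter
    are hypothetical.  It assumes each node's serial number is the
    highest serial number beneath it, as described above.

```python
def scan(node, saved_serial, stats):
    """Return (key, serial) pairs with serial >= saved_serial."""
    stats["visited"] += 1
    if "elements" in node:
        # Leaf: pick out the elements changed since the mirror left off.
        return [(key, serial)
                for key, serial in sorted(node["elements"].items())
                if serial >= saved_serial]
    found = []
    for child in node["children"]:
        # Skip whole branches that have nothing new in them.
        if child["serial"] >= saved_serial:
            found.extend(scan(child, saved_serial, stats))
    return found


leaf_a = {"serial": 3, "elements": {"a": 1, "b": 3}}
leaf_b = {"serial": 7, "elements": {"c": 7, "d": 5}}
leaf_c = {"serial": 2, "elements": {"e": 2}}
root = {"serial": 7, "children": [leaf_a, leaf_b, leaf_c]}

stats = {"visited": 0}
changed = scan(root, 5, stats)
# changed == [("c", 7), ("d", 5)]; only the root and leaf_b were visited
```

    The mirror only has to remember one number between runs, and an
    interrupted run can simply be restarted from the same number.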

    I don't think there are any major gotchas with my plan.  The only question
    mark is the I/O load that propagating the serial numbers to the root of
    the B-Tree will entail, but I think I can optimize that.  Since UNDO records
    are generated I can do massive aggregation of serial number updates.
    Besides, how big a performance price are people willing to pay to get
    premium mirroring?  Probably pretty big.

    --

    There's a little side story here, going back to the Backplane Inc
    Database.  When I was doing Backplane Inc, a start-up that sadly fell
    in the dot-com crash, I had a batchable, restartable, totally
    disconnected mirroring capability that effectively allowed me to mirror
    the production databases to a backup box in my home over a not very
    reliable modem connection.  It would always get behind during the day,
    then catch up over night.  It didn't care about frequent disconnects,
    it didn't care about the ludicrously low modem bandwidth... it just
    worked.

    That's how I want HAMMER's mirroring to work.

    Ultimately the serial numbers will serve a second purpose, and that 
    will be as a rendezvous point for clustered filesystem operation,
    where the machine cluster is accessing multiple mirrors of the same
    filesystem which might be in various states of catch-up.  The
    cluster protocols will agree on a serial number, and then be able to
    access the data from any mirror whose record(s) are updated through that
    serial number.  The Backplane database was also able to do the same
    thing, using a quorum to agree on the transaction id that represented
    the desired data, and then pulling that data from any master or slave
    copy of the database that had that transaction id.
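
    One possible reading of that rendezvous, sketched in Python.  The
    cluster protocol is not defined here, so everything in this block is
    an assumption: the majority rule, the quorum_serial() and eligible()
    helpers, and the mirror names and numbers are all invented for
    illustration.

```python
def quorum_serial(synced):
    """Highest serial number that a majority of mirrors have reached.

    'synced' maps a mirror name to the serial number it is updated through.
    """
    need = len(synced) // 2 + 1
    ordered = sorted(synced.values(), reverse=True)
    return ordered[need - 1]


def eligible(synced, serial):
    """Mirrors whose records are updated through the agreed serial number."""
    return sorted(name for name, s in synced.items() if s >= serial)


# Five mirrors in various states of catch-up (made-up numbers).
mirrors = {"m1": 120, "m2": 95, "m3": 118, "m4": 60, "m5": 118}

agreed = quorum_serial(mirrors)         # 118: three of five have reached it
readers = eligible(mirrors, agreed)     # ["m1", "m3", "m5"]
```

    Mirrors that are still behind the agreed number are simply not used
    for that access until they catch up.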

    That's how I want HAMMER's clustered filesystem access to work.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




