Initial filesystem design synopsis.

Matthew Dillon dillon at apollo.backplane.com
Thu Feb 22 01:24:30 PST 2007


:Quick question...
:
:On Wed, 21 Feb 2007, Matthew Dillon wrote:
:
:>     - Infinite snapshots
:> 
:>     - Multi-master operation
:> 
:>     - Infinite logless Replication
:
:transid space: monotonic increasing on each replication target, or a 
:fine-grained synchronised timestamp*, or something else?
:
:Cheers,
:jan

    Monotonic increasing AND a fine-grained timestamp.  Low bits of
    the timestamp (sub-nanosecond equivalent) would simply be used to
    identify the replication target, allowing each target to 'allocate'
    transaction ids independently (and also incidentally tell us which
    'master' was responsible for the original op that is now being
    replicated).  A newly created transaction id would at a minimum have
    to be larger than the last transaction id... and if this goes beyond
    the current 'real time', the host must sleep for a few microseconds
    to allow real time to catch up.  (In reality the granularity can be
    selected such that it is possible to allocate hundreds of thousands
    or millions of transids a second across the entire cluster, so this
    isn't an issue).
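
    As a rough sketch of that allocation rule (not actual filesystem
    code): the class name, the 8-bit NODE_BITS width, the nanosecond
    tick, and the injectable clock/sleep hooks are all assumptions made
    for illustration.  Python integers are unbounded, so the field
    widths here are not sized to fit a real 64-bit transaction id.

```python
import time

NODE_BITS = 8  # assumed width of the low bits that identify the replication target


class TransidAllocator:
    """Hypothetical per-target allocator: timestamp in the high bits, target id in the low bits."""

    def __init__(self, node_id, clock=time.time_ns, sleep=time.sleep):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id does not fit in NODE_BITS")
        self.node_id = node_id
        self.clock = clock      # returns current time in nanoseconds
        self.sleep = sleep      # takes a duration in seconds
        self.last_tick = 0

    def allocate(self):
        # The new id must be larger than the last one allocated on this target.
        tick = max(self.clock(), self.last_tick + 1)
        # If that pushed us beyond real time, sleep until real time catches up.
        while True:
            ahead = tick - self.clock()
            if ahead <= 0:
                break
            self.sleep(ahead / 1e9)
        self.last_tick = tick
        return (tick << NODE_BITS) | self.node_id
```

    Two targets with different node ids can each run their own
    allocator without coordinating and never produce the same id,
    because the low bits differ.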

    The transaction id must be translatable into a timestamp of sorts
    (beyond the monotonic requirement), just to make snapshot handling
    sane.
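
    A minimal sketch of that translation, assuming the same
    hypothetical layout as above (nanosecond timestamp in the high
    bits, an assumed 8-bit target id in the low bits).  The function
    names are made up for illustration.

```python
NODE_BITS = 8  # assumed width of the replication-target field


def transid_to_time_ns(transid):
    """Recover the timestamp portion of a transaction id."""
    return transid >> NODE_BITS


def transid_to_node(transid):
    """Recover which target allocated the transaction id."""
    return transid & ((1 << NODE_BITS) - 1)


def snapshot_bound(as_of_ns):
    """Largest transaction id whose timestamp is <= as_of_ns.

    A snapshot 'as of' a wall-clock time could then be defined as
    every change whose transid is <= this bound.
    """
    return ((as_of_ns + 1) << NODE_BITS) - 1
```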

    The problem with such a scheme is, of course, that a host which is
    not properly time synchronized can throw a big wrench in the works.
    And, also, conceivably someone could set the system time
    to 0xffffffffffffffff and the filesystem would barf (not be able to
    allocate any new transaction ids because, well, it just ran out!).

    Sanity checks in the code can handle unsynchronized hosts, guarantee
    monotonic increasing transaction ids, and prevent the filesystem 
    from becoming corrupted.  Deliberately generating absurd time stamps
    would be a bigger problem... for example, basing your cluster on a
    single machine's RTC would be a bad idea.  At the very least you
    would want an NTP-synchronized time source.
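
    One possible shape for such a sanity check on the receiving side,
    purely as an illustration: the function name, the one-second
    MAX_FUTURE_NS tolerance, and the per-source last_accepted
    bookkeeping are assumptions, not part of the design above.

```python
NODE_BITS = 8                   # assumed width of the replication-target field
MAX_FUTURE_NS = 1_000_000_000   # assumed tolerance: 1 second of clock skew


def accept_remote_transid(transid, last_accepted, now_ns):
    """Return True if a replicated transid from one source passes both checks."""
    if transid <= last_accepted:
        return False    # would break monotonic ordering for this source
    if (transid >> NODE_BITS) > now_ns + MAX_FUTURE_NS:
        return False    # sender's clock is implausibly far ahead of ours
    return True
```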

    Monotonic increasing transaction ids are *CRITICAL* to replication
    protocols.  Absolutely critical.  It's the difference between having
    to keep a physical log of changes (with unbounded size), and just 
    having to store the last transaction ID you had synchronized to.
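
    To illustrate the point, here is a toy sketch in which the target
    keeps nothing but the last transaction id it synchronized to.  The
    record layout (transid, key, value) and both function names are
    hypothetical; a real filesystem would scan its own on-disk
    structures rather than a Python list.

```python
def changes_since(records, last_synced):
    """Yield source records newer than last_synced, in transid order."""
    newer = (r for r in records if r[0] > last_synced)
    return sorted(newer, key=lambda r: r[0])


def sync(source_records, target_store, last_synced):
    """Apply every newer record to the target and return the new high-water mark."""
    for transid, key, value in changes_since(source_records, last_synced):
        target_store[key] = value
        last_synced = transid
    return last_synced
```

    Because the ids only ever increase, the returned value is the only
    state the target has to remember between runs; the source does not
    need a per-target change log.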

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




