cvs commit: src/sys/kern vfs_cache.c vfs_syscalls.c vfs_vnops.c vfs_vopops.c src/sys/sys namecache.h stat.h
Matthew Dillon
dillon at apollo.backplane.com
Thu Aug 25 15:10:49 PDT 2005
:
:On Thu, Aug 25, 2005 at 01:17:55PM -0700, Matthew Dillon wrote:
:> I don't know about imon under Linux, but kqueue on the BSDs doesn't
:> even come *close* to providing the functionality needed, let alone
:> providing us with a way monitor changes across distinct invocations.
:> Using kqueue for that sort of thing is a terrible idea.
:
:Trying to reliably detect what changed between two invocations based on
:any kind of transaction ID is a very bad idea, because it introduces a lot
:of *very* nasty issues. I'll comment on some of them later.
Joerg, please keep in mind that I've written a database. A real one.
From scratch. The whole thing. The site is defunct but the white paper
is still there:
http://www.backplane.com/docs/drdbms1.html
So before you start commenting on what you think is wrong, read
that paper and then you will understand how well I understand
cache coherency and transactional database algorithms.
:> I'm not sure I understand what you mean about not reaching all
:> parent directories. Perhaps you did not read the patch set. It
:> most certainly DOES reach all parent directories, whether they've been
:> read or not. That's the whole point. It goes all the way to '/'.
:
:It only handles vnode changes for entries which are already in the name
:cache. So it is incomplete. It can't behave otherwise without keeping
:the whole directory tree in memory, but that doesn't solve the problem.
The namecache is fully coherent. If a vnode exists for a file that
has not been unlinked, its namecache entry will exist. That is part
of the namecache work I did for DragonFly. It isn't true in FreeBSD,
it IS true in DragonFly. If a file HAS been unlinked, well, it's no
longer in the filesystem, is it? So we don't care any more.
The entire directory tree does not need to be in memory, only the
pieces that lead to (cached) vnodes. DragonFly's namecache subsystem
is able to guarantee this.
:...
:> to just the elements that have changed, and to do so without having to
:> constantly monitor the entire filesystem.
:
:Monitoring filesystems is not always a good idea. Allowing any
:user/program to detect activity in a subtree of the system can help to
:detect or circumvent security measures.
I'm not particularly worried about such a coarse detection method,
but it could always be restricted to root if need be (like st_gen is).
It's hardly a reason to not do something.
:> The methodology behind the transaction id assignments can make this
:> a 100% reliable operation on a *RUNNING*, *LIVE* system. Detecting
:> in-flight changes is utterly trivial.
:
:On a running system, it is enough to either get notification when a
:certain vnode changed (kqueue model) or when any vnode changed (imon /
:dnotify model). Trying to detect in-flight changes is *not* utterly
:trivial for any model, since even accurate atime is already difficult to
:achieve for mmaped files. Believing that you can *reliably* back up a
:system based on VOP transactions alone is therefore a dream.
This is not correct. It is certainly NOT enough to just be told
when an inode changes... you need to know where in the namespace
the change occurred and you need to know how the change(s) affect
the namespace. Just knowing that a file with inode BLAH has been
modified is not nearly enough information.
Detecting in-flight changes is trivial. You check the FSMID before
descending into a directory or file, and you check it after you ascend
back out of it. If it has changed, you know that something changed
while you were processing the directory or file and you simply re-recurse
down and rescan just the bits that now have different FSMIDs.
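Roughly, the rescan loop looks something like this in C. This is
just an illustrative userland sketch: fsmid_of() and backup_one()
are hypothetical stand-ins for however the FSMID is fetched and an
element is actually backed up; the directory walk uses plain POSIX
calls.

    #include <dirent.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    uint64_t fsmid_of(const char *path);   /* hypothetical: fetch FSMID */
    void backup_one(const char *path);     /* hypothetical: back up one element */

    void
    scan(const char *path)
    {
        uint64_t before, after;
        DIR *dp;
        struct dirent *de;
        char child[1024];

        do {
            before = fsmid_of(path);        /* check before descending */
            backup_one(path);
            if ((dp = opendir(path)) != NULL) {
                while ((de = readdir(dp)) != NULL) {
                    if (strcmp(de->d_name, ".") == 0 ||
                        strcmp(de->d_name, "..") == 0)
                        continue;
                    snprintf(child, sizeof(child), "%s/%s",
                             path, de->d_name);
                    scan(child);            /* recurse into children */
                }
                closedir(dp);
            }
            after = fsmid_of(path);         /* check after ascending */
            /*
             * On a mismatch a real backup would compare recorded
             * per-element FSMIDs and re-descend only into the bits
             * that actually changed, rather than redoing the subtree.
             */
        } while (after != before);          /* changed in-flight: rescan */
    }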
It's a simple recursive algorithm and it works just fine. If a file
is changing a whole lot, such that you can't take a snapshot before
it changes again, then you need finer-grained information... which is
exactly what the journal part of the system is capable of giving you,
but even without that the program attempting to do the backup is fully
able to detect that there might be a potential problem and can either
try to brute force its way through (retry), or do something smarter,
or just report the fact. It's a lot better than you get with 'dump',
that's for sure.
:> Nesting overhead is an issue, but not a big one. It's a very solvable
:> problem and certainly should not hold up an implementation. The only
:> real issue occurs when someone does a write() vs someone else stat()ing
:> a directory along the parent path. Again, very solvable and certainly
:> not a show stopper in any way.
:
:It is a big issue, because it is not controllable. With kqueue,
:imon, and dnotify it can be done *selectively* for filesystems where it
:is needed and wanted. Even my somewhat small filesystems already have
:over a million inodes. Just reading them all would already create a
:lot more (memory) IO merely to update the various atimes.
And it's utterly trivial to do the same with FSMIDs: because they are
based on the namecache topology, all we need is a mechanism that flags
the part of the topology we care about IN the topology itself.
The namecache records in question (representing the base of the
hierarchy being recorded) are then simply locked into the system
and all children of said records inherit the flag. Poof, done.
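In sketch form (the struct and flag names here are hypothetical, not
DragonFly's actual struct namecache):

    #define NCF_FSMONITOR  0x0001          /* subtree is being monitored */

    struct ncentry {
        struct ncentry *nc_parent;
        int             nc_flags;
    };

    /* Flag (and lock in) the base of the hierarchy we care about. */
    void
    monitor_subtree(struct ncentry *base)
    {
        base->nc_flags |= NCF_FSMONITOR;
    }

    /* Children inherit the flag when they are instantiated. */
    void
    ncentry_init(struct ncentry *ncp, struct ncentry *parent)
    {
        ncp->nc_parent = parent;
        ncp->nc_flags = parent->nc_flags & NCF_FSMONITOR;
    }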
But the plain fact of the matter is that for 99.9% of the installations
out there, doing it globally will not result in any significant
performance impact. We are talking about a few hundred nanoseconds
per write(), even with a deep hierarchy. And as I indicated earlier,
the performance issues are a very solvable problem anyway, so it is
hardly something you can hold up and say 'we can't do it because of this'.
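To put the per-write() cost in perspective, the bookkeeping amounts
to a pointer walk up the parent chain. Again a hypothetical sketch,
extending the ncentry struct above with an FSMID field and assuming
one monotonic transaction id assigned per operation:

    #include <stddef.h>
    #include <stdint.h>

    #define NCF_FSMONITOR  0x0001

    struct ncentry {
        struct ncentry *nc_parent;
        int             nc_flags;
        uint64_t        nc_fsmid;   /* last transaction id seen here */
    };

    /*
     * Record the write()'s transaction id on every path component up
     * to '/'.  In selective mode, unmonitored components are skipped.
     */
    void
    fsmid_bump(struct ncentry *ncp, uint64_t txid, int selective)
    {
        for (; ncp != NULL; ncp = ncp->nc_parent) {
            if (selective && (ncp->nc_flags & NCF_FSMONITOR) == 0)
                continue;           /* nobody monitoring this subtree */
            ncp->nc_fsmid = txid;
        }
    }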
:...
:> is certainly entirely possible to correctly implement the desired
:> behavior.
:
:As soon as you try to make it persistent you add the problem of how
:applications should behave after a reboot. Since you mentioned
:backups, let's just discuss that. The backup program reads a change
:just after it was made by a program, but before it has hit the disk. The
:FSMID it sends to the backup server is therefore nowhere recorded on
:disk (and doing that would involve quite a lot of performance
:penalties). Now the machine is "suddenly" restarting. Can you ensure
:that the same FSMID is not reused, in which case the state of the
:filesystem and the state of the backup is inconsistent? Sure, the
:program can try to detect it, but that would make the entire concept of
:FSMID useless.
:
:Joerg
These are all fairly trivial problems. Yes, in fact you can *EASILY*
determine that an FSMID hasn't made it to disk, simply by adopting
a database-style transaction id (which that white paper discusses a
bit, I believe). Indeed, if you did not have a transactional id
recorded in the filesystem (which we don't right now), it would be
very difficult to resynchronize something like the live journal with
the on-disk filesystem.
Think of transactional ids... the FSMIDs we would record on the disk,
as being snapshot ids. They don't tell us exactly what is on the disk
but they give us a waypoint which tells us that all transactions occurring
<= the recorded FSMID are *definitely* on the disk. By comparing those
ids against the ids we store in the journal transactions we can in fact
very easily determine which transactions MAY NOT have made it to disk,
and rerun the journal from point A to point B for the affected elements.
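The recovery pass is then just a scan of the journal against the
recorded waypoint. Another hypothetical sketch, assuming a journal
record format that carries the transaction id:

    #include <stddef.h>
    #include <stdint.h>

    struct jrecord {
        uint64_t jr_fsmid;          /* transaction id of this record */
        /* ... journaled data for the affected element ... */
    };

    struct jrecord *journal_next(struct jrecord *jr);   /* hypothetical */
    void journal_replay(struct jrecord *jr);            /* hypothetical */

    void
    recover(struct jrecord *journal, uint64_t disk_fsmid)
    {
        struct jrecord *jr;

        for (jr = journal; jr != NULL; jr = journal_next(jr)) {
            if (jr->jr_fsmid <= disk_fsmid)
                continue;           /* definitely already on disk */
            journal_replay(jr);     /* may not have made it; rerun */
        }
    }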
That is a whole lot more robust than anything else that currently exists
for any filesystem, and in fact I believe it would allow us to provide
recovery guarantees based solely on the journal data being
acknowledged, rather than having to wait for the filesystem to catch
up to the journal. That's a big deal.
For example, softupdates right now is not able to guarantee data
consistency. If you crash while writing something out, then on reboot
you can wind up with some data blocks full of zeros, or full of old
data, while other data blocks contain new data. The FSMID for the file
would tell us how far back in the journal we need to go to recover
the potentially missing data. There is nothing else that gives us that
information, nothing else that tells us how far back in the journal
we would have to go to recover all the lost data.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>