cvs commit: src/sys/kern vfs_cache.c vfs_syscalls.c vfs_vnops.c vfs_vopops.c src/sys/sys namecache.h stat.h

Matthew Dillon dillon at apollo.backplane.com
Fri Aug 26 11:37:15 PDT 2005


:On Thu, Aug 25, 2005 at 03:09:21PM -0700, Matthew Dillon wrote:
:>     The entire directory tree does not need to be in memory, only the
:>     pieces that lead to (cached) vnodes.  DragonFly's namecache subsystem
:>     is able to guarantee this.
:
:*How* can it guarantee that without reading the whole directory tree in
:memory first? Unix filesystems have no way to determine in which
:directories an inode is linked from. If you have /dir1/link1 and
:/dir2/dir3/link2 as hardlinks for the same inode, you can't correctly
:update the FSMID for dir2 without having read dir3 first, simply because
:no name cache entry exists.

    This is true of hardlinks, yes, but if the purpose is to mirror
    then it doesn't really matter which path is used to get to the file.
    And from an auditing and security standpoint you don't have to worry
    about pre-existing 'random' hardlinks going to places that they shouldn't,
    because that's already been checked for.  What you do want to know about
    are newly created hardlinks in places where they shouldn't exist, and
    that ability would not be impaired in the least.  Also, directories
    cannot be hardlinked, only files.

    As problems go this one would have virtually no effect on the types
    of operations that we want to be able to accomplish.  You can't just
    throw up your hands, point to a corner case that will hardly ever
    occur in real life (and not at all for a huge chunk of potential
    applications of the feature), and call it a showstopper.

    If it turns out that the file hardlink issue interferes with a certain
    type of operation that we desire to have, it is also a very solvable
    problem.  Programs like cpdup already deal with hardlinks, so the
    real issue is whether you want to take the hit of scanning the entire
    directory tree to find the links or whether you want to maintain a
    lookaside database and use the journal to keep it up to date.
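
    To make this concrete, detecting file hardlinks during a tree scan
    only requires keying a table on (st_dev, st_ino).  This is just a
    sketch of the general technique, not cpdup's actual code, and the
    helper name is made up:

/*
 * Sketch: detect hardlinked files during a tree scan by keying a
 * table on (st_dev, st_ino).  Illustrative only.
 */
#include <sys/stat.h>
#include <search.h>
#include <stdio.h>
#include <stdlib.h>

struct linkent {
	dev_t	dev;
	ino_t	ino;
	char	path[1024];
};

static int
linkcmp(const void *a, const void *b)
{
	const struct linkent *la = a, *lb = b;

	if (la->dev != lb->dev)
		return (la->dev < lb->dev ? -1 : 1);
	if (la->ino != lb->ino)
		return (la->ino < lb->ino ? -1 : 1);
	return (0);
}

static void *linkroot;

/*
 * Return the previously seen path for this inode, or NULL after
 * recording the path if this is the first link encountered.
 */
const char *
check_hardlink(const char *path, const struct stat *st)
{
	struct linkent *ent, **res;

	if (st->st_nlink <= 1)
		return (NULL);
	ent = malloc(sizeof(*ent));
	ent->dev = st->st_dev;
	ent->ino = st->st_ino;
	snprintf(ent->path, sizeof(ent->path), "%s", path);
	res = tsearch(ent, &linkroot, linkcmp);
	if (*res != ent) {		/* already present: a hardlink */
		free(ent);
		return ((*res)->path);
	}
	return (NULL);
}

    A lookaside database kept up to date from the journal would amount to
    persisting this table instead of rebuilding it on every scan.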

:> :On a running system, it is enough to either get notification when a
:> :certain vnode changed (kqueue model) or when any vnode changed (imon /
:> :dnotify model). Trying to detect in-flight changes is *not* utterly
:> :trivial for any model, since even accurate atime is already difficult to
:> :achieve for mmapped files. Believing that you can *reliably* back up a
:> :system based on VOP transactions alone is therefore a dream.
:> 
:>     This is not correct.  It is certainly NOT enough to just be told
:>     when an inode changes.... you need to know where in the namespace
:>     the change occurred and you need to know how the change(s) affect
:>     the namespace.  Just knowing that a file with inode BLAH has been
:>     modified is not nearly enough information.
:
:The point is that the application can determine which inodes it is
:interested in and reread, e.g., a directory when it has changed. There are
:some edge cases which might be hard to handle without additional
:information (e.g. when a link is moved outside the currently supervised
:area and you want to continue its supervision). That's an entirely
:different question though.

    No.  The problem is that the application (such as a mirroring program)
    could be interested in ALL THE INODES, not just some of them.  Monitoring
    inodes doesn't help you catch situations where new files are created,
    nor does it help you if you want to monitor activity on an entire
    subtree (which could contain thousands of directories and millions of
    files), or any situation where you need to monitor more than a handful
    of inodes.  The kqueue approach is just plain stupid, frankly.  It is
    totally unscalable and totally insufficient when dealing with terabyte
    filesystems.

:...
:>     back out of it.  If it has changed, you know that something changed
:>     while you were processing the directory or file and you simply re-recurse
:>     down and rescan just the bits that now have different FSMIDs.
:
:But it is also very limited because it doesn't allow any filtering on
:what is interesting. In the worst case you just update all the FSMIDs

    This is incorrect.  I just said in my last email that you *CAN* filter
    on what is interesting.  Maybe not with this first commit, but the basic
    premise of using the namecache topology not only for monitoring but also
    for configuration and control is just about the only approach that
    will actually work for implementing a filtering mechanism,
    because it can cover millions of files and directories with very little
    effort and because it can be inclusive of files or dirs that have not
    yet been created.

    What you are proposing doesn't even come close to having the monitoring
    and control capabilities that we need.
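
    To make the rescan loop quoted above concrete, here is a rough sketch
    of how a userland scanner could use the FSMIDs: only descend into
    nodes whose FSMID differs from the value recorded on the previous
    pass.  It assumes the st_fsmid field this commit adds to struct stat
    (via the stat.h change); the lookup/save/process helpers are made up,
    standing in for whatever per-path store the application keeps:

/*
 * Sketch of an FSMID-driven incremental rescan.  Subtrees whose
 * FSMID has not changed since the last pass are pruned entirely.
 */
#include <sys/stat.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

extern int64_t	lookup_saved_fsmid(const char *path);		/* hypothetical */
extern void	save_fsmid(const char *path, int64_t fsmid);	/* hypothetical */
extern void	process_change(const char *path);		/* hypothetical */

void
rescan(const char *path)
{
	struct stat st;
	struct dirent *den;
	DIR *dir;
	char sub[1024];

	if (lstat(path, &st) < 0)
		return;
	if (st.st_fsmid == lookup_saved_fsmid(path))
		return;			/* nothing at or below here changed */
	process_change(path);
	if (S_ISDIR(st.st_mode) && (dir = opendir(path)) != NULL) {
		while ((den = readdir(dir)) != NULL) {
			if (strcmp(den->d_name, ".") == 0 ||
			    strcmp(den->d_name, "..") == 0)
				continue;
			snprintf(sub, sizeof(sub), "%s/%s", path, den->d_name);
			rescan(sub);
		}
		closedir(dir);
	}
	save_fsmid(path, st.st_fsmid);
}

    Because FSMID changes propagate up the namecache topology, an
    unchanged FSMID at a directory prunes the entire subtree beneath it,
    which is what lets this scale to millions of files.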

:for nothing. It also means that, as long as there is no way to store them
:persistently, you can't free namecache entries without having to deal
:with exactly those cases in applications. Storing them persistently has
:to deal with unrecorded changes which wouldn't be detected. Just think
:about dual-booting to FreeBSD.

    There is nothing anyone can do about some unrelated operating system
    messing around with your filesystems, nor should we restrict our activities
    based on the possibility.  This is a DragonFly feature for systems running
    DragonFly, not for systems running FreeBSD or Linux or any other OS.

:>     For example, softupdates right now is not able to guarantee data
:>     consistency.  If you crash while writing something out, then on reboot
:>     you can wind up with some data blocks full of zeros, or full of old
:>     data, while other data blocks contain new data.
:
:That's not so much a problem of softupdates, but of any filesystem without very
:strong data journaling. ZFS is said to do something in that area, but it
:can't really solve interactions which cross filesystems. The very same
:problem exists for FSMIDs. This is something where a transactional database
:and a normal filesystem differ: filesystems almost never have full
:write-ahead log files, because it makes them awfully slow. The most
:important reason is that applications have no means to specify explicit
:transaction borders, so you have to assume an autocommit style usage
:always.
:
:Joerg

    I have no idea what you are trying to say here, Joerg.  You seem to be
    throwing up your hands and saying that we shouldn't implement it 
    because it isn't perfect, but your proposal to monitor inodes (i.e.
    via kqueue) can't handle even a tenth of the types of operations I
    want DragonFly to be able to do.

    Insofar as persistent storage goes, we have several choices.  My number
    one choice is to integrate it into UFS, because it's almost trivial to
    do so.  A filesystem certainly does *NOT* have to be natively journaled or
    transactional in any way... all we have to do is update the inode with
    the new FSMID *after* the related data has been synchronized, and that's
    a very easy algorithm.  It doesn't even have to sync the file; there is
    nothing preventing us from writing out transitional FSMIDs (instead of
    the latest one) based on what we've synced to disk.  This is a far
    easier situation to deal with than, e.g., softupdates because we do not
    have to track crazy interactions within the filesystem.  The FSMIDs are
    allowed to be 'behind' the synced data as long as the synced data does
    not get ahead of the high level journal.
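
    Here is a minimal sketch of that ordering rule.  All the names are
    made up (this is not the actual UFS code); the only point is the
    order of operations: flush the data first, then write an FSMID that
    covers it:

/*
 * Sketch of the FSMID write-back rule.  Data is flushed first, then
 * an FSMID that covers it; the on-disk FSMID is allowed to lag the
 * in-memory one indefinitely.
 */
#include <stdint.h>

struct inode_sync_state {
	int64_t	fsmid_memory;	/* real-time FSMID, always current */
	int64_t	fsmid_ondisk;	/* last FSMID written to the inode */
};

extern void flush_data_buffers(struct inode_sync_state *ip);	/* hypothetical */
extern void write_inode_fsmid(struct inode_sync_state *ip,
			      int64_t fsmid);			/* hypothetical */

void
sync_inode(struct inode_sync_state *ip)
{
	/*
	 * Snapshot the FSMID the currently dirty data corresponds to,
	 * flush that data, and only then write the (possibly already
	 * stale) FSMID into the on-disk inode.  Newer FSMIDs may exist
	 * in memory by the time the write happens; they simply stay
	 * ahead of the disk.
	 */
	int64_t fsmid_cover = ip->fsmid_memory;

	flush_data_buffers(ip);
	write_inode_fsmid(ip, fsmid_cover);
	ip->fsmid_ondisk = fsmid_cover;
}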

    More to the point, though, it's a really bad idea to limit features
    simply because some filesystem written 20 years ago was not originally
    built to handle them.  DragonFly is about pushing the limits, not about
    accommodating them.  The journaling is a big leap for BSD operating
    systems, but there is a big gap in between that needs to be filled for
    those sysadmins who want to have alternative backup and auditing
    methodologies but who want to avoid doing continuous and full scans of
    their (huge) filesystems, not to mention other potential features.

			DATABASE TRANSACTIONS PRIMER

    I sense that there is a fundamental misunderstanding of how database
    transactions can actually work here, and how FSMIDs relate to the
    larger scheme, one that is probably shared by many people, so I will
    endeavor to explain it.

    If you take a high level view of a database-like transaction you
    basically have a BEGIN, some work, and a COMMIT.  The acknowledgement
    of the COMMIT is a guarantee that if a crash occurred right then, your
    transaction would still be intact after the reboot.

    But accomplishing this does not imply that the data must be synchronized
    to disk through the filesystem, nor does it imply that other, later
    transactions which had not yet been acknowledged couldn't be written to
    disk.  In our environment it only means that the operation must be
    journaled to persistent store (which is DIFFERENT from the activity going
    on in the filesystem), and that after a crash the system must be able
    to UNDO any writes that were written to the disk or to the journal
    that were related to UNCOMMITTED transactions.

    If you think about it, what this means is that the actual disk I/O we do
    can be a lot more flexible than our high level perception of the
    transaction.  It's very important that people understand this.
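
    To make the UNDO requirement concrete: a journal record conceptually
    carries the prior contents of whatever it overwrites, so recovery can
    roll back records belonging to transactions that never reached their
    COMMIT.  A hypothetical record layout (not the actual DragonFly
    journal format) might look like this:

/*
 * Hypothetical journal record layout.  Each record carries both the
 * new data and the prior contents of the range it overwrites, so
 * recovery can roll back records whose transaction never committed.
 */
#include <stdint.h>

#define JREC_COMMIT	0x0001	/* this record commits transaction xid */

struct jrecord {
	uint64_t xid;		/* transaction this write belongs to */
	uint64_t seq;		/* monotonic journal sequence number */
	uint64_t offset;	/* byte offset within the file */
	uint32_t bytes;		/* length of the affected range */
	uint32_t flags;
	/* followed by 'bytes' of new data and 'bytes' of UNDO (old) data */
};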

    Persistent FSMIDs fit into this idea very well.  When used as a recovery
    mechanism, all we have to do is guarantee that the transactions related
    to the FSMID we are writing have already gotten onto the disk.  Since we
    can delay FSMID synchronization indefinitely, this is a trivial 
    requirement that does not need the sophistication of softupdates and
    does not preclude, e.g., a lookaside database file to hold the FSMIDs
    for filesystems that cannot store them persistently.

    Our high level journal can be used to accomplish transactional unwinding,
    that is to UNDO changes made to the filesystem that are not
    transactionally consistent.  In the context of a filesystem, what
    this means is that we can use our high level journal to make the 
    persistent FSMID completely consistent with the filesystem state after
    a crash either by undoing filesystem operations to bring the filesystem
    back to the state as of the stored FSMID, or by regenerating the FSMID
    from the high level journal.  WE CAN GO BOTH FORWARDS AND BACKWARDS IN
    ORDER TO MAKE THE FILESYSTEM STATE SANE AGAIN AFTER A CRASH.

    THE ONLY REQUIREMENT for being able to accomplish this is that the
    filesystem operations in question not be synchronized to the disk until
    the related journal entry has been acknowledged.  Note that I am not
    saying that the operations should stall, I am simply saying that they
    would not be synchronized to the disk... they would still be in the
    buffer cache, and programs would still see instant updates to the FSMID
    and the file data.  

    Also remember that, unlike softupdates, the FSMID we write to the disk
    does not have to be the latest one, so we never get stuck in a situation
    where a program that is continuously writing to a file prevents its
    data buffers from being written out to the platter.  All it means is
    that the FSMID written to the disk may be slightly behind the FSMID
    stored in the journal, and both will be behind the real-time FSMID
    stored in system memory.

    Now it turns out that accomplishing this *ONE* requirement can be done
    solely within the high level buffer cache implementation.  It does not 
    require interactions with the filesystem.  e.g. UFS does not need to 
    have any knowledge about the interactions.
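
    A rough sketch of what that looks like inside the buffer cache
    flusher, with made-up names: a dirty buffer is tagged with the journal
    sequence number of the operation that dirtied it, and simply is not
    written until the journaling layer has acknowledged that sequence:

/*
 * Sketch: the one ordering rule, enforced entirely in the buffer
 * cache flusher, with no filesystem involvement.
 */
#include <stdint.h>

struct buf_sketch {
	uint64_t jseq;		/* journal seq that must be acked first */
	int	 dirty;
};

extern uint64_t	journal_acked_seq(void);			/* hypothetical */
extern void	write_buffer(struct buf_sketch *bp);		/* hypothetical */

void
flush_buffer(struct buf_sketch *bp)
{
	if (!bp->dirty)
		return;
	if (bp->jseq > journal_acked_seq())
		return;		/* journal not durable yet; leave it dirty */
	write_buffer(bp);
	bp->dirty = 0;
}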

    On crash recovery the FSMIDs can be used by the journaling subsystem
    to determine not only how far back in the journal it has to go to
    rerun the journal, but also to help the journaling subsystem figure out
    which portions of the filesystem data might require an UNDO.... in the
    context of the current system that would prevent, e.g. the large sections
    of ZEROs you get in softupdates filesystems when you crash.  The journal
    would be able to guarantee either the old data or the new data.  Crash
    recovery after a reboot would also be able to update the stale FSMIDs in
    the filesystem from the journal (where they are also stored), maintaining
    a level of consistency across crashes that most UNIX systems cannot do
    today.
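
    Sketched out, the recovery pass might look like this.  Every function
    here is a hypothetical stand-in, and the record is the jrecord
    sketched earlier: the stale on-disk FSMID picks the starting point in
    the journal, committed records are replayed forward, and uncommitted
    ones are rolled back using their UNDO data:

/*
 * Sketch of the crash-recovery pass: go both forwards and backwards
 * from the journal to make the filesystem state sane again.
 */
#include <stdint.h>

struct jrecord;			/* the hypothetical record sketched earlier */
struct vnode_sketch;		/* opaque stand-in for a file */

extern struct jrecord *journal_seek_fsmid(struct vnode_sketch *vp,
					  int64_t fsmid);	/* hypothetical */
extern struct jrecord *journal_next(struct vnode_sketch *vp,
				    struct jrecord *jr);	/* hypothetical */
extern int64_t	disk_fsmid(struct vnode_sketch *vp);		/* hypothetical */
extern uint64_t	jrecord_xid(struct jrecord *jr);		/* hypothetical */
extern int	journal_xid_committed(uint64_t xid);		/* hypothetical */
extern void	replay_record(struct vnode_sketch *vp, struct jrecord *jr);
extern void	undo_record(struct vnode_sketch *vp, struct jrecord *jr);

void
recover_file(struct vnode_sketch *vp)
{
	struct jrecord *jr;

	/*
	 * Start from the (possibly stale) FSMID the filesystem last
	 * managed to sync; everything at or before it is known good.
	 */
	for (jr = journal_seek_fsmid(vp, disk_fsmid(vp));
	     jr != NULL; jr = journal_next(vp, jr)) {
		if (journal_xid_committed(jrecord_xid(jr)))
			replay_record(vp, jr);	/* roll forward: new data */
		else
			undo_record(vp, jr);	/* roll back: old data */
	}
}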

    But why limit ourselves to that?  What if we want to guarantee that a
    high level operation, such as an 'install' command, which encompasses
    many filesystem operations, either succeeds in whole or fails in whole
    across a crash condition?  With a journal and this one data ordering
    requirement in place, WE CAN MAKE THAT GUARANTEE!   In fact, the
    combination of persistent FSMIDs and journaling would allow us to implement
    meta transactions that could encompass gigabytes worth of operations.
    It could give us a transactional capability that is visible at the coarse
    'shell' level, eventually.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>




