cvs commit: src/sys/kern vfs_cache.c vfs_syscalls.c vfs_vnops.c vfs_vopops.c src/sys/sys namecache.h stat.h
Matthew Dillon
dillon at apollo.backplane.com
Fri Aug 26 11:37:15 PDT 2005
:On Thu, Aug 25, 2005 at 03:09:21PM -0700, Matthew Dillon wrote:
:> The entire directory tree does not need to be in memory, only the
:> pieces that lead to (cached) vnodes. DragonFly's namecache subsystem
:> is able to guarantee this.
:
:*How* can it guarantee that without reading the whole directory tree
:into memory first? Unix filesystems have no way to determine which
:directories an inode is linked from. If you have /dir1/link1 and
:/dir2/dir3/link2 as hardlinks for the same inode, you can't correctly
:update the FSMID for dir2 without having read dir3 first, simply because
:no name cache entry exists.
This is true of hardlinks, yes, but if the purpose is to mirror
then it doesn't really matter which path is used to get to the file.
And from an auditing and security standpoint you don't have to worry
about pre-existing 'random' hardlinks going to places that they shouldn't,
because that's already been checked for. What you do want to know about
are newly created hardlinks in places where they shouldn't exist, and
that ability would not be impaired in the least. Also, directories
cannot be hardlinked, only files.
As problems go, this one would have virtually no effect on the types
of operations that we want to be able to accomplish. You can't just
throw up your hands, point to a corner-case situation that will hardly
ever occur in real life (and not at all for a huge chunk of potential
applications of the feature), and call it a showstopper.
If it turns out that the file hardlink issue interferes with a certain
type of operation that we want, it is also a very solvable
problem. Programs like cpdup can already deal with hardlinks, so the
real issue is whether you want to take the hit of scanning the entire
directory tree to find the links or whether you want to maintain a
lookaside database and use the journal to keep it up to date.
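For reference, the lookaside approach is simple. Something along these
lines is all a cpdup-style tool needs; the hash table and function
names below are invented for illustration:

    /*
     * Sketch of hardlink tracking in a mirroring tool, the way
     * programs like cpdup handle it: remember the first path seen
     * for each (st_dev, st_ino) pair and relink any later
     * occurrence to it instead of copying the data again.
     */
    #include <sys/stat.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct linkent {
        dev_t dev;
        ino_t ino;
        char firstpath[1024];
        struct linkent *next;
    };

    static struct linkent *linktab[4096];   /* hypothetical table */

    static struct linkent **
    linkslot(const struct stat *st)
    {
        return &linktab[(st->st_ino ^ st->st_dev) & 4095];
    }

    /* Returns the previously seen path for this inode, or NULL. */
    const char *
    remember_link(const char *path, const struct stat *st)
    {
        struct linkent *le;

        if (st->st_nlink <= 1)
            return NULL;
        for (le = *linkslot(st); le != NULL; le = le->next) {
            if (le->dev == st->st_dev && le->ino == st->st_ino)
                return le->firstpath;   /* hardlink already copied */
        }
        le = calloc(1, sizeof(*le));
        le->dev = st->st_dev;
        le->ino = st->st_ino;
        snprintf(le->firstpath, sizeof(le->firstpath), "%s", path);
        le->next = *linkslot(st);
        *linkslot(st) = le;
        return NULL;
    }

The journal would simply feed link/unlink events into the same table
instead of requiring a full tree scan to rebuild it.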
:> :On a running system, it is enough to either get notification when a
:> :certain vnode changed (kqueue model) or when a vnode changed (imon /
:> :dnotify model). Trying to detect in-flight changes is *not* utterly
:> :trivial for any model, since even accurate atime is already difficult to
:> :achieve for mmapped files. Believing that you can *reliably* back up a
:> :system based on VOP transactions alone is therefore a dream.
:>
:> This is not correct. It is certainly NOT enough to just be told
:> when an inode changes.... you need to know where in the namespace
:> the change occurred and you need to know how the change(s) affect
:> the namespace. Just knowing that a file with inode BLAH has been
:> modified is not nearly enough information.
:
:The point is that the application can determine which inodes it is
:interested in and reread e.g. a directory when it has changed. There are
:some edge cases which might be hard to handle without additional
:information (e.g. when a link is moved outside the currently supervised
:area and you want to continue its supervision). That's an entirely
:different question, though.
No. The problem is that the application (such as a mirroring program)
could be interested in ALL THE INODES, not just some of them. Monitoring
inodes doesn't help you catch situations where new files are created,
nor does it help you if you want to monitor activity on an entire
subtree (which could contain thousands of directories and millions of
files), or any situation where you need to monitor more than a handful
of inodes. The kqueue approach is just plain stupid, frankly. It is
totally unscalable and totally insufficient when dealing with terabyte
filesystems.
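To put numbers behind that: kqueue's EVFILT_VNODE requires an open
descriptor for every single file or directory being watched. This is
what the per-node registration looks like:

    /*
     * Sketch of kqueue-based vnode monitoring.  Note the open() on
     * every watched node; monitoring a million-file subtree means a
     * million open descriptors, which is what makes this approach
     * unworkable for whole-tree monitoring.
     */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <fcntl.h>

    int
    watch_one(int kq, const char *path)
    {
        struct kevent kev;
        int fd;

        fd = open(path, O_RDONLY);      /* one fd per watched node */
        if (fd < 0)
            return -1;
        EV_SET(&kev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
               NOTE_WRITE | NOTE_DELETE | NOTE_RENAME | NOTE_ATTRIB,
               0, NULL);
        return kevent(kq, &kev, 1, NULL, 0, NULL);
    }

And even then it tells you nothing about files created after the
registrations were made; you have to rescan the directory on every
NOTE_WRITE just to find them.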
:...
:> back out of it. If it has changed, you know that something changed
:> while you were processing the directory or file and you simply re-recurse
:> down and rescan just the bits that now have different FSMIDs.
:
:But it is also very limited because it doesn't allow any filtering on
:what is interesting. In the worst case you just update all the FSMIDs
This is incorrect. I just said in my last email that you *CAN* filter
on what is interesting. Maybe not with this first commit, but the basic
premise of using the namecache topology not only for monitoring but also
for configuration and control is just about the only approach that
will actually work for implementing a filtering mechanism,
because it can cover millions of files and directories with very little
effort and because it can be inclusive of files or dirs that have not
yet been created.
What you are proposing doesn't even come close to having the monitoring
and control capabilities that we need.
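To illustrate, here is roughly what the incremental rescan quoted
above looks like in a mirroring program, assuming the new FSMID field
shows up as st_fsmid in struct stat (this commit touches sys/stat.h).
last_fsmid(), store_fsmid(), and copy_file() stand in for a
hypothetical lookaside database and copy logic:

    /*
     * Sketch of an FSMID-driven incremental rescan.  Because FSMID
     * changes propagate up the directory tree, an unchanged FSMID on
     * a directory lets us skip its entire subtree.
     */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <dirent.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int64_t last_fsmid(const char *path);           /* hypothetical */
    void    store_fsmid(const char *path, int64_t); /* hypothetical */
    void    copy_file(const char *path);            /* hypothetical */

    void
    rescan(const char *path)
    {
        struct stat st;
        struct dirent *de;
        DIR *dir;
        char sub[1024];

        if (stat(path, &st) < 0 || st.st_fsmid == last_fsmid(path))
            return;             /* unchanged: skip the whole subtree */
        if (S_ISDIR(st.st_mode)) {
            if ((dir = opendir(path)) == NULL)
                return;
            while ((de = readdir(dir)) != NULL) {
                if (strcmp(de->d_name, ".") == 0 ||
                    strcmp(de->d_name, "..") == 0)
                    continue;
                snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
                rescan(sub);    /* recurse only into changed bits */
            }
            closedir(dir);
        } else {
            copy_file(path);
        }
        store_fsmid(path, st.st_fsmid);
    }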
:for nothing. It also means that as long as there is no way to store them
:persistently, you can't free namecache entries without having to deal
:with exactly those cases in applications. Storing them persistently has
:to deal with unrecorded changes which wouldn't be detected. Just think
:about dual-booting to FreeBSD.
There is nothing anyone can do about some unrelated operating system
messing around with your filesystems, nor should we restrict our activities
based on the possibility. This is a DragonFly feature for systems running
DragonFly, not for systems running FreeBSD or Linux or any other OS.
:> For example, softupdates right now is not able to guarantee data
:> consistency. If you crash while writing something out then on reboot
:> you can wind up with some data blocks full of zeros, or full of old
:> data, while other data blocks contain new data.
:
:That's not so much a problem of softupdates, but of any filesystem without very
:strong data journaling. ZFS is said to do something in that area, but it
:can't really solve interactions which cross filesystems. The very same
:problem exists for FSMIDs. This is something where a transactional database
:and a normal filesystem differ: filesystems almost never have full
:write-ahead log files, because it makes them awfully slow. The most
:important reason is that applications have no means to specify explicit
:transaction borders, so you have to assume an autocommit style usage
:always.
:
:Joerg
I have no idea what you are trying to say here, Joerg. You seem to be
throwing up your hands and saying that we shouldn't implement it
because it isn't perfect, but your proposal to monitor inodes (i.e.
via kqueue) can't handle even a tenth of the types of operations I
want DragonFly to be able to do.
Insofar as persistent storage goes, we have several choices. My number
one choice is to integrate it into UFS, because it's almost trivial to
do so. A filesystem certainly does *NOT* have to be natively journaled or
transactional in any way... all we have to do is update the inode with
the new FSMID *after* the related data has been synchronized, and that's
a very easy algorithm. It doesn't even have to sync the file; there is
nothing preventing us from writing out transitional FSMIDs (instead of
the latest one) based on what we've synced to disk. This is a far
easier situation to deal with than e.g. softupdates, because we do not
have to track crazy interactions within the filesystem. The FSMIDs are
allowed to be 'behind' the synced data as long as the synced data does
not get ahead of the high level journal.
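To illustrate the algorithm (the function and field names below are
hypothetical; only the ordering matters):

    /*
     * The entire ordering rule for persistent FSMIDs: snapshot the
     * FSMID, flush the data it covers, and only then write the
     * (possibly stale) FSMID into the inode.  The on-disk FSMID may
     * lag the in-memory one; it must simply never get ahead of the
     * data on the platter.  No softupdates-style dependency tracking
     * is required.
     */
    #include <stdint.h>

    struct vnode;   /* opaque for this sketch */

    int64_t snapshot_fsmid(struct vnode *vp);              /* hypothetical */
    void    flush_dirty_buffers(struct vnode *vp);         /* hypothetical */
    void    write_inode_fsmid(struct vnode *vp, int64_t);  /* hypothetical */

    void
    sync_one_vnode(struct vnode *vp)
    {
        int64_t fsmid = snapshot_fsmid(vp); /* taken BEFORE the flush */

        flush_dirty_buffers(vp);            /* data reaches the disk  */
        write_inode_fsmid(vp, fsmid);       /* inode updated last; a
                                             * transitional FSMID is
                                             * explicitly allowed     */
    }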
More to the point, though, it's a really bad idea to limit features
simply because some filesystem written 20 years ago was not originally
built to handle them. DragonFly is about pushing the limits, not about
accommodating them. The journaling is a big leap for BSD operating
systems, but there is a big gap in between that needs to be filled for
those sysadmins who want to have alternative backup and auditing
methodologies but who want to avoid doing continuous and full scans of
their (huge) filesystems, not to mention other potential features.
DATABASE TRANSACTIONS PRIMER
I sense that there is a fundamental misunderstanding of how database
transactions can actually work here, and how FSMIDs relate to the
larger scheme, one that is probably shared by many people, so I will
endeavor to explain it.
If you take a high level view of a database-like transaction you
basically have a BEGIN, do some work, and a COMMIT. When you get the
acknowledgement from the COMMIT, that is a guarantee that if a crash
occurred right there your transaction will still be good after the reboot.
But accomplishing this does not imply that the data must be synchronized
to disk through the filesystem, nor does it imply that other, later
transactions which had not yet been acknowledged couldn't be written to
disk. In our environment it only means that the operation must be
journaled to persistent store (which is DIFFERENT from the activity going
on in the filesystem), and that after a crash the system must be able
to UNDO any writes that were written to the disk or to the journal
that were related to UNCOMMITTED transactions.
If you think about it, what this means is that the actual disk I/O we do
can be a lot more flexible than our high level perception of the
transaction. It's very important that people understand this.
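To make that concrete, here is what the split looks like expressed as
data. The record layout below is invented for illustration and is not
the actual DragonFly journal format:

    /*
     * Illustration of the split between the journal and the
     * filesystem.  Acknowledging a COMMIT only requires that the
     * journal records for the transaction are on persistent store;
     * the filesystem's own buffers can go to disk earlier or later,
     * in any order, because an uncommitted write can be UNDOne from
     * the journal on recovery.
     */
    #include <stdint.h>

    struct jrecord {
        int64_t txid;       /* transaction this record belongs to */
        int     type;       /* JT_BEGIN, JT_WRITE, JT_COMMIT       */
        int64_t offset;     /* for JT_WRITE: file offset           */
        int     len;        /* for JT_WRITE: byte count            */
        /*
         * A JT_WRITE record carries both the old data (for UNDO)
         * and the new data (for REDO), so recovery can move either
         * backwards or forwards through filesystem state.
         */
    };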
Persistent FSMIDs fit into this idea very well. When used as a recovery
mechanism all we have to do is guarantee that the transactions related
to the FSMID we are writing have already gotten onto the disk. Since we
can delay FSMID synchronization indefinitely, this is a trivial
requirement that does not need the sophistication of softupdates and
does not preclude, e.g. a lookaside database file to hold the FSMIDs
for filesystems that cannot store them persistently.
Our high level journal can be used to accomplish transactional unwinding,
that is to UNDO changes made to the filesystem that are not
transactionally consistent. In the context of a filesystem, what
this means is that we can use our high level journal to make the
persistent FSMID completely consistent with the filesystem state after
a crash either by undoing filesystem operations to bring the filesystem
back to the state as of the stored FSMID, or by regenerating the FSMID
from the high level journal. WE CAN GO BOTH FORWARDS AND BACKWARDS IN
ORDER TO MAKE THE FILESYSTEM STATE SANE AGAIN AFTER A CRASH.
THE ONLY REQUIREMENT for being able to accomplish this is that the
filesystem operations in question not be synchronized to the disk until
the related journal entry has been acknowledged. Note that I am not
saying that the operations should stall, I am simply saying that they
would not be synchronized to the disk... they would still be in the
buffer cache, and programs would still see instant updates to the FSMID
and the file data.
Also remember that unlike softupdates, the FSMID we write to the disk
does not have to be the latest one, so we do not get stuck in a situation
where a program that is continuously writing to a file prevents its
data buffers from being written out to the platter.
All that it means is that the FSMID written to the disk may be slightly
behind the FSMID stored in the journal, and both will be behind the
real-time FSMID stored in system memory.
Now it turns out that accomplishing this *ONE* requirement can be done
solely within the high level buffer cache implementation. It does not
require interactions with the filesystem; UFS, for example, does not
need to have any knowledge of the mechanism.
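Expressed as code, the entire interaction reduces to a single test in
the buffer cache. All names below are hypothetical:

    /*
     * The single ordering test, implemented entirely in the buffer
     * cache: a dirty buffer may go to disk only after the journal
     * entries covering its modifications have been acknowledged as
     * being on persistent store.  The filesystem itself never sees
     * this interaction.
     */
    #include <stdint.h>

    struct mount;

    struct buf {
        int64_t b_journal_seq;  /* seq no of the last journal record
                                 * describing a change to this buffer
                                 * (hypothetical field)              */
        struct mount *b_mount;
    };

    int64_t journal_acked_seq(struct mount *mp);    /* hypothetical */

    static int
    buf_can_flush(struct buf *bp)
    {
        return (bp->b_journal_seq <= journal_acked_seq(bp->b_mount));
    }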
On crash recovery the FSMIDs can be used by the journaling subsystem
to determine not only how far back in the journal it has to go to
rerun the journal, but also to help the journaling subsystem figure out
which portions of the filesystem data might require an UNDO.... in the
context of the current system that would prevent, e.g. the large sections
of ZEROs you get in softupdates filesystems when you crash. The journal
would be able to guarantee either the old data or the new data. Crash
recovery after a reboot would also be able to update the stale FSMIDs in
the filesystem from the journal (where they are also stored), maintaining
a level of consistency across crashes that most UNIX systems cannot do
today.
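A sketch of what such a recovery pass might look like, again with
invented names:

    /*
     * Crash recovery sketch.  The stale on-disk FSMID says where in
     * the journal to start.  Committed transactions are rolled
     * forward (REDO), leaving either old data or new data but never
     * a half-written block of zeros; uncommitted transactions are
     * rolled back (UNDO); finally the on-disk FSMIDs are brought up
     * to date from the copies stored in the journal.
     */
    #include <stdint.h>

    struct mount;
    struct jrecord;

    int64_t         read_fs_fsmid(struct mount *mp);         /* hypothetical */
    struct jrecord *journal_seek(struct mount *mp, int64_t); /* hypothetical */
    struct jrecord *journal_next(struct jrecord *jr);        /* hypothetical */
    int             tx_committed(struct jrecord *jr);        /* hypothetical */
    void            redo(struct jrecord *jr);                /* hypothetical */
    void            undo(struct jrecord *jr);                /* hypothetical */
    void            update_fs_fsmids(struct mount *mp);      /* hypothetical */

    void
    journal_recover(struct mount *mp)
    {
        struct jrecord *jr;

        for (jr = journal_seek(mp, read_fs_fsmid(mp)); jr != NULL;
             jr = journal_next(jr)) {
            if (tx_committed(jr))
                redo(jr);       /* roll forward to committed state */
            else
                undo(jr);       /* roll back uncommitted writes    */
        }
        update_fs_fsmids(mp);
    }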
But why limit ourselves to that? What if we want to guarantee that a
high level operation, such as an 'install' command, which encompasses
many filesystem operations, either succeeds in whole or fails in whole
across a crash condition? With a journal and implementing this one data
ordering requirement, WE CAN MAKE THAT GUARANTEE! In fact, the
combination of persistent FSMIDs and journaling would allow us to implement
meta transactions that could encompass gigabytes worth of operations.
It could give us a transactional capability that is visible at the coarse
'shell' level, eventually.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>