Working on vnode sequencing and locking - HEAD will destabilize for a little bit
Matthew Dillon
dillon at apollo.backplane.com
Thu Aug 10 10:03:29 PDT 2006
:> Finally, the vnode locking will be moved out of the kernel layer and
:> into the filesystem layer. This is again for userland VFS and clustering.
:> It will mean that we do not have to hold a vnode lock across a filesystem
:> operation... we've already seen what that can lead to with NFS when an
:> NFS server goes down.
:
:Good stuff (and long overdue imho). You may not know that I helped Kirk
:come up with the original vfs and, even then, argued that the vnode locking
:made things too complicated. Since then, it got even more complex, due to
:dynamic allocation of vnodes, etc. (The original vfs used a fixed vnode
:table allocated at boot. Although primitive, it had the big advantage that
:a vp always worked, so you could just check on the vnode's status with it.)
Yah, I think this is where some of the lockmgr cruft like LK_DRAIN
came from. Most of that has now been removed from DragonFly though
there are still a few races where DragonFly has to check pointers
after a blocking lock has been obtained. For example, line 722
in kern/vfs_cache.c.
However, I have finally fixed the issue of a vnode possibly getting
ripped out from under a caller trying to vn_lock() it. The vref()
code is now atomic and does not block in any way which means that
we can obtain a ref on an ephemeral vnode pointer prior to attempting
to lock it (an ephemeral vnode pointer being a vnode pointer which is
not ref'd, such as is stored in a namecache record or the inode hash
table or accessed via a mount's vnode list). The worst that happens
is that the vnode becomes VRECLAIMED by the time the blocked lock
returns. The ref prevents it from being ripped out from under the
lockmgr and also prevents it from being reused. The caller need only
check to see if it is VRECLAIMED to determine whether a retry is
needed.
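
The ref-then-lock-then-check sequence described above can be sketched
roughly as follows. This is only an illustration under stated
assumptions: every toy_* name and the TOY_VRECLAIMED flag are
hypothetical stand-ins, a C11 spin flag stands in for the blocking
vnode lock, and none of it is the actual DragonFly code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not DragonFly's real structures. */
#define TOY_VRECLAIMED 0x1

struct toy_vnode {
    atomic_int  refcnt;  /* bumped atomically, never blocks */
    atomic_int  flags;   /* TOY_VRECLAIMED is set once the vnode is reclaimed */
    atomic_flag lock;    /* stand-in for the (blocking) vnode lock */
};

static void toy_vnode_init(struct toy_vnode *vp)
{
    atomic_init(&vp->refcnt, 0);
    atomic_init(&vp->flags, 0);
    atomic_flag_clear(&vp->lock);
}

static void toy_vref(struct toy_vnode *vp)   { atomic_fetch_add(&vp->refcnt, 1); }
static void toy_vrele(struct toy_vnode *vp)  { atomic_fetch_sub(&vp->refcnt, 1); }
static void toy_unlock(struct toy_vnode *vp) { atomic_flag_clear(&vp->lock); }

static void toy_lock(struct toy_vnode *vp)
{
    /* Spin until the flag is acquired; a real kernel lock would sleep. */
    while (atomic_flag_test_and_set(&vp->lock))
        ;
}

/*
 * Take a ref on an unreferenced ("ephemeral") pointer first, then lock.
 * Returns true with vp referenced and locked, or false (with the ref and
 * lock dropped) if the vnode was reclaimed and the caller must retry.
 */
static bool toy_get_locked(struct toy_vnode *vp)
{
    toy_vref(vp);                 /* keeps vp from being freed or reused */
    toy_lock(vp);                 /* may wait; vp stays valid meanwhile */
    if (atomic_load(&vp->flags) & TOY_VRECLAIMED) {
        toy_unlock(vp);
        toy_vrele(vp);
        return false;             /* caller redoes its lookup */
    }
    return true;
}
```

A caller would typically loop: look the pointer up again and call
toy_get_locked() until it returns true or the lookup itself fails.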
I am not quite confident enough to actually free() a dynamically
allocated vnode, but it is theoretically possible to do so now
without having to worry about stale pointer references.
:OSF did something similar and changed the directory offset caching etc. to
:use a "soft reference". By "soft reference" I mean there was a rev cnt that
:was incremented each time the item (directory) changed, which they saved
:with the cached ref. When they went to use the cached ref, they just compared
:the rev cnt, to see if it was stale. (One of the main reasons for the locks
:above the vnode boundary was so that directories could be locked against
:changes between lookup and update, so directory offset caching would still
:work. The other was related to lookup/rename, which you are already well
:aware of:-)
:
:Good luck with it, rick
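
The "soft reference" idea in the quoted text above (a revision count
saved alongside a cached reference and compared on use) might look
something like this minimal sketch. All names here are hypothetical
and are not taken from OSF or any real kernel; the sketch also assumes
the directory object itself outlives the cached reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types, for illustration only. */
struct toy_dir {
    unsigned long rev;       /* incremented on every change to the directory */
    size_t        nentries;
};

struct toy_softref {
    const struct toy_dir *dir;
    unsigned long         rev;     /* directory revision when cached */
    size_t                offset;  /* the cached directory offset */
};

/* Any modification bumps the revision, invalidating older soft refs. */
static void toy_dir_modify(struct toy_dir *d, size_t new_nentries)
{
    d->nentries = new_nentries;
    d->rev++;
}

/* Remember an offset together with the revision it was computed against. */
static struct toy_softref toy_cache_offset(const struct toy_dir *d, size_t off)
{
    struct toy_softref r = { d, d->rev, off };
    return r;
}

/* The cached offset is usable only if the directory has not changed since. */
static bool toy_softref_valid(const struct toy_softref *r)
{
    return r->dir->rev == r->rev;
}
```

No lock is held between caching and use; a stale reference is simply
detected and the offset recomputed.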
The DragonFly namecache was completely rewritten a year or so ago.
It is 100% deterministic now so e.g. when a RENAME is issued the
appropriate record(s) in the namecache are actually moved around in
the namecache topology. A namecache record can represent both
existent and non-existent names so it is possible to 'lock' the
namespace for a file to be created before the file actually exists.
DragonFly has a whole set of 'new' namespace VOPs, e.g. VOP_NRENAME,
VOP_NCREATE, and so forth, which take locked namecache pointers as
arguments instead of vnode, directory vnode, and cnp pointers. This
removes the need to lock the governing directories though, of course,
the old filesystems such as UFS still do so since they still use all
the old LOOKUP side effect + VOP_<BLAH> combination. VOP_NRENAME
takes just three arguments: a pointer to a locked namecache record
representing the source, another one representing the target, and
a cred, and that's it.
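
To make the shape of such a call concrete, here is a toy sketch of a
rename that takes only two locked name records and a cred. Every
toy_* type and function is a hypothetical stand-in and not the real
VOP_NRENAME interface or namecache structure; for brevity it moves the
vnode association between two records rather than moving records
around in a topology. A record with no vnode stands for a name that
does not exist.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; the real DragonFly structures differ. */
struct toy_vnode { int ino; };
struct toy_cred  { int uid; };

struct toy_ncp {
    const char       *name;
    struct toy_vnode *vp;      /* NULL: the name is known not to exist */
    bool              locked;  /* caller holds the lock on this name */
};

/*
 * Toy rename: a locked source record, a locked target record, and a cred.
 * No directory vnode is passed or locked. Returns 0 on success, -1 on error.
 */
static int toy_nrename(struct toy_ncp *src, struct toy_ncp *dst,
                       const struct toy_cred *cred)
{
    (void)cred;                      /* permission checks omitted */
    if (!src->locked || !dst->locked)
        return -1;                   /* both names must already be locked */
    if (src->vp == NULL)
        return -1;                   /* source name does not exist */
    dst->vp = src->vp;               /* target name now resolves to the file */
    src->vp = NULL;                  /* source becomes a non-existent name */
    return 0;
}
```

Because the target record can be locked while its vp is still NULL,
the name is reserved before anything exists under it.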
I'm not in favor of trying to cache directory offsets in a higher
kernel layer. I think directories are very filesystem specific and
any caching should be done by the filesystem itself. The namespace
is a different matter. Since DragonFly's goal is clustering, the
namespace must be managed and controlled by the kernel above the
filesystem layer. A filesystem could theoretically cache offset
information in the kernel-managed namecache record but I'm not sure
how good an idea it is.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>