Working on vnode sequencing and locking - HEAD will destabilize for a little bit

Thu Aug 10 10:03:29 PDT 2006

:>    Finally, the vnode locking will be moved out of the kernel layer and
:>    into the filesystem layer.  This is again for userland VFS and clustering.
:>    It will mean that we do not have to hold a vnode lock across a filesystem
:>    operation... we've already seen what that can lead to with NFS when an
:>    NFS server goes down.
:
:Good stuff (and long overdue imho). You may not know that I helped Kirk
:come up with the original vfs and, even then, argued that the vnode locking
:made things too complicated. Since then, it got even more complex, due to
:dynamic allocation of vnodes, etc. (The original vfs used a fixed vnode
:table allocated at boot. Although primitive, it had the big advantage that
:a vp always worked, so you could just check on the vnode's status with it.)

    Yah, I think this is where some of the lockmgr cruft like LK_DRAIN
    came from.  Most of that has now been removed from DragonFly, though
    there are still a few races where DragonFly has to check pointers
    after a blocking lock has been obtained.  For example, line 722
    in kern/vfs_cache.c.

    However, I have finally fixed the issue of a vnode possibly getting
    ripped out from under a caller trying to vn_lock() it.  The vref()
    code is now atomic and does not block in any way, which means that
    we can obtain a ref on an ephemeral vnode pointer prior to attempting
    to lock it (an ephemeral vnode pointer being a vnode pointer which is
    not ref'd, such as is stored in a namecache record or the inode hash
    table or accessed via a mount's vnode list).   The worst that happens
    is that the vnode becomes VRECLAIMED by the time the blocked lock
    returns.  The ref prevents it from being ripped out from under the
    lockmgr and also prevents it from being reused.  The caller need only
    check to see if it is VRECLAIMED to determine whether a retry is 
    needed.
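
    The ref-then-lock-then-check sequence described above can be sketched
    roughly as follows.  This is a userland toy model, not the DragonFly
    code: struct toy_vnode, toy_vref(), toy_vget() and friends are made-up
    names, a pthread mutex stands in for the blocking vnode lock, and a
    plain atomic flag stands in for VRECLAIMED.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; not the real kernel structure. */
struct toy_vnode {
    atomic_int      refs;       /* reference count */
    atomic_bool     reclaimed;  /* models the VRECLAIMED state */
    pthread_mutex_t lock;       /* models the blocking vnode lock */
};

static void toy_vnode_init(struct toy_vnode *vp)
{
    atomic_init(&vp->refs, 0);
    atomic_init(&vp->reclaimed, false);
    pthread_mutex_init(&vp->lock, NULL);
}

/* Atomic, non-blocking ref: safe to call on an un-ref'd pointer. */
static void toy_vref(struct toy_vnode *vp)
{
    atomic_fetch_add(&vp->refs, 1);
}

static void toy_vrele(struct toy_vnode *vp)
{
    int old = atomic_fetch_sub(&vp->refs, 1);
    assert(old > 0);
    (void)old;
}

/* Models the vnode being reclaimed while someone else waits on it. */
static void toy_reclaim(struct toy_vnode *vp)
{
    atomic_store(&vp->reclaimed, true);
}

/*
 * Ref first, then take the lock (which may block), then check the
 * reclaimed flag.  Returns true with vp ref'd and locked.  Returns
 * false with the lock and ref dropped; the caller should redo its
 * lookup and retry.
 */
static bool toy_vget(struct toy_vnode *vp)
{
    toy_vref(vp);                     /* pin the memory before blocking */
    pthread_mutex_lock(&vp->lock);    /* may sleep */
    if (atomic_load(&vp->reclaimed)) {
        pthread_mutex_unlock(&vp->lock);
        toy_vrele(vp);
        return false;
    }
    return true;
}

/* Unlock and drop the ref taken by a successful toy_vget(). */
static void toy_vput(struct toy_vnode *vp)
{
    pthread_mutex_unlock(&vp->lock);
    toy_vrele(vp);
}
```

    The point of the ordering is that the ref is taken before anything
    can block, so the structure cannot be reused while the caller sleeps
    on the lock.  The model skips everything the real kernel has to deal
    with around reclamation itself.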

    I am not quite confident enough to actually free() a dynamically 
    allocated vnode, but it is theoretically possible to do so now
    without having to worry about stale pointer references.

:OSF did something similar and changed the directory offset caching etc. to
:use a "soft reference". By "soft reference" I mean there was a rev cnt that
:was incremented each time the item (directory) changed, which they saved
:with the cached ref. When they went to use the cached ref, they just compared
:the ref cnt, to see if it was stale. (One of the main reasons for the locks
:above the vnode boundary was so that directories could be locked against
:changes between lookup and update, so directory offset caching would still
:work. The other was related to lookup/rename, which you are already well
:aware of:-)
:
:Good luck with it, rick

    The DragonFly namecache was completely rewritten a year or so ago.
    It is 100% deterministic now so e.g. when a RENAME is issued the
    appropriate record(s) in the namecache are actually moved around in
    the namecache topology.  A namecache record can represent both
    existent and non-existent names so it is possible to 'lock' the
    namespace for a file to be created before the file actually exists.

    DragonFly has a whole set of 'new' namespace VOPs, e.g. VOP_NRENAME,
    VOP_NCREATE, and so forth, which take locked namecache pointers as
    arguments instead of vnode, directory vnode, and cnp pointers.  This
    removes the need to lock the governing directories, though, of course,
    the old filesystems such as UFS still do so since they still use all
    the old LOOKUP side effect + VOP_<BLAH> combination.  VOP_NRENAME
    takes just three arguments: a pointer to a locked namecache record
    representing the source, another one representing the target, and
    a cred, and that's it.
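
    To illustrate the shape of such a call, here is a toy sketch of a
    rename that takes only two locked name records and a cred.  It is
    not the DragonFly prototype: struct toy_ncp, struct toy_cred and
    toy_nrename() are invented for the example, and the real namecache
    moves records around in its topology rather than swapping a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical credential; permission checks are left out below. */
struct toy_cred {
    int uid;
};

/* Hypothetical name record.  It can describe a name that exists
 * (vp != NULL) or one known not to exist (vp == NULL), so a name
 * can be locked before the file is created. */
struct toy_ncp {
    const char *name;
    void       *vp;      /* file the name resolves to, or NULL */
    bool        locked;  /* caller holds the namespace lock */
};

/*
 * Rename using only locked name records: no directory vnode is
 * passed in or locked.  Returns 0 on success, -1 if either record
 * is not locked or the source name does not exist.
 */
static int toy_nrename(struct toy_ncp *src, struct toy_ncp *dst,
                       const struct toy_cred *cred)
{
    (void)cred;                  /* credential checks omitted */
    if (!src->locked || !dst->locked)
        return -1;
    if (src->vp == NULL)
        return -1;
    dst->vp = src->vp;           /* target name now resolves to the file */
    src->vp = NULL;              /* source name becomes non-existent */
    return 0;
}
```

    Since both names are already locked by the caller, nothing in the
    call needs to touch the governing directories.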

    I'm not in favor of trying to cache directory offsets in a higher
    kernel layer.  I think directories are very filesystem specific and
    any caching should be done by the filesystem itself.  The namespace
    is a different matter.  Since DragonFly's goal is clustering, the
    namespace must be managed and controlled by the kernel above the
    filesystem layer.  A filesystem could theoretically cache offset
    information in the kernel-managed namecache record but I'm not sure
    how good an idea it is.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>