nullfs stabilization I

Tue Jan 10 10:56:15 PST 2006

:hey,
:
:I know Matt wants a generic cache coherency layer, but until this is in 
:place, I propose attached patch to fix this 
:rename-race-and-file-vanishes bug in nullfs.  Furthermore, you can't 
:mount a nullfs from a nullfs as this will lead to a recursion stack 
:overflow double fault.  This has been fixed as well.
:
:I'm not sure if this single linked list is actually enough, because 
:renames on the lower layer won't trigger cache_inval's on the upper 
:layer at the moment.
:
:What do people think?
:
:cheers
:   simon

    Wow.  Ok.  Well, I don't want to dampen your wonderful coding because
    your patch is really quite nice!  *BUT* while I like the concept, I
    want to try to have as much of the sophistication in the cache_*() layer
    as possible and as little in the NULLFS layer as possible.  I'll
    explain what I'm thinking of, and why I want it this way, below:

    The pre-patch NULLFS code simply changes a_ops.  Your new patch is
    arguably more correct in that it also passes the correct NCP's to
    the lower layer(s) (that is, the NCP's representative of the lower
    layer's mount point), but in order to do so you are implementing
    the fine details in all the NULLFS VOP routines.  I think we need to
    avoid this.

    What I would like to do is implement all that logic in the 
    cache_*() code itself.  I *want* the nullfs layer to be able to
    simply change ap->a_head.a_ops and recurse down.  I don't want the
    nullfs layer to have to determine and pass the lower layer NCP's to
    the lower layer.

    I know this sounds a bit wrong, because then the ncp->nc_mount field
    will be 'wrong' from the point of view of the lower layer, but in
    fact lower layers do *NOT* use nc_mount.  It's a bug that nullfs does
    (or we can start calling it a bug :-)).

    The way you get to the mount point is via ap->a_head.a_ops->vv_mount.
    The ncp->nc_mount field was never meant to be used by the VFS code...
    only to be used internally by cache_*().  It looks like I broke my own
    rule... I have two references in NFS, but that's for later.  So this
    means:

    (1) The recursion problem can be solved simply by having nullfs use
	ap->a_head.a_ops->vv_mount instead of ncp->nc_mount.  Recursion
	can thus be allowed (even if it does not make much sense).

    If we only make null_nresolve() do all the dirty work and then have it
    call a cache_shadow() function to indicate the relationship between the
    upper level namecache and the lower level namecache, we can implement
    all the remaining functionality in the cache_*() layer itself.

    (2) Do not have a shadow callback.  Instead have null_nresolve()
	call a new function, cache_shadow(), which creates the relationship
	between the upper and lower layer namecache records.  Only
	null_nresolve() would need the added sophistication.  The other
	null_n*() functions would simply remain what they were before.

    (3) Have the cache_*() API functions then detect whether a namecache
	record is part of a shadow set and 'do the right thing'.

    So, e.g. cache_lock() would lock down through the layers. 
    cache_rename() would rename through the layers, and so on and so
    forth.  This isn't just a matter of recursing down through the layers,
    but also of recursing up.  If someone does a rename in the actual 
    filesystem, we want the cache records for any nullfs mounts on top of
    that filesystem to be properly renamed or invalidated.  So:

    * This is one reason why we have to implement the functionality in
      cache_*().  Because cache_*() will be able to recurse up through
      higher layers as well as down through lower layers to 'do the right
      thing'.

    Ok, so now... why else do I want to do things this way?  Well, it goes
    back to the cache coherency layer I've been talking about.  Cache
    coherency is no trivial matter.  It can be hacked up if everything is
    local to a machine, but our goal is to become clusterable... for things
    to be clustered across multiple machines.  We may NOT want to keep
    around all the shadow namecache records.  In fact we may HAVE to
    re-resolve the lower layer on the fly depending on the circumstances.
    Because of this I would prefer that the VFS code (everything except
    VOP_NRESOLVE) *not* be aware of the relationship.

    --

    If you would like to take on this project (and it looks like you would!)
    then this is what I think needs to be done:

    (1) Do not have a shadowing VFS op (no vfs_getshadow).  

    (2) Have null_nresolve() make a shadowing API call called cache_shadow()
	which informs the cache_*() layer of the relationship between the
	upper and lower NCP's.

    (3) Have the cache_*() layer deal with shadow sets automatically
	(initially just doing what you are doing now in the cache_*() layer).

    (4) Do not resolve lower layers in the null_n*() functions (other then
	in null_nresolve()).  Leave the original code intact... that is,
	just change v_ops and pass down the original NCP's.  We will
	officially codify the method.

    (5) Do not have the nullfs layer attempt to do the recursion for
	cache_*() operations such as cache_rename() itself.  The 
	cache_*() layer will become responsible for dealing with the shadow
	chain.

        Then add code to the cache_*() layer to deal with e.g. 
	cache_rename() by recursing downwards through the shadow chain
	itself, rather then having the VFS layers do it.  (Remember, with
	these changes the upper (NULLFS) layer will be passed into UFS,
	so UFS will be calling cache_*() functions on the upper layer so
	the cache_*() functions need only recurse down).

    (6) [ we test and commit ]

    (7) This leaves us with just one issue... what happens when you
	do a rename (remove, create, etc) directly in the base filesystem?
	The cache_*() functions have to be able to recurse upwards as well
	as downwards to at the very least invalidate the upper (nullfs)
	layers so they don't get confused when things are ripped out from
	under them.

	This is the piece that will eventually require serious cache
	coherency management in a clustered system.  For now, it would just
	be done locally.

    (8) [ more testing and commits ]

						-Matt