nullfs stabilization I
dillon at apollo.backplane.com
Tue Jan 10 10:56:15 PST 2006
:I know Matt wants a generic cache coherency layer, but until this is in
:place, I propose attached patch to fix this
:rename-race-and-file-vanishes bug in nullfs. Furthermore, you can't
:mount a nullfs from a nullfs as this will lead to a recursion stack
:overflow double fault. This has been fixed as well.
:I'm not sure if this single linked list is actually enough, because
:renames on the lower layer won't trigger cache_inval's on the upper
:layer at the moment.
:What do people think?
Wow. Ok. Well, I don't want to dampen your wonderful coding because
your patch is really quite nice! *BUT* while I like the concept, I
want to try to have as much of the sophistication in the cache_*() layer
as possible and as little in the NULLFS layer as possible. I'll
explain what I'm thinking of, and why I want it this way, below:
The pre-patch NULLFS code simply changes a_ops. Your new patch is
arguably more correct in that it also passes the correct NCP's to
the lower layer(s) (that is, the NCP's representative of the lower
layer's mount point), but in order to do so you are implementing
the fine details in all the NULLFS VOP routines. I think we need to
What I would like to do is implement all that logic in the
cache_*() code itself. I *want* the nullfs layer to be able to
simply change ap->a_head.a_ops and recurse down. I don't want the
nullfs layer to have to determine and pass the lower layer NCP's to
the lower layer.
I know this sounds a bit wrong, because then the ncp->nc_mount field
will be 'wrong' from the point of view of the lower layer, but in
fact lower layers do *NOT* use nc_mount. It's a bug that nullfs does
(or we can start calling it a bug :-)).
The way you get to the mount point is via ap->a_head.a_ops->vv_mount.
The ncp->nc_mount field was never meant to be used by the VFS code...
only to be used internally by cache_*(). It looks like I broke my own
rule... I have two references in NFS, but that's for later. So this
(1) The recursion problem can be solved simply by having nullfs use
ap->a_head.a_ops->vv_mount instead of ncp->nc_mount. Recursion
can thus be allowed (even if it does not make much sense).
If we only make null_nresolve() do all the dirty work and then have it
call a cache_shadow() function to indicate the relationship between the
upper level namecache and the lower level namecache, we can implement
all the remaining functionality in the cache_*() layer itself.
(2) Do not have a shadow callback. Instead have null_nresolve()
call a new function, cache_shadow(), which creates the relationship
between the upper and lower layer namecache records. Only
null_nresolve() would need the added sophistication. The other
null_n*() functions would simply remain what they were before.
(3) Have the cache_*() API functions then detect whether a namecache
record is part of a shadow set and 'do the right thing'.
So, e.g. cache_lock() would lock down through the layers.
cache_rename() would rename through the layers, and so on and so
forth. This isn't just a matter of recursing down through the layers,
but also of recursing up. If someone does a rename in the actual
filesystem, we want the cache records for any nullfs mounts on top of
that filesystem to be properly renamed or invalidated. So:
* This is one reason why we have to implement the functionality in
cache_*(). Because cache_*() will be able to recurse up through
higher layers as well as down through lower layers to 'do the right
Ok, so now... why else do I want to do things this way? Well, it goes
back to the cache coherency layer I've been talking about. Cache
coherency is no trivial matter. It can be hacked up if everything is
local to a machine, but our goal is to become clusterable... for things
to be clustered across multiple machines. We may NOT want to keep
around all the shadow namecache records. In fact we may HAVE to
re-resolve the lower layer on the fly depending on the circumstances.
Because of this I would prefer that the VFS code (everything except
VOP_NRESOLVE) *not* be aware of the relationship.
If you would like to take on this project (and it looks like you would!)
then this is what I think needs to be done:
(1) Do not have a shadowing VFS op (no vfs_getshadow).
(2) Have null_nresolve() make a shadowing API call called cache_shadow()
which informs the cache_*() layer of the relationship between the
upper and lower NCP's.
(3) Have the cache_*() layer deal with shadow sets automatically
(initially just doing what you are doing now in the cache_*() layer).
(4) Do not resolve lower layers in the null_n*() functions (other then
in null_nresolve()). Leave the original code intact... that is,
just change v_ops and pass down the original NCP's. We will
officially codify the method.
(5) Do not have the nullfs layer attempt to do the recursion for
cache_*() operations such as cache_rename() itself. The
cache_*() layer will become responsible for dealing with the shadow
Then add code to the cache_*() layer to deal with e.g.
cache_rename() by recursing downwards through the shadow chain
itself, rather then having the VFS layers do it. (Remember, with
these changes the upper (NULLFS) layer will be passed into UFS,
so UFS will be calling cache_*() functions on the upper layer so
the cache_*() functions need only recurse down).
(6) [ we test and commit ]
(7) This leaves us with just one issue... what happens when you
do a rename (remove, create, etc) directly in the base filesystem?
The cache_*() functions have to be able to recurse upwards as well
as downwards to at the very least invalidate the upper (nullfs)
layers so they don't get confused when things are ripped out from
This is the piece that will eventually require serious cache
coherency management in a clustered system. For now, it would just
be done locally.
(8) [ more testing and commits ]
More information about the Kernel