a take at cache coherency
csaba.henk at creo.hu
Mon Jan 23 03:08:01 PST 2006
I sent a patch to submit@ implementing a kind of cache coherency:
Lightweight, in particular
- Unidirectional synchronization. Only upper layers keep a reference to
lower ones, a lower ("shadowed") layer doesn't need to be aware of
- No extra locking is used; rather, the semantics of cache locking are
tweaked a bit.
- Namecache API got messed up a bit, too. Some attributes can't be
accessed directly: you can't use "ncp->nc_vp", you need to do
"cache_grphead(ncp)->nc_vp". Eventually some kind of wrapper should
be made around namecache entries, so that you could write eg.
"NC_VP(ncp)". At this stage I didn't do this.
- The basic semantic unit of the cache layer was a namecache entry.
Now there are two basic units: namecache entry and shadow group (of
- Some attributes of namecache entries were transformed into shadow
group attributes: these are
nc_vp, nc_error, nc_timeout, nc_exlocks, nc_locktd, and some of
NCF_LOCKED, NCF_WHITEOUT, NCF_UNRESOLVED, NCF_ISSYMLINK, NCF_ISDIR,
-- i.e., attributes which refer to the underlying vnode, and locking.
The rest remained entry attributes -- most notably name and ancestral
Hence there is no wondering like "umm, I locked/unlocked/resolved
the overlaying/overlayed namecache entry, now how to ensure that the
one under/over is in a similar state?" You lock, unlock, resolve
shadow groups, not namecache entries, period.
- Shadow groups are abstract things: there is no dedicated "struct xyz"
for representing them. Instead, namecache entries got one additional
field, nc_shadowed. Shadow groups are the connected components of
the nc_shadowed graph (the one spanned by <ncp, ncp->nc_shadowed>
edges). Each such group is a tree, so they can be represented via
the head (root, tip, node with no outgoing edge) of the tree. Hence
group attributes are kept by the head of the tree. For each
namecache entry, the cache_grphead(ncp) and cache_grphead_l(ncp)
functions return the head of the shadow group ncp belongs to (the
latter expects the group of ncp to be locked). So read the above
form "cache_grphead(ncp)->nc_vp" as "get the associated vnode of the
shadow group of ncp".
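A minimal sketch of the head lookup described above, assuming a stripped-down stand-in for struct namecache that models only nc_vp and nc_shadowed. The names toy_ncp, toy_grphead and TOY_NC_VP are hypothetical and are not the routines from the patch; the macro only illustrates the NC_VP(ncp) wrapper idea.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct namecache: only the fields this sketch needs. */
struct toy_ncp {
    void           *nc_vp;        /* meaningful only on the group head */
    struct toy_ncp *nc_shadowed;  /* next lower entry, NULL on the head */
};

/* Follow nc_shadowed until the entry with no outgoing edge is reached. */
static struct toy_ncp *
toy_grphead(struct toy_ncp *ncp)
{
    assert(ncp != NULL);
    while (ncp->nc_shadowed != NULL)
        ncp = ncp->nc_shadowed;
    return ncp;
}

/* Hypothetical accessor in the spirit of the NC_VP(ncp) wrapper idea. */
#define TOY_NC_VP(ncp) (toy_grphead(ncp)->nc_vp)
```

With a chain upper -> middle -> lower, toy_grphead() returns lower for any of the three entries, so TOY_NC_VP() reads the one vnode pointer that the whole group shares.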
- API was extended with the cache_grphead(ncp), cache_grphead_l(ncp),
cache_shadow_attach(ncp, sncp), cache_shadow_detach(ncp) functions. The
first two have already been explained; the latter two have
self-explanatory names.
(Eventually, cache_grphead_l might get ditched.)
For those functions where that's appropriate, the operation of the
function has been shifted to shadow group context. That is, you can keep on
doing "cache_setvp(ncp, vp)", and you *do not* have to type
"cache_setvp(cache_grphead(ncp), vp)". However, it's not ncp, but ncp's
shadow group that gets the vp, so making assertions like
"cache_setvp(ncp, vp); KKASSERT(ncp->nc_vp == vp)" would be a bad
idea in general. That is, direct attribute access has been broken;
other things work as before.
- How locking goes. Two kinds of locking mechanisms are present in the
code, but both utilize the same namecache attributes (nc_exlocks,
...). One is the usual namecache lock, to which the interface
consists of the cache_[un]lock() functions. As I said, these now
operate in shadow group context. The other is the namecache entry
interlock, used for ensuring consistency when moving within a shadow
group (with alteration or locking intent). For this the interface
consists of cache_[un]lock_one(), but that's kept private (static),
so the user doesn't have to care about interlocks.
In fact, cache_[un]lock_one are nothing else but the old
cache_[un]lock routines renamed thusly. The new cache_lock(ncp) does
* if ncp is a group head (doesn't have shadow association), done.
* else dereference ncp->nc_shadowed, cache_unlock_one(ncp) and start all
over again with ncp->nc_shadowed.
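The loop above can be sketched roughly as follows. This is a single-threaded toy: nc_exlocks is a plain counter, nothing sleeps, and nc_locktd is not modelled. The initial per-entry lock at the top of each iteration is an assumption inferred from the "cache_unlock_one(ncp) and start all over again" step; toy_lock, toy_lock_one and toy_unlock_one are hypothetical names, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ncp {
    int             nc_exlocks;   /* lock count; a real lock would also sleep */
    struct toy_ncp *nc_shadowed;  /* next lower entry, NULL on the head */
};

/* Stand-ins for the per-entry routines (the old cache_[un]lock). */
static void
toy_lock_one(struct toy_ncp *ncp)
{
    ncp->nc_exlocks++;
}

static void
toy_unlock_one(struct toy_ncp *ncp)
{
    assert(ncp->nc_exlocks > 0);
    ncp->nc_exlocks--;
}

/* Lock the shadow group of ncp; returns the head, which ends up locked. */
static struct toy_ncp *
toy_lock(struct toy_ncp *ncp)
{
    for (;;) {
        struct toy_ncp *next;

        toy_lock_one(ncp);          /* assumed first step: interlock ncp */
        next = ncp->nc_shadowed;
        if (next == NULL)
            return ncp;             /* group head reached: done */
        toy_unlock_one(ncp);        /* not the head: drop and step down */
        ncp = next;
    }
}
```

After toy_lock(&upper) on a two-entry chain, only the head holds a lock count; the upper entry was interlocked just long enough to read its nc_shadowed pointer.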
- nullfs as the poster boy shadow group application. nullfs has been rewritten
to utilize shadow groups.
* nullfs was made leaner and meaner: no private mount data is kept
around, our starting point is now mnt_ncp.
* nullfs passes through lower mount points. Eventually, this behaviour
should be tunable by a mount option.
[Footnote: the above two points mean that despite Matt's remarks in
nullfs keeps being a happy user of ncp->nc_mount.]
* Resolution means creating shadow associations (apart from
delegation to lower layer). Argument ncp will shadow the
similarly named child of its parent's shadowed namecache entry.
* Mostly, nullfs just delegates downwards. Apart from that and making
shadow attachments, there is one more activity sported by nullfs:
managing ancestry relations, as those are not subject to shadowing.
This affects one fs method: nrename.
* There were two problems with nullfs (cf. corecode's stabilization work,
- Recursed loops. When you stack null mounts, the lower mount has the
same op vector, therefore simply changing the op vector to that of
the lower mount will recurse into the given method until kernel stack
This is avoided by stepping down in the shadow chain:
ap->a_ncp = ap->a_ncp->nc_shadowed
This step-down has no other use or effect than avoiding recursion, as
we remain in the same shadow group, and n* fs methods affect shadow
group attributes (with the notable exception of nrename [where the
step-down also means "hey lower layer, readjust your ancestry
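A toy model of the step-down, to show why it bounds the recursion when null mounts are stacked. Everything here is hypothetical scaffolding: a one-slot "op vector" per mount, a two-field namecache stand-in, and dispatch through the lower entry's nc_mount, which is an assumption of this sketch rather than something the text spells out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ncp;
struct toy_args { struct toy_ncp *a_ncp; };
typedef int (*toy_op_t)(struct toy_args *ap);

struct toy_mount { toy_op_t nresolve; };   /* one-slot "op vector" */

struct toy_ncp {
    struct toy_mount *nc_mount;
    struct toy_ncp   *nc_shadowed;         /* NULL on the group head */
};

static int toy_null_calls;                 /* times the null layer ran */

/* Bottom layer: stands in for the real filesystem doing the work. */
static int
toy_leaf_nresolve(struct toy_args *ap)
{
    (void)ap;
    return 0;
}

/* Null layer: step down first, then dispatch on the lower entry's mount. */
static int
toy_null_nresolve(struct toy_args *ap)
{
    toy_null_calls++;
    assert(ap->a_ncp->nc_shadowed != NULL);
    ap->a_ncp = ap->a_ncp->nc_shadowed;    /* the step-down */
    return ap->a_ncp->nc_mount->nresolve(ap);
}
```

Because every pass through toy_null_nresolve() moves a_ncp one entry down the chain, two stacked null mounts over a leaf run the null method exactly twice before the leaf method is reached.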
- Cache incoherency. That's what the shadow group thingy addresses in
the first place. The flipflop test
is got right.
* Unsolved issues. It's not handled properly when the lower layer
gets unmounted. There are two possible behaviours:
- keep lower fs busy, don't let it go.
- let lower fs go; when it's gone, return ENXIO from calls.
Taking stickiness (pass through lower mounts or not) into account,
we have two by two, that is, four choices. Only the "fixed lower
fs, don't let it go" combination seems to have an unambiguous
implementation: vfs_busy the lower fs on null mount, vfs_unbusy
on null unmount.
I don't know how best to deal with the rest, so I
didn't do anything about it, except for a bit of duct tape:
null_nresolve checks for non-nullness of lower mp and its op vec.
That seems to be good enough to avoid simple panic scenarios, but
doesn't save us from deadlocking the fs.
I plead for ideas.