namecache coherency, 2nd turn

Csaba Henk csaba.henk at creo.hu
Tue Feb 7 12:58:00 PST 2006


Hi,

Here is a revamp of the namecache coherency code (and nullfs).

I tried to follow the ideas we sort of agreed on in the thread "a take at
cache coherency", but real life sometimes wanted it differently...

 - Diffstats:
 kern/vfs_cache.c         |  439 +++++++++++++++++++++++++++++++++++++++++++----
 sys/namecache.h          |   14 +
 vfs/nullfs/null.h        |   17 +
 vfs/nullfs/null_vfsops.c |  193 ++++++++++----------
 vfs/nullfs/null_vnops.c  |  174 ++++++++++++++++--
 5 files changed, 683 insertions(+), 154 deletions(-)

  I.e., I'm not touching anything I shouldn't.

  See the patch itself at

  http://leaf.dragonflybsd.org/mailarchive/submit/2006-02/msg00021.html

 - Locking: the code implements the ideas Corecode and Matt were
   suggesting: groups are lockable via each member (O(1) locking); each
   namecache structure carries a fallback copy of locking related data,
   but a separate pointer determines the locking entities actually used.

   I didn't introduce a separate "shadowinfo" structure type: I simply
   found no reason to partition the namecache structure, as it's still
   true that each namecache node carries exactly one copy of these. The
   only new field (apart from the pointer/list fields related to shadowing)
   is "nc_lockby", which points to the namecache node via which
   the current node is lockable.

   Locking is MPSAFE as far as I'm concerned, that is, these changes
   don't introduce more dependency on external serialization. As I
   realized, Matt's nice "fallback copy" idea made it possible to
   migrate the "lockpick" of a given node between two other ones in a
   way that is not visible to lock contenders. (So we don't need
   those ugly interlocks, regardless of GIANT.)
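
   To illustrate the idea, here is a minimal userland toy model (pthreads
   standing in for the kernel locking primitives, and made-up "toy_" names
   instead of the patch's actual structures or API):

        /*
         * Each node carries a fallback copy of the lock data, but
         * nc_lockby selects the entity actually used; if every member of
         * a shadow group points at one and the same node, locking via any
         * member locks the whole group in O(1).
         */
        #include <pthread.h>

        struct toy_ncp {
            pthread_mutex_t  lock;       /* fallback copy of lock data */
            struct toy_ncp  *nc_lockby;  /* node whose lock protects us */
        };

        static void
        toy_ncp_init(struct toy_ncp *ncp)
        {
            pthread_mutex_init(&ncp->lock, NULL);
            ncp->nc_lockby = ncp;   /* a lone node is its own lock owner */
        }

        static void
        toy_cache_lock(struct toy_ncp *ncp)
        {
            pthread_mutex_lock(&ncp->nc_lockby->lock);
        }

        static void
        toy_cache_unlock(struct toy_ncp *ncp)
        {
            pthread_mutex_unlock(&ncp->nc_lockby->lock);
        }

   (In this model, migrating a node's "lockpick" is just a matter of
   repointing nc_lockby while the group is held.)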

 - Group layout: I did *not* follow Matt's circular list idea. A group
   is still just a tree, with pointers toward parents (overlaid levels)
   and a list of kids (overlays). The basic kind of traversal I needed is
   traversing the subtree identified by a node (its root), and I didn't
   find circular lists helpful in this respect. The group tree is
   traversed iteratively with the "moving edge" algorithm [ad hoc name],
   i.e., the one which requires fixed storage (<previous node, current node>)
   and goes over each node twice. (You'll see a bit of bloat: the
   traversal routine supports some features I didn't end up using.
   This can be trimmed if those extras remain unused.)
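
   In case that description is too terse, here is a self-contained toy
   version of such a traversal (generic C with made-up "gnode" structures,
   not the routine in the patch):

        /*
         * "Moving edge" traversal of the subtree rooted at 'root': only a
         * <prev, cur> pair is stored, and each node is visited twice,
         * once on the way down (pre) and once on the way up (post).
         */
        #include <stdio.h>
        #include <stddef.h>

        struct gnode {
            struct gnode *parent;         /* overlaid (lower) level */
            struct gnode *first_child;    /* overlays */
            struct gnode *next_sibling;
            const char   *name;
        };

        static void
        traverse_group(struct gnode *root)
        {
            struct gnode *prev = root->parent;  /* edge enters from above */
            struct gnode *cur = root;

            while (cur != NULL) {
                struct gnode *next;

                if (prev == cur->parent) {
                    /* first visit, coming down from the parent */
                    printf("pre  %s\n", cur->name);
                    if (cur->first_child != NULL) {
                        next = cur->first_child;
                    } else {
                        printf("post %s\n", cur->name);
                        next = cur->parent;
                    }
                } else {
                    /* coming back up from the child 'prev' */
                    if (prev->next_sibling != NULL) {
                        next = prev->next_sibling;
                    } else {
                        /* last child done: second (post) visit */
                        printf("post %s\n", cur->name);
                        next = cur->parent;
                    }
                }
                if (cur == root && next == root->parent)
                    break;        /* we have left the subtree */
                prev = cur;
                cur = next;
            }
        }

        int
        main(void)
        {
            /* tiny example group:  a -> (b, c),  b -> (d) */
            struct gnode a = { NULL, NULL, NULL, "a" };
            struct gnode b = { &a,   NULL, NULL, "b" };
            struct gnode c = { &a,   NULL, NULL, "c" };
            struct gnode d = { &b,   NULL, NULL, "d" };

            a.first_child = &b;
            b.next_sibling = &c;
            b.first_child = &d;

            traverse_group(&a);
            return (0);
        }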

 - Synchronization via groups: here is the greatest difference from what
   we agreed on (or this is where my understanding went astray the furthest).

   Basically, the point is that groups are for synchronizing locks, and
   that negative events (more concretely, unresolution) are propagated
   upwards (and that's all: groups are used for nothing else).
   But within that, there are two alternatives:

   - Unstable groups: negative events blow up groups.
   
   - Stable groups: the namecache backend code *never* splits up groups
     (except at the end of a node's life cycle, during zapping), and each
     shadow association is expected to be torn at the layer where it was
     made.

   In http://leaf.dragonflybsd.org/mailarchive/kernel/2006-01/msg00117.html
   Matt says:

  % The shadow chain would NEVER be allowed to be stale.  It could be
  % partially unresolved, but the linkages cannot ever be stale (i.e.
  % cannot ever be incorrect).  This generally means that an unresolved
  % node needs to be disconnected from the chain.  The re-resolution of
  % the node would reconnect it.

   -- from which I conclude that he'd go for unstable groups. (Although it's
   true that he doesn't exactly specify where the disconnection should
   happen, so the contrary might just as well be true.)

   I started out implementing unstable groups, but ended up with
   stable groups, claiming that unstable groups are both unnecessary and
   a dead end.

   - Unnecessary: shadow associations can simply be torn where they
     were created: in the resolver routine. Propagating unresolution will
     yield control back to the resolver routine, even if group topology is
     left intact; and then the resolver can start its work by tearing
     the link, so that staleness (e.g., in rename) is avoided. (A sketch
     of this follows below, after the "dead end" argument.)

   - Dead end:
   
      ***
      Feel free to skip this if you find the "unnecessary"
      argument convincing enough.
      ***
     
     When blowing up a group, each part will acquire a self-contained
     locking state. It's the disconnector's responsibility to set those
     correctly, and it can't divine the necessary state without hints.
     Therefore I saw no solution other than keeping the local
     locking data of each node up to date (even if it is ignored by the
     cache_lock mechanism of groups), so that if the node becomes
     disconnected and self-contained, it is just-in-time equipped
     with the proper locking state.

     This implied a very funky locking interface: e.g., a typical
     null_nfoo() looked something like

        ...
        ap->a_ncp = ncp->nc_shadowed;
        cache_lock(ap->a_ncp);
        vop_nfoo_ap(ap);
        cache_unlock(ap->a_ncp);

     -- here the { ncp, ncp->nc_shadowed, ... } group *is* locked at the
     beginning, but ncp->nc_shadowed has to get an extra "personal" lock,
     because the group might fall apart during the vop_nfoo_ap() call-down,
     and it has to have its personal lock state set as expected by the
     lower layer. Etc. (I won't go into details regarding possible solutions
     to the problems arising at this point...)

     And there is another reason for falling into heavier locking: the
     unstable model means that we usually can't dereference an nc_shadowed
     member without locking, because someone else might destroy it at any
     time. With the stable model, the setter of nc_shadowed can have a
     more reliable preconception regarding the contents of this field
     (because she will be the unsetter as well).

     All in all, the unstable approach ended up in a programming model I
     couldn't make consistent. Each fixed panic resulted in seven new
     ones, like fighting the hydra :) With the stable approach,
     post-split state synchronization is always a local issue which
     can be handled trivially. With that, the causes of panics were little
     more than typos.
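
   To make the "unnecessary" argument above concrete, here is a sketch of
   the stable-group discipline in a layered resolver (made-up "sk_" names
   and trivial stand-ins for the real shadow attach/detach primitives; the
   real null_resolve() is in the patch):

        #include <stddef.h>

        struct sk_ncp {
            struct sk_ncp *nc_shadowed;   /* lower node we overlay */
            int            nc_unresolved;
        };

        /* trivial stand-ins for the real shadow group primitives */
        static void
        sk_shadow_detach(struct sk_ncp *ncp)
        {
            ncp->nc_shadowed = NULL;  /* the real one also fixes the group */
        }

        static void
        sk_shadow_attach(struct sk_ncp *ncp, struct sk_ncp *lower)
        {
            ncp->nc_shadowed = lower;
        }

        /*
         * Layered (nullfs-like) resolver: unresolution may have
         * propagated up to us with the group topology left intact, so we
         * begin by tearing the association we made earlier.  A stale
         * link (e.g. after a rename in the lower layer) is thus
         * discarded before we re-attach to 'lower', the lower-layer node
         * freshly looked up by the caller.
         */
        static int
        sk_null_resolve(struct sk_ncp *ncp, struct sk_ncp *lower)
        {
            if (ncp->nc_shadowed != NULL)
                sk_shadow_detach(ncp);      /* tear what we created */
            if (lower == NULL)
                return (1);                 /* cannot resolve now */
            sk_shadow_attach(ncp, lower);
            ncp->nc_unresolved = 0;
            return (0);
        }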
   
   - Deadlocks. I thought over Corecode's concerns regarding deadlock
     situations, and I think I succeeded in writing a deadlock-safe "shadow
     attach" routine.

   - Renaming. I rewrote cache_rename() for two reasons:
   
      - I felt it's not deadlock safe: cache_inval(tncp) is called
        with fncp locked, and cache_inval() does a lot of locking and
        unlocking. I didn't see any constraint which would save us from
        a deadlock on, e.g., fncp and some child of tncp... maybe it's
        just my short-sightedness.

      - When children are migrated, their overlays have to get
        unresolved. This requires them to be locked (unlike the migration
        itself). So the code had to be rewritten in a way that each
        child can get its fifteen minutes of fame.
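
     Schematically, the migration loop looks like the sketch below
     (made-up "rn_" names and no-op stand-ins for the real lock and
     unresolve primitives; not the rewritten cache_rename() itself):

        #include <stddef.h>

        struct rn_ncp {
            struct rn_ncp *parent;
            struct rn_ncp *first_child;
            struct rn_ncp *next_sibling;
        };

        /* no-op stand-ins for the real primitives */
        static void rn_lock(struct rn_ncp *ncp)   { (void)ncp; }
        static void rn_unlock(struct rn_ncp *ncp) { (void)ncp; }
        static void rn_unresolve_overlays(struct rn_ncp *ncp) { (void)ncp; }

        static void
        rn_migrate_children(struct rn_ncp *fncp, struct rn_ncp *tncp)
        {
            struct rn_ncp *child;

            while ((child = fncp->first_child) != NULL) {
                /*
                 * Each child gets its fifteen minutes of fame: it is
                 * locked individually, because unresolving its overlays
                 * requires the lock (the mere migration would not).
                 */
                rn_lock(child);
                rn_unresolve_overlays(child);

                /* unhook from fncp, hook under tncp */
                fncp->first_child = child->next_sibling;
                child->next_sibling = tncp->first_child;
                tncp->first_child = child;
                child->parent = tncp;

                rn_unlock(child);
            }
        }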

   - Nullfs. Except for null_resolve(), each method is still just a simple
     delegation downwards. Multilayer nullfs mounts and renaming seem to work
     fine. You can still get a crash by unmounting midlayers.
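
     For contrast with the unstable-model fragment quoted above, such a
     delegation needs no extra "personal" lock on nc_shadowed under the
     stable model; schematically (made-up "nf_" names and types, not the
     actual nullfs vop glue):

        struct nf_ncp {
            struct nf_ncp *nc_shadowed;
        };

        struct nf_nfoo_args {
            struct nf_ncp *a_ncp;
        };

        /* stand-in for dispatching the op to the lower layer */
        static int
        nf_vop_nfoo_ap(struct nf_nfoo_args *ap)
        {
            (void)ap;
            return (0);
        }

        /*
         * The group { ncp, ncp->nc_shadowed, ... } is already locked via
         * the upper node, and stable groups guarantee it won't fall
         * apart underneath us, so nc_shadowed can be substituted and the
         * call handed down directly.
         */
        static int
        nf_null_nfoo(struct nf_nfoo_args *ap)
        {
            ap->a_ncp = ap->a_ncp->nc_shadowed;
            return (nf_vop_nfoo_ap(ap));
        }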
     
I'm a bit confused about the purpose of kernel@ and submit@. Is it OK to
do as I tend to, discussion on kernel@ and code on submit@? Or should I
keep the discussion on submit@ as well?

Csaba




