a take at cache coherency

Mon Jan 23 03:08:01 PST 2006

Hi,

I sent a patch to submit@ implementing a kind of cache coherency:

http://leaf.dragonflybsd.org/mailarchive/submit/2006-01/msg00034.html

Pros:
  Lightweight, in particular

  - Unidirectional synchronization. Only upper layers keep a reference to
    lower ones; a lower ("shadowed") layer doesn't need to be aware of
    being overlaid.

  - No extra locking is used; rather, the semantics of cache locking are
    tweaked a bit.

Cons:

  - Namecache API got messed up a bit, too. Some attributes can't be
    accessed directly: you can't use "ncp->nc_vp", you need to do
    "cache_grphead(ncp)->nc_vp". Eventually some kind of wrapper should
    be made around namecache entries, so that you could write e.g.
    "NC_VP(ncp)". I haven't done this at this stage.

How:

  - The basic semantic unit of the cache layer was a namecache entry.
    Now there are two basic units: namecache entry and shadow group (of
    namecache entries).

  - Some attributes of namecache entries were transformed into shadow
    group attributes: these are

      nc_vp, nc_error, nc_timeout, nc_exlocks, nc_locktd, and some of
      the flags:
        NCF_LOCKED, NCF_WHITEOUT, NCF_UNRESOLVED, NCF_ISSYMLINK, NCF_ISDIR,
        NCF_ISDESTROYED
      -- i.e., attributes which refer to the underlying vnode, and locking.

    The rest remained entry attributes -- most notably name and ancestral
    relations.

    Hence there is no wondering like "umm, I locked/unlocked/resolved
    the overlaying/overlaid namecache entry, now how to ensure that the
    one under/over is in a similar state?" You lock, unlock, and resolve
    shadow groups, not namecache entries, period.

  - Shadow groups are abstract things: there is no dedicated "struct xyz"
    for representing them. Instead, namecache entries got one additional
    field, nc_shadowed. Shadow groups are the connected components of
    the nc_shadowed graph (the one spanned by <ncp, ncp->nc_shadowed>
    edges). Each such group is a tree, so it can be represented via
    the head (root, tip, node with no outgoing edge) of the tree. Hence
    group attributes are kept by the head of the tree. For each
    namecache entry, the cache_grphead(ncp) and cache_grphead_l(ncp)
    functions return the head of the shadow group ncp belongs to (the
    latter expects the group of ncp to be locked). So read the above
    form "cache_grphead(ncp)->nc_vp" as "get the associated vnode of the
    shadow group of ncp".
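
    A minimal user-space sketch of that walk, assuming a mock struct
    namecache that carries only the two fields discussed here. The real
    structure is much larger, and the patch's cache_grphead() presumably
    also has to deal with interlocking, which is left out. NC_VP() is
    the hypothetical wrapper mentioned under Cons, not something the
    patch provides:

```c
#include <assert.h>
#include <stddef.h>

struct vnode { int v_dummy; };          /* mock, stands in for the real vnode */

struct namecache {                      /* mock: only the fields used here */
    struct namecache *nc_shadowed;      /* next entry towards the head, or NULL */
    struct vnode     *nc_vp;            /* meaningful on the group head only */
};

/* Follow nc_shadowed edges until we reach the entry with no outgoing edge. */
static struct namecache *
cache_grphead(struct namecache *ncp)
{
    assert(ncp != NULL);
    while (ncp->nc_shadowed != NULL)
        ncp = ncp->nc_shadowed;
    return ncp;
}

/* Hypothetical accessor: "the vnode of the shadow group of ncp". */
#define NC_VP(ncp) (cache_grphead(ncp)->nc_vp)
```

    With a chain upper -> mid -> lower, all three entries yield lower
    as their group head, and NC_VP() on any of them reads lower's nc_vp.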

  - The API was extended with the cache_grphead(ncp), cache_grphead_l(ncp),
    cache_shadow_attach(ncp, sncp), cache_shadow_detach(ncp) functions.  The
    first two have already been explained, the latter two have self-explanatory
    names. (Eventually, cache_grphead_l might get ditched.)

    For those functions where that's appropriate, the operation of the
    function has been shifted to shadow group context. That is, you can keep on
    doing "cache_setvp(ncp, vp)", and you *do not* have to type
    "cache_setvp(cache_grphead(ncp), vp)". However, it's not ncp but ncp's
    shadow group that gets the vp, so making assertions like
    "cache_setvp(ncp, vp); KKASSERT(ncp->nc_vp == vp)" would be a bad
    idea in general. That is, direct attribute access has been broken;
    other things work as they used to.
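
    To illustrate why the direct assertion goes wrong, here is a toy
    model of a group-context setter. The struct is a mock, the body of
    cache_setvp() is reduced to the one assignment that matters here
    (the real function does more), and group_has_vp() is a hypothetical
    helper added only for the example:

```c
#include <assert.h>
#include <stddef.h>

struct vnode { int v_dummy; };          /* mock, stands in for the real vnode */

struct namecache {                      /* mock: only the fields used here */
    struct namecache *nc_shadowed;
    struct vnode     *nc_vp;
};

static struct namecache *
cache_grphead(struct namecache *ncp)
{
    while (ncp->nc_shadowed != NULL)
        ncp = ncp->nc_shadowed;
    return ncp;
}

/* Group-context setter: the vnode lands on the group head, not on ncp. */
static void
cache_setvp(struct namecache *ncp, struct vnode *vp)
{
    cache_grphead(ncp)->nc_vp = vp;
}

/* Hypothetical helper: the check that stays valid for shadowing entries. */
static int
group_has_vp(struct namecache *ncp, struct vnode *vp)
{
    return cache_grphead(ncp)->nc_vp == vp;
}
```

    After cache_setvp(upper, vp) on an entry that shadows another one,
    upper's own nc_vp is untouched, while group_has_vp(upper, vp) holds.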

  - How locking goes. Two kinds of locking mechanisms are present in the
    code, but both utilize the same namecache attributes (nc_exlocks,
    ...). One is the usual namecache lock, whose interface
    consists of the cache_[un]lock() functions. As I said, these now
    operate in shadow group context. The other is the namecache entry
    interlock, used for ensuring consistency when moving within a shadow
    group (with alteration or locking intent). For this the interface
    consists of cache_[un]lock_one(), but that's kept private (static);
    the user doesn't have to care about interlocks.

    In fact, cache_[un]lock_one are nothing else but the old
    cache_[un]lock routines renamed thusly. The new cache_lock(ncp) does
    the following:

      * cache_lock_one(ncp).
      * if ncp is a group head (doesn't have shadow association), done.
      * else dereference ncp->nc_shadowed, cache_unlock_one(ncp) and start all
        over again with ncp->nc_shadowed.
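
    The three steps above can be sketched in user space as follows.
    This is a single-threaded toy, not the patch code: the per-entry
    lock is modelled as a bare counter in nc_exlocks, with no sleeping
    and no owner tracking (nc_locktd is omitted), so it only shows the
    order in which entries are locked and released:

```c
#include <assert.h>
#include <stddef.h>

struct namecache {                      /* mock: only the fields used here */
    struct namecache *nc_shadowed;
    int               nc_exlocks;       /* toy lock count, no owner tracking */
};

/* Stand-ins for the old per-entry lock routines (now the interlock). */
static void
cache_lock_one(struct namecache *ncp)
{
    ncp->nc_exlocks++;
}

static void
cache_unlock_one(struct namecache *ncp)
{
    assert(ncp->nc_exlocks > 0);
    ncp->nc_exlocks--;
}

/* Lock the shadow group of ncp: only the group head ends up locked. */
static void
cache_lock(struct namecache *ncp)
{
    for (;;) {
        struct namecache *next;

        cache_lock_one(ncp);
        if (ncp->nc_shadowed == NULL)
            return;                     /* group head reached, keep it locked */
        next = ncp->nc_shadowed;        /* read the link under the interlock */
        cache_unlock_one(ncp);
        ncp = next;                     /* start over one step further down */
    }
}
```

    Each intermediate entry is held only long enough to read its
    nc_shadowed link; when cache_lock() returns, the group head is the
    only entry still locked.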

  - nullfs as the poster boy shadow group application. nullfs has been rewritten
    to utilize shadow groups.

     * nullfs was made leaner and meaner: no private mount data is kept
       around, our starting point is now mnt_ncp.

     * nullfs passes through lower mount points. Eventually, this behaviour
       should be tunable by a mount option.

     [Footnote: the above two points mean that despite Matt's remarks in
     http://leaf.dragonflybsd.org/mailarchive/kernel/2006-01/msg00023.html,
     nullfs keeps being a happy user of ncp->nc_mount.]

     * Resolution means creating shadow associations (apart from
       delegation to the lower layer). The ncp argument will shadow the
       similarly named child of its parent's shadowed namecache entry.

     * Mostly, nullfs just delegates downwards. Apart from that and making
       shadow attachments, there is one more activity sported by nullfs:
       managing ancestry relations, as those are not subject to shadowing.
       This affects one fs method: nrename.

     * There were two problems with nullfs (cf. corecode's stabilization work,
       http://leaf.dragonflybsd.org/mailarchive/kernel/2006-01/msg00018.html).

        - Recursed loops. When you stack null mounts, the lower mount has the
          same op vector, therefore simply changing the op vector to that of
          the lower mount will recurse into the given method until kernel stack
          gets exhausted.

          This is avoided by stepping down in the shadow chain:

           ap->a_ncp = ap->a_ncp->nc_shadowed

          This step-down has no other use or effect than avoiding recursion, as
          we remain in the same shadow group, and n* fs methods affect shadow
          group attributes (with the notable exception of nrename [where the
          step-down also means "hey lower layer, readjust your ancestry
          relations"]).

        - Cache incoherency. That's what the shadow group thingy addresses in
          the first place. The flipflop test
          (http://leaf.dragonflybsd.org/mailarchive/kernel/2006-01/msg00008.html)
          is handled correctly.
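
       Going back to the recursion problem, here is a toy model of why
       the step-down terminates even when stacked null mounts share one
       op vector. All names apart from a_ncp, nc_shadowed and nc_mount
       are made up for the illustration (mnt_op, a_steps, null_nop,
       real_nop), and the real dispatch through the VFS op vectors is
       more involved:

```c
#include <assert.h>
#include <stddef.h>

struct nop_args;
typedef int (*nop_fn)(struct nop_args *);

struct mount { nop_fn mnt_op; };        /* mock: one fs method per mount */

struct namecache {                      /* mock: only the fields used here */
    struct namecache *nc_shadowed;
    struct mount     *nc_mount;
};

struct nop_args {
    struct namecache *a_ncp;
    int               a_steps;          /* hypothetical: counts step-downs */
};

/* Method of the underlying "real" filesystem: does the actual work. */
static int
real_nop(struct nop_args *ap)
{
    return ap->a_steps;
}

/* Method shared by every null mount: step down, then delegate. */
static int
null_nop(struct nop_args *ap)
{
    assert(ap->a_ncp->nc_shadowed != NULL);
    ap->a_ncp = ap->a_ncp->nc_shadowed; /* the step-down */
    ap->a_steps++;
    return ap->a_ncp->nc_mount->mnt_op(ap);
}
```

       Every call consumes one link of the shadow chain, so with two
       stacked null mounts null_nop runs twice and then reaches
       real_nop, instead of re-entering itself on the same entry.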

    * Unsolved issues. Unmounting of the lower layer is not handled
      properly. There are two possible behaviours:

        - keep lower fs busy, don't let it go.
        - let lower fs go; when it's gone, return ENXIO from calls.

      Taking stickiness (pass through lower mounts or not) into account,
      we have two by two, that is, four choices. Only the "fixed lower
      fs, don't let it go" combination seems to have an unambiguous
      implementation: vfs_busy the lower fs on null mount, vfs_unbusy
      on null unmount.

      I don't know what the best way to deal with the rest would be, so I
      didn't do anything about it, except for a bit of duct tape:
      null_nresolve checks for non-nullness of the lower mp and its op vec.
      That seems to be good enough to avoid simple panic scenarios, but
      doesn't save us from deadlocking the fs.

      I plead for ideas.

Regards,
Csaba