More thoughts on tokens

Matthew Dillon dillon at apollo.backplane.com
Mon Feb 23 11:51:56 PST 2004


    I've been doing a comprehensive review of how LK_NOWAIT is used with
    locks on DragonFly.  It turns out that it is only used in about two
    dozen files, and most of that use is related to buffer cache operation
    (softupdates being the most egregious misuser).

    The interlock/race issue with most uses of LK_NOWAIT is that code
    which uses LK_NOWAIT does not expect to be switched out to another
    thread during, e.g., a lockmgr() operation, and so all the code in
    question makes assumptions about the stability of the structure it
    is working on.  Here's an example from the buffer cache code:


        if ((bp = gbincore(vp, blkno))) {
                /*
                 * Buffer is in-core.  If the buffer is not busy, it must
                 * be on a queue.
                 */

                if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT)) {
                        if (BUF_TIMELOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL,
                            "getblk", slpflag, slptimeo) == ENOLCK)
                                goto loop;
                        splx(s);
                        return (struct buf *) NULL;
                }

                /*
                 * The buffer is locked.  B_CACHE is cleared if the buffer is
                 * ...
                 */
                ...

    gbincore() is not locking the returned buffer.  The code above depends
    on code atomicity (i.e. the Big Giant Lock being held and no thread
    switches occurring) between the call to gbincore() and the call to
    BUF_LOCK(..., LK_NOWAIT).  This is a good example of why I have to have
    the token code spin for the moment instead of yield.

    The solution is equally obvious... and that is to pass a token
    reference structure to [gb]incore() and for [gb]incore() to return the
    buffer with its token held.  e.g.

	lwkt_tokref	ilock;

        if ((bp = gbincore(vp, &ilock, blkno))) {
		...
                if (BUF_LOCK(bp, &ilock, LK_EXCLUSIVE | LK_NOWAIT)) {
		...
	}

    This way the token code would be able to yield without breaking the
    atomicity between the gbincore() call and the BUF_LOCK() call above.
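
    To make the idea concrete, here is a minimal sketch of what a
    token-aware gbincore() could look like.  This is illustrative only,
    not an actual patch: the lwkt_gettoken()/lwkt_reltoken() calls
    operating on the caller's reference structure, the v_token field,
    and the simple list walk (the real code uses a buffer hash) are all
    assumptions.

        struct buf *
        gbincore(struct vnode *vp, lwkt_tokref_t ilock, daddr_t blkno)
        {
                struct buf *bp;

                /*
                 * Acquire the vnode's token through the caller-supplied
                 * reference.  The token code is free to yield here without
                 * breaking the caller's atomicity assumptions, because the
                 * caller does not yet depend on any structure state.
                 */
                lwkt_gettoken(ilock, &vp->v_token);

                TAILQ_FOREACH(bp, &vp->v_cleanblkhd, b_vnbufs) {
                        if (bp->b_lblkno == blkno)
                                return (bp);    /* token ref stays held */
                }
                TAILQ_FOREACH(bp, &vp->v_dirtyblkhd, b_vnbufs) {
                        if (bp->b_lblkno == blkno)
                                return (bp);    /* token ref stays held */
                }
                lwkt_reltoken(ilock);           /* miss: drop the token */
                return (NULL);
        }

    BUF_LOCK() would then accept the same reference and dispose of it once
    the buffer lock is secured, so the caller never needs to know what the
    interlock actually is.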


	    [ If you love FreeBSD-5's mutexes, stop reading here ]

    In contrast, FreeBSD-5's mutex solution is as follows:

	VI_LOCK(vp);
	if ((bp = gbincore(vp, blkno))) {
                int lockflags;
                /*
                 * Buffer is in-core.  If the buffer is not busy, it must
                 * be on a queue.
                 */
                lockflags = LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK;

                if (flags & GB_LOCK_NOWAIT)
                        lockflags |= LK_NOWAIT;

                error = BUF_TIMELOCK(bp, lockflags,
                    VI_MTX(vp), "getblk", slpflag, slptimeo);
		...
	}

    This is a pretty good example of how NOT to implement an interlock API,
    so I'll rail on FreeBSD for a moment here:

	* All users of [gb]incore() have to explicitly obtain the vnode's 
	  mutex as an interlock.

	  In my version, the interlock that users of [gb]incore() obtain
	  is under the control of [gb]incore(), which gives us the
	  flexibility to implement it either as a vnode-embedded token or
	  as a buffer-embedded token, without having to pollute the higher
	  level code with knowledge of which one is used.

	* The call obtaining the interlock has been separated out from 
	  gbincore() and the call passing the interlock to BUF_TIMELOCK()
	  is not directly linked to the VI_LOCK() call that obtained it.

	  In my version gbincore() loads the supplied interlock reference
	  structure and this same structure is required to be passed to
	  BUF_*(), so there is no knowledge pollution in the higher level
	  code at all as to what the interlock actually is.

    In any case, I know why FreeBSD has done things this way... it's
    because of the mutex model they are using, which I hate so much.
    Since lock ordering is an issue with mutexes, most of FreeBSD's
    codebase needs to have very specific knowledge of the exact mutex(es)
    being manipulated throughout the entire call stack in order to avoid
    lock order reversals.  Also, FreeBSD mutexes have to be released
    across blocking situations, and FreeBSD mutexes cannot be used
    recursively (at least not safely).  This means that lower level
    procedures either must have complete knowledge of the mutex(es) held
    by the higher level procedures that call into them, or must be
    guaranteed to be non-blocking.  And, worse, when you *DO* obtain
    multiple mutexes in the FreeBSD-5 model, a deep yield makes it
    impossible for any other thread to obtain the higher level mutexes.
    In fact, for all intents and purposes a mutex in FreeBSD is almost
    exactly the same as a normal lock in its effect.
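
    For readers who have not hit one, a lock order reversal is easy to
    demonstrate with a trivial userspace analogue (pthread mutexes standing
    in for kernel mutexes; this is an illustration, not FreeBSD code):

        #include <pthread.h>

        static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

        /* Thread 1 takes A, then B. */
        static void *
        thread1(void *arg)
        {
                pthread_mutex_lock(&A);
                pthread_mutex_lock(&B);   /* waits if thread2 holds B */
                pthread_mutex_unlock(&B);
                pthread_mutex_unlock(&A);
                return (NULL);
        }

        /* Thread 2 takes B, then A: the reversed order. */
        static void *
        thread2(void *arg)
        {
                pthread_mutex_lock(&B);
                pthread_mutex_lock(&A);   /* deadlock: thread1 holds A
                                             and is waiting on B */
                pthread_mutex_unlock(&A);
                pthread_mutex_unlock(&B);
                return (NULL);
        }

    Avoiding this requires every acquisition path to agree on a single
    global ordering, which is exactly why so much of the FreeBSD call
    stack has to know which mutexes its callers already hold.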

    We do not have that problem.  Our tokens do not have any (theoretical)
    lock order reversal issues (theoretical because my initial
    implementation can't guarantee that a livelock won't happen), our
    tokens can be held across blocking conditions, multiple procedural
    levels can obtain the same token independently (recursively) without
    specific knowledge of who is holding what, and a deep blocking or
    yield situation in DFly does not prevent other threads from obtaining
    the same serializing tokens while the first one is blocked, because
    our tokens are run-time serializing entities only, not lockmgr()-style
    (run-time and block-time serializing) locks.
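
    The practical difference shows up around blocking calls.  Here is a
    rough sketch of the semantics, assuming the tokref API above (tsleep()
    is the standard BSD blocking primitive; the function name and sleep
    channel are made up for the example):

        static int wchan;               /* dummy sleep channel */

        void
        token_sleep_example(lwkt_token_t tok)
        {
                lwkt_tokref ilock;

                lwkt_gettoken(&ilock, tok);
                /*
                 * tsleep() blocks.  The scheduler releases the token for
                 * the duration of the sleep (other threads may acquire it
                 * in the meantime) and reacquires it before this thread
                 * resumes.
                 */
                tsleep(&wchan, 0, "tokdemo", hz);
                /*
                 * Still holding the token, but any token-protected state
                 * cached before the sleep must be revalidated here.
                 */
                lwkt_reltoken(&ilock);
        }

    In other words, a token guarantees exclusion only between running
    threads; it is not a sleep lock.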

    So, I really do believe that our serializing token model is the better
    way to go.  At the very least, the use of a reference structure is
    a far superior abstraction to the direct use of tokens or mutexes
    (what I had before and what FreeBSD has now).

    It's a bit sad that FreeBSD has tied itself into such knots.  With some
    work they could change their mutex model into something similar to our
    serializing token model simply by recording the held mutexes and
    then allowing them to be held through a blocking condition (releasing
    them on switch-out and re-acquiring them on switch-in), along with
    getting rid of that utterly insane preemptive kernel thread switching
    model and preemptive cpu switching model.  Just changing the internal
    mechanism is not enough, though, because they've already made major
    API changes that pretty much lock them into the current model.  They
    would have to first admit that they have put themselves into a corner,
    then fix the API abstractions, before they can actually fix the core
    mutex abstraction.
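
    If FreeBSD ever wanted to try it, the core mechanism might look
    something like the sketch below.  Everything here is hypothetical
    (none of these functions or structures exist in FreeBSD); it merely
    illustrates recording held mutexes and cycling them across a context
    switch:

        #define MTX_MAXHELD     8

        struct mtx_record {
                struct mtx *held[MTX_MAXHELD];  /* in acquisition order */
                int         count;
        };

        /* As a thread switches out: drop everything it holds. */
        static void
        switchout_release(struct mtx_record *rec)
        {
                int i;

                for (i = rec->count - 1; i >= 0; --i)
                        mtx_unlock(rec->held[i]);       /* reverse order */
        }

        /* Before the thread runs again: take it all back. */
        static void
        switchin_reacquire(struct mtx_record *rec)
        {
                int i;

                for (i = 0; i < rec->count; ++i)
                        mtx_lock(rec->held[i]);         /* original order */
        }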

					-Matt





