More thoughts on tokens
Matthew Dillon
dillon at apollo.backplane.com
Mon Feb 23 11:51:56 PST 2004
I've been doing a comprehensive review of how LK_NOWAIT is used with
locks on DragonFly. It turns out that it is only used in about two
dozen files and most of it is related to buffer cache operation
(softupdates being the most egregious misuser).
The interlock/race issue with most of the uses of LK_NOWAIT is that
code which uses LK_NOWAIT does not expect to be switched out to
another thread during, e.g., a lockmgr() operation, and so all the
code in question seems to make assumptions about the stability
of the structure it is working on. Here's an example from the buffer
cache code:
    if ((bp = gbincore(vp, blkno))) {
        /*
         * Buffer is in-core.  If the buffer is not busy, it must
         * be on a queue.
         */
        if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_NOWAIT)) {
            if (BUF_TIMELOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL,
                    "getblk", slpflag, slptimeo) == ENOLCK)
                goto loop;
            splx(s);
            return (struct buf *) NULL;
        }
        /*
         * The buffer is locked.  B_CACHE is cleared if the buffer is
         *...
    ...
gbincore() is not locking the returned buffer. The code above depends
on code atomicity (i.e. the Big Giant Lock being held and no thread switches
occurring) between the call to gbincore() and the call to
BUF_LOCK(...,LK_NOWAIT). This is a good example of why I have to have
the token code spin for the moment instead of yielding.
The solution is equally obvious... and that is to pass a token
reference structure to [gb]incore() and for [gb]incore() to return the
buffer with its token held. e.g.
    lwkt_tokref ilock;

    if ((bp = gbincore(vp, &ilock, blkno))) {
        ...
        if (BUF_LOCK(bp, &ilock, LK_EXCLUSIVE | LK_NOWAIT)) {
            ...
        }
    }
This way the token code would be able to yield without breaking the
atomicity between the gbincore() call and the BUF_LOCK() call above.
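To make that concrete, the callee side could look roughly like the sketch
below.  This is only a sketch of the idea, not the real implementation:
the per-vnode v_token field and the buf_hash_lookup() helper are names
I'm assuming purely for illustration.

    /*
     * Sketch only -- not the actual implementation.  Look up the in-core
     * buffer for (vp, blkno) and return it with the associated serializing
     * token held on the caller's behalf through the supplied reference.
     * v_token and buf_hash_lookup() are assumed names for illustration.
     */
    struct buf *
    gbincore(struct vnode *vp, lwkt_tokref_t ilock, daddr_t blkno)
    {
        struct buf *bp;

        lwkt_gettoken(ilock, &vp->v_token);     /* acquire the interlock token */
        bp = buf_hash_lookup(vp, blkno);        /* hypothetical hash lookup */
        if (bp == NULL)
            lwkt_reltoken(ilock);               /* no buffer found, drop it */
        return (bp);
    }

BUF_LOCK() would then take the same reference and release the token once
it has the buffer locked (or has failed), so the higher level code never
needs to know whether the token lives in the vnode or in the buffer.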
[ If you love FreeBSD-5's mutexes, stop reading here ]
In contrast, FreeBSD-5's mutex solution is as follows:
    VI_LOCK(vp);
    if ((bp = gbincore(vp, blkno))) {
        int lockflags;
        /*
         * Buffer is in-core.  If the buffer is not busy, it must
         * be on a queue.
         */
        lockflags = LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK;

        if (flags & GB_LOCK_NOWAIT)
            lockflags |= LK_NOWAIT;

        error = BUF_TIMELOCK(bp, lockflags,
            VI_MTX(vp), "getblk", slpflag, slptimeo);
        ...
    }
This is a pretty good example of how NOT to implement an interlock API,
so I'll rail on FreeBSD for a moment here:
* All users of [gb]incore() have to explicitly obtain the vnode's
  mutex as an interlock.

  In my version, the interlock that users of [gb]incore() obtain is
  under the control of [gb]incore(), which gives us the flexibility
  to implement it either as a vnode-embedded token or as a
  buffer-embedded token without having to pollute the higher level
  code with knowledge of which one.
* The call obtaining the interlock has been separated out from
  gbincore(), and the call passing the interlock to BUF_TIMELOCK()
  is not directly linked to the VI_LOCK() call that obtained it.

  In my version gbincore() loads the supplied interlock reference
  structure and this same structure is required to be passed to
  BUF_*(), so there is no knowledge pollution in the higher level
  code at all as to what the interlock actually is.
In any case, I know why FreeBSD has done things this way... it's because
of the mutex model they are using which I hate so much. Since lock
ordering is an issue with mutexes most of FreeBSD's codebase needs to
have very specific knowledge of the exact mutex(es) being manipulated
throughout the entire call stack in order to avoid lock order reversals.
Also FreeBSD mutexes have to be released across blocking situations,
and FreeBSD mutexes cannot be used recursively (at least not safely).
This means that lower level procedures either must have complete
knowledge about the mutex(es) held by higher level procedures that call
into them, or must be guaranteed to be non-blocking. And, worse, when
you *DO* obtain multiple mutexes in the FreeBSD-5 model a deep yield
makes it impossible for any other thread to obtain the higher level
mutexes. In fact, for all intents and purposes a mutex in FreeBSD
is almost exactly the same as a normal lock in its effect.
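To show the ordering hazard in its simplest form, here is a generic
userland illustration using POSIX mutexes; it is obviously not FreeBSD
kernel code, just the reversal pattern reduced to a toy program:

    /*
     * Generic illustration of a lock order reversal, using userland
     * POSIX mutexes rather than kernel code.  Each thread takes the
     * two mutexes in the opposite order; with unlucky timing both
     * block forever.  Compile with: cc -pthread reversal.c
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t vnode_mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t buf_mtx = PTHREAD_MUTEX_INITIALIZER;

    static void *
    thread_a(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&vnode_mtx);
        sleep(1);                          /* widen the race window */
        pthread_mutex_lock(&buf_mtx);      /* blocks: B holds buf_mtx */
        printf("A got both locks\n");
        pthread_mutex_unlock(&buf_mtx);
        pthread_mutex_unlock(&vnode_mtx);
        return (NULL);
    }

    static void *
    thread_b(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&buf_mtx);
        sleep(1);
        pthread_mutex_lock(&vnode_mtx);    /* blocks: A holds vnode_mtx */
        printf("B got both locks\n");
        pthread_mutex_unlock(&vnode_mtx);
        pthread_mutex_unlock(&buf_mtx);
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);             /* deadlock: never returns */
        pthread_join(b, NULL);
        return (0);
    }

The only cure in a mutex model is a global acquisition order that every
caller in the stack knows about, which is exactly the knowledge pollution
described above.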
We do not have that problem. Our tokens do not have any (theoretical)
lock order reversal issues (theoretical because my initial
implementation can't guarantee that a livelock won't happen), our
tokens can be held across blocking conditions, multiple procedural
levels can obtain the same token independently (recursively) without
specific knowledge of who is holding what, and a deep blocking or yield
situation in DFly does not prevent other threads from obtaining the
same serializing tokens while the first one is blocked because our
tokens are run-time serializing entities only, not lockmgr()-style
(run-time and block-time serializing) locks.
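As a rough sketch of the recursion property, using the same tokref calling
convention as the gbincore() example above (the shared token and the exact
signatures here are assumptions for illustration):

    /*
     * Sketch only.  Two procedural levels each take their own reference
     * on the same token without knowing about each other, and either of
     * them may block while holding it; the token is simply re-acquired
     * when the blocked thread resumes.
     */
    static struct lwkt_token demo_token;   /* hypothetical shared token,
                                             * assumed initialized elsewhere */
    static void
    lower_level(void)
    {
        lwkt_tokref ref;

        lwkt_gettoken(&ref, &demo_token);   /* recursive acquisition is fine */
        /* ...may tsleep() here; the token is revalidated on wakeup... */
        lwkt_reltoken(&ref);
    }

    static void
    higher_level(void)
    {
        lwkt_tokref ref;

        lwkt_gettoken(&ref, &demo_token);
        lower_level();                      /* needs no knowledge of our token */
        lwkt_reltoken(&ref);
    }

Because the token only serializes threads that are actually running,
higher_level() blocking inside lower_level() does not prevent some other
thread from obtaining the same token in the meantime.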
So, I really do believe that our serializing token model is the better
way to go. At the very least, the use of a reference structure is
a far superior abstraction than the direct use of tokens or mutexes
(what I had before and what FBsd has now).
It's a bit sad that FreeBSD has tied itself into such knots. With some
work they could change their mutex model into something similar to our
serializing token model simply by recording the held mutexes and
then allowing them to be held through a blocking condition (releasing
them on switch-out and re-acquiring them on switch-in), along with
getting rid of that utterly insane preemptive kernel thread switching
model and preemptive cpu switching model. Just changing the internal
mechanism is not enough, though, because they've already made major
API changes that pretty much lock them into the current model. They
would have to first admit that they have put themselves into a corner,
then fix the API abstractions, before they can actually fix the core
mutex abstraction.
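The switch-time mechanics they would need are conceptually tiny.  A minimal
sketch, assuming a per-thread list of held references (td_toks, tr_next and
the two helpers are made-up names, not actual DragonFly or FreeBSD code):

    /*
     * Sketch only.  On switch-out, drop every recorded reference so other
     * threads can use the tokens; on switch-in, re-acquire them all before
     * the thread resumes.  All names are assumed for illustration.
     */
    static void
    token_switchout(struct thread *td)
    {
        lwkt_tokref_t ref;

        for (ref = td->td_toks; ref != NULL; ref = ref->tr_next)
            token_drop(ref);                /* hypothetical helper */
    }

    static void
    token_switchin(struct thread *td)
    {
        lwkt_tokref_t ref;

        for (ref = td->td_toks; ref != NULL; ref = ref->tr_next)
            token_reacquire(ref);           /* may spin until available */
    }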
-Matt