lkwt in DragonFly
Matthew Dillon
dillon at apollo.backplane.com
Sat Feb 7 18:04:14 PST 2004
: How do you determine whether another CPU holds the token when your
: thread on the current cpu wants/needs it? Or do you always
: synchronously IPI the other cpu when you want a token for writing?
The cpu owning the token is stored in the token structure. No bus-locked
instruction is required to test (tok->gd == mycpu) because only we can
give a token away to another cpu, which synchronously updates our own
cpu's cache to set tok->gd != mycpu. Likewise, only the cpu owning the
token can hand it off to us, which means that tok->gd will only read back
as == mycpu if another cpu has handed the token off to us, and once that
has happened only our cpu can give it up again.
    struct lwkt_token {
        globaldata_t gd;        /* cpu owning token */
        ...
    };
So, obtaining a token in the cached case could be as simple as this:

    lwkt_gettoken(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        if (gd->gd_cachetok) {
            /* deal with a previously fast-cached token */
        }
        gd->gd_cachetok = tok;          /* interlock against IPI */
        if (tok->gd == gd)              /* optimal case, done */
            return;
        ... fall through to less optimal code ...
        ... leave gd_cachetok set to tok indicating that we want the token ...
        send an IPI to tok->gd requesting 'tok'.
        ...
    }
    lwkt_reltoken(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        if (gd->gd_cachetok == tok) {
            gd->gd_cachetok = NULL;
            return;
        }
        ... less optimal code ...
    }
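
For illustration only, a hypothetical caller (the structure and field
names below are made up for the example, not an actual DFly interface)
would simply bracket its accesses with the pair above:

    struct xdata {
        struct lwkt_token x_token;      /* serializes access to x_count */
        int               x_count;
    };

    void
    xdata_bump(struct xdata *xp)
    {
        lwkt_gettoken(&xp->x_token);    /* cheap when cached on this cpu */
        ++xp->x_count;                  /* serialized against other cpus */
        lwkt_reltoken(&xp->x_token);
    }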
Now I am not entirely sure what the %fs:gd overhead is to load
'mycpu', but ignoring that issue for the moment, the rest of the
code will execute optimally within the processor's pipeline.
(Also, in DFly, since threads cannot be preemptively switched to
another cpu, the value of 'mycpu' is effectively fixed and
can be cached by a procedure).
The key issue here is that the optimal case becomes nothing more
complex than a few lines of standard C code, easily inlined and
easily pipelined by the cpu.
Another key performance issue is the fact that since the tokens do
not actually create deadlockable situations, a great deal of code
where you might be obtaining and releasing several mutexes in FBsd-5
should devolve into just a single token in DFly. In particular I am
thinking of cases within the lockmgr, cases where a mutex must be
passed as an argument in FBsd-5 so it can be released and reacquired
in a deeper procedure that blocks, and cases where you have to
pussyfoot around with mutexes in FBsd-5 to deal with lock ordering issues
(e.g. unlock B, lock A, relock B). I think that if you were to compare the
overhead of a single mutex to a single token you would not see much
of a difference... but the idea in DFly is to greatly reduce the number
of operations that would otherwise have to occur anyway, so it would be
a fairer comparison to compare two or three mutexes against a single
token.
(ultimately, of course, only a real live test can tell whether there
is an actual performance benefit).
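
To make the lock-ordering point concrete, here is an illustrative and
heavily simplified sketch of the dance the mutex style forces ('A' and
'B' are made-up mutex names, not real FBsd-5 code):

    mtx_unlock(&B);     /* drop B so A can be taken in the right order */
    mtx_lock(&A);
    mtx_lock(&B);       /* retake B... */
    /* ...and recheck any state guarded by B that may have changed */

The DFly equivalent would just hold a single token across the whole
operation; no dance is required because tokens do not create
deadlockable situations.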
: If you always IPI for a write, then all accesses to data structures
: protected by tokens need to be classified into either reads or
: writes.
:
: What about if you have multiple CPUs holding a token for reading only
: and then one comes along and wants to have the token for writing?
: What exactly happens? Do you have a way of figuring out whether
: another CPU holds the token for reading and an IPI is required, or do
: you always IPI everyone?
Ok, so the question is how do we detect when a token can or cannot be
given away to another cpu? It turns out to be fairly simple for the
exclusive case: since only one cpu can 'own' the token, if you
are on that cpu you basically do not have to do anything because you
already own the token (remember, the token is not a lock, only a
serializing entity). If you are on a different cpu you send an IPI
request to the cpu owning the token (tok->gd). The IPI code on the
target cpu simply checks to see if the token is being held by the
currently running thread. There are two ways of doing this: first,
if we take the simple case where a thread only owns one token,
the token may be held in gd->gd_cachetok. Otherwise the token would
be held in the current thread's token array:
    /*
     * IPI callback (basically like a FAST interrupt; entered with us in a
     * critical section):
     */
    get_token_remote(tokreq_t req)
    {
        globaldata_t gd = mycpu;
        lwkt_token_t tok = req->tok;

        if (tok->gd == gd) {
            if (gd->gd_cachetok != tok)
                tok->gd = req->cpu;     /* give the token away */
        } else {
            send_ipi(tok->gd, get_token_remote, req);  /* message chasing */
        }
    }
Now, that's a simplification. If a thread can own multiple tokens
then we would store them in an array in the thread and we would have
to iterate through the array... but since a thread will own either
zero or one token 'most of the time', the performance case is
the zero-or-one token case, not the multiple token case.
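
For reference, the per-thread token array might be laid out as below.
The td_numtoks and td_tokary names match the fields referenced later in
this mail; the array bound is just an assumption for illustration:

    struct thread {
        ...
        int          td_numtoks;            /* number of tokens held */
        lwkt_token_t td_tokary[MAXTOKENS];  /* tokens held by this thread */
        ...
    };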
MULTIPLE READER TOKEN CASE
Ok, now multiple-reader tokens (RCU style serialization). I think this
is a situation where we would need to have a cpu mask embedded in the
token structure and then use locked bus instructions to set and clear
bits in the mask. However, there is no requirement that we actually
clear the bit in the mask when we release the token, so the performance
case for RCU would be able to run without any locked bus cycles. The
trade-off is that if another cpu wants to get the token *exclusively*
(exclusive serialization) it would have to send an IPI to all the cpus
in the mask, which might include cpus that do not actually own the token
(a rough sketch of that sweep follows the code below).

In fact, we definitely do not want to clear the cpu bitmask bit in the
token when releasing it, at least not immediately, because that would
greatly reduce thread switching performance (the LWKT scheduler would
have to run through the tokens owned by the old thread and release their
cpu bits, then set the cpu bits on the tokens owned by the new thread).
So:

    lwkt_gettoken_shared(lwkt_token_t tok)
    {
        globaldata_t gd = mycpu;

        gd->gd_cachetok = tok;      /* interlock against IPI on current cpu */
        if ((tok->cpumask & (1 << gd->gd_cpuid)) == 0)
            atomic_bus_locked_bit_set(&tok->cpumask, ...);
        if (tok->gd == gd)          /* owned by current cpu (exclusively or shared), done */
            return;
        /*
         * Owned by another cpu; do more work to see if it is exclusively
         * owned by the target cpu, perhaps by sending an IPI.
         */
        ...
    }
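
And the exclusive-request sweep mentioned earlier might look roughly
like this (the helper names are assumptions for illustration, not actual
DFly code): every cpu whose bit is set in the mask gets an IPI, even
though some of those cpus may no longer actually hold the token.

    int cpu;

    for (cpu = 0; cpu < ncpus; ++cpu) {
        if (tok->cpumask & (1 << cpu))
            send_ipi(globaldata_find(cpu), release_shared_token, req);
    }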
Remember that for RCU operation obtaining 'read' access is either
inherent or very cheap, and obtaining write access is expensive.
: So this means you will have locked bus cycles for token acquisitions
: UNLESS the token was last held by the current CPU? How do you
: determine whether the token is cached locally? A critical section
: protecting the check of a flag so as to ensure an incoming IPI does
: not preempt your check?
This can be done by interlocking with a globaldata field, without
using a locked bus cycle or a critical section.
    globaldata_t gd = mycpu;

    gd->gd_cachetok = tok;      /* prevent IPI from giving token away */
    if (tok->gd == gd)          /* THEN check if our cpu owns the token */
        return;                 /* yes, done */
    /* no, request token via (expensive) IPI */
The IPI code runs on the target cpu, so if we own the token already
then we need only set gd_cachetok and any IPI sent to us will check it
and refuse to give the token away to another cpu.
For the multiple-token case the IPI code can scan the current thread's
token array (td == current thread):

    for (i = 0; i < td->td_numtoks; ++i)
        if (td->td_tokary[i] == tok)
            return;             /* cannot give token away */
This sort of interlock works because all the potentially conflicting
operations are running on the cpu owning the token and ONLY the cpu
owning the token, so no locked bus cycles are required and only a simple
interlock is needed to deal with IPI interrupts interrupting our code
just as we are getting or releasing a token.
If the IPI request cannot be satisfied, the IPI code can queue the request
to the token:

    req->next = tok->reqbase;
    tok->reqbase = req;
And then the release code can check for tok->reqbase != NULL and
satisfy the pending request(s) (or next pending request). Pulling
a request off the list would have to be done inside a critical section
since an IPI can add more requests to the list, but this is not a
performance critical case.
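
A rough sketch of that release-side handoff, under the same assumptions
about the request structure as the code above:

    /*
     * Sketch only: hand the token to the next queued requester.  The
     * list manipulation is done inside a critical section because an
     * incoming IPI may append further requests.
     */
    if (tok->reqbase != NULL) {
        crit_enter();
        req = tok->reqbase;
        tok->reqbase = req->next;
        tok->gd = req->cpu;     /* give the token to the requester's cpu */
        crit_exit();
    }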
: costs for acquiring a token for reading and writing will be? In most
: cases, from what you said, I'm expecting a typical acquisition to
: consist of a critical section at best (for a read or if the token is
: cached on the local cpu), sometimes a bus-locked instruction, and
: sometimes IPIs? Pardon the ignorance, but not yet having looked at
: your implementation, it is still unclear to me as to what a typical
: token acquisition/deacquisition will look like in DragonFly.
:
:Cheers,
:--
:Bosko Milekic * bmilekic at xxxxxxxxxxxxxxxx * bmilekic at xxxxxxxxxxx
Best case is as I outlined above... no critical section or bus locked
operation is required at all, just an interlock against an IPI on
the current cpu and that can be done by setting a field in
the globaldata structure.
In FreeBSD-5 this sort of interlock could very well be a bit more
expensive because in FreeBSD-5 you have no choice but to use a critical
section to prevent your thread from being preemptively switched to another
cpu, and even if you could get away without a critical section you would
still have no choice but to use integrated %fs:gd_XXX accesses to
access fields in your globaldata (again because outside of a critical
section your thread can be preemptively switched to another cpu).
The preemptive switching in FreeBSD-5 is the single biggest problem
it has. It makes it impossible to implement optimal algorithms
which take advantage of the per-cpu globaldata structure, because you
are forced to obtain one or more mutexes OR use a critical section
(which is a lot more expensive in FreeBSD-5 than in DFly) no matter
what you do. It's a cascade effect... and circular reasoning too.
We want to use mutexes, oh no mutexes obtained by interrupts may
conflict with mutexes obtained by mainline code, oh gee that means
we should do priority borrowing and, wow, that means that one mainline
kernel thread can preempt another mainline kernel thread indirectly,
oh fubar, that means that all of our globaldata accesses have to be
atomic (no caching of 'mycpu'), so we can't easily access per-cpu
globaldata based caches without using a mutex. See? circular reasoning.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>