lwkt in DragonFly

Matthew Dillon dillon at apollo.backplane.com
Sat Feb 7 18:04:14 PST 2004


:   How do you determine whether another CPU holds the token when your
:   thread on the current cpu wants/needs it?  Or do you always
:   synchronously IPI the other cpu when you want a token for writing?

    The cpu owning the token is stored in the token structure.  No
    bus-locked instruction is required to test (tok->gd == mycpu),
    because only our own cpu can give a token away to another cpu,
    which synchronously updates our cpu's cache so that tok->gd !=
    mycpu, and only the cpu owning the token can hand it off to us.
    This means tok->gd will only read back as == mycpu if another cpu
    has handed the token off to us, and once that has happened only our
    own cpu can give it up again.

    struct lwkt_token {
	globaldata_t gd;	/* cpu owning token */
	...
    };

    So, obtaining a token in the cached case could be as simple as this:

    lwkt_gettoken(lwkt_token_t tok)
    {
	globaldata_t gd = mycpu;

	if (gd->gd_cachetok)
		/* deal with previously fast-cached tokens */
	gd->gd_cachetok = tok;	/* interlock against IPI */
	if (tok->gd == gd)	/* optimal case, done */
		return;

	... fall through to less optimal code ...
	... leave gd_cachetok set to tok indicating that we want the token ...
	send an IPI to tok->gd requesting 'tok'.
	...
    }

    lwkt_reltoken(lwkt_token_t tok)
    {
	globaldata_t gd = mycpu;

	if (gd->gd_cachetok == tok) {
		gd->gd_cachetok = NULL;
		return;
	}
	... less optimal code ...
    }

    Now I am not entirely sure what the %fs:gd overhead is to load
    'mycpu', but ignoring that issue for the moment, the rest of the
    code will execute optimally within the processor's pipeline.
    (Also, since threads in DFly cannot be preemptively switched to
    another cpu, the contents of 'mycpu' are effectively fixed and
    can be cached by a procedure.)
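
    To illustrate that last point, a procedure can load 'mycpu' once and
    keep using the cached pointer across blocking token operations.  A
    minimal sketch (the function name and the per-cpu counter field are
    made up for illustration):

    example_token_user(lwkt_token_t tok)
    {
	globaldata_t gd = mycpu;	/* single %fs-relative load */

	lwkt_gettoken(tok);
	/*
	 * Even if lwkt_gettoken() blocked and we were switched away and
	 * back, 'gd' still points at our globaldata because the thread
	 * can never resume on a different cpu.
	 */
	++gd->gd_example_counter;	/* hypothetical per-cpu field */
	lwkt_reltoken(tok);
    }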

    The key issue here is that the optimal case becomes nothing more
    complex than a few lines of standard C code, easily inlined and
    easily pipelined by the cpu.

    Another key performance issue is the fact that since the tokens do
    not actually create deadlockable situations, a great deal of code
    where you might be obtaining and releasing several mutexes in FBsd-5
    should devolve into just a single token in DFly.  In particular I am
    thinking of cases within the lockmgr, cases where a mutex must be
    passed as an argument in FBsd-5 so it can be released and reacquired
    in a deeper procedure that blocks, and cases where you have to
    pussyfoot around with mutexes in FBsd-5 to deal with lock ordering
    issues (e.g. unlock B, lock A, relock B).  I think that if you were
    to compare the overhead of a single mutex to a single token you would
    not see much of a difference... but the idea in DFly is to greatly
    reduce the number of operations that would otherwise have to occur
    anyway, so a fairer comparison would be two or three mutexes against
    a single token.
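
    To make the lock-ordering point concrete, here is a purely illustrative
    fragment (the mutexes A and B and the token are made up; mtx_lock and
    mtx_unlock stand for the FreeBSD-5 mutex operations):

	/* FBsd-5 style: ordering says A before B, but we already hold B */
	mtx_unlock(&B);		/* drop B to respect the ordering */
	mtx_lock(&A);
	mtx_lock(&B);		/* reacquire B; its state may have changed */

	/*
	 * DFly style: a single token serializes the whole path, and since
	 * a token is only a serializing entity rather than a lock there
	 * is no ordering dance to perform.
	 */
	lwkt_gettoken(&tok);
	/* ... do the work ... */
	lwkt_reltoken(&tok);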

    (ultimately, of course, only a real live test can tell whether there
    is an actual performance benefit).

:   If you always IPI for a write, then all accesses to data structures
:   protected by tokens need to be classified into either reads or
:   writes.
:
:   What about if you have multiple CPUs holding a token for reading only
:   and then one comes along and wants to have the token for writing?
:   What exactly happens?  Do you have a way of figuring out whether
:   another CPU holds the token for reading and an IPI is required, or do
:   you always IPI everyone?

    Ok, so the question is how do we detect when a token can or cannot be
    given away to another cpu?  It turns out to be fairly simple for the
    exclusive case:  since only one cpu can 'own' the token, if you
    are on that cpu you basically do not have to do anything because you
    already own the token (remember, the token is not a lock, only a
    serializing entity).  If you are on a different cpu you send an IPI
    request to the cpu owning the token (tok->gd).  The IPI code on the
    target cpu simply checks to see if the token is being held by the
    currently running thread.  There are two ways of doing this:  First,
    if we take the simple case where a thread only owns one token,
    the token may be held in gd->gd_cachetok.  Otherwise the token would
    be held in the current thread's token array:

    /*
     * IPI callback (basically like a FAST interrupt; entered with us
     * already in a critical section):
     */
    get_token_remote(tokreq_t req)
    {
	globaldata_t gd = mycpu;
	lwkt_token_t tok = req->tok;

	if (tok->gd == gd) {
	    if (gd->gd_cachetok != tok)
		tok->gd = req->cpu;
	} else {
	    send_ipi(tok->gd, get_token_remote, req); /* message chasing */
	}
	}
    }

    Now, that's a simplification.  If a thread can own multiple tokens
    then we would store them in an array in the thread and we would have
    to iterate through the array... but since a thread will own either
    zero or one token 'most of the time', the performance case is the
    zero-or-one-token case, not the multiple-token case.
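
    A hedged sketch of how the callback above might look once a per-thread
    token array is taken into account (td_numtoks and td_tokary are the
    field names used further below; gd_curthread is assumed to point at
    the thread currently running on this cpu):

    get_token_remote(tokreq_t req)
    {
	globaldata_t gd = mycpu;
	lwkt_token_t tok = req->tok;
	thread_t td = gd->gd_curthread;
	int i;

	if (tok->gd != gd) {
	    send_ipi(tok->gd, get_token_remote, req);	/* message chasing */
	    return;
	}
	if (gd->gd_cachetok == tok)
	    return;	/* in use; a real version would queue the request */
	for (i = 0; i < td->td_numtoks; ++i) {
	    if (td->td_tokary[i] == tok)
		return;	/* in use, cannot give the token away */
	}
	tok->gd = req->cpu;	/* ok, hand the token off to the requester */
    }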

				MULTIPLE READER TOKEN CASE

    Ok, now multiple-reader tokens (RCU-style serialization).  I think this
    is a situation where we would need to have a cpu mask embedded in the
    token structure and then use locked bus instructions to set and clear
    bits in the mask.  However, there is no requirement that we actually
    clear the bit in the mask when we release the token, so the performance
    case for RCU would be able to run without any locked bus cycles.  The
    trade-off is that if another cpu wants to get the token *exclusively*
    (exclusive serialization) it would have to send an IPI to all the cpus
    in the mask, which might include cpus that do not actually own the
    token.

    In fact, we definitely do not want to clear the cpu bitmask bit in the
    token when releasing it, at least not immediately, because that would
    greatly reduce thread switching performance (the LWKT scheduler would
    have to run through the tokens owned by the old thread and release their
    cpu bits, then set the cpu bits on the tokens owned by the new thread).
    So:

    lwkt_gettoken_shared(lwkt_token_t tok)
    {
	globaldata_t gd = mycpu;
	gd->gd_cachetok = tok;	/* interlock against IPI on current cpu */
	if ((tok->cpumask & (1 << gd->gd_cpuid)) == 0)
		atomic_bus_locked_bit_set(&tok->cpumask, ...);
	/* owned by current cpu (exclusively or shared), we are done */
	if (tok->gd == gd)
		return;
	/* owned by another cpu, do more work to see if it is exclusively
	   owned by the target cpu, perhaps by sending an IPI */
	...
    }

    Remember that for RCU operation obtaining 'read' access is either
    inherent or very cheap, and obtaining write access is expensive.
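
    A hedged sketch of the expensive exclusive path implied above
    (send_ipi_mask, rel_token_remote, and the wait loop are assumptions
    made for illustration, not actual DFly code):

    lwkt_gettoken_exclusive(lwkt_token_t tok)
    {
	globaldata_t gd = mycpu;

	gd->gd_cachetok = tok;	/* interlock against incoming IPIs */
	if (tok->gd == gd && tok->cpumask == (1 << gd->gd_cpuid))
		return;		/* already the sole owner, done */
	/*
	 * Ask every cpu whose mask bit is set (some bits may be stale)
	 * to disown the token.  A cpu that no longer holds it simply
	 * clears its bit; the cpu that does own it hands tok->gd over
	 * to us, much as get_token_remote() does above.
	 */
	send_ipi_mask(tok->cpumask & ~(1 << gd->gd_cpuid),
		      rel_token_remote, tok);
	while (tok->gd != gd || tok->cpumask != (1 << gd->gd_cpuid))
		;	/* spin (or block) until ownership collapses to us */
    }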

:  So this means you will have locked bus cycles for token acquisitions
:  UNLESS the token was last held by the current CPU?  How do you
:  determine whether the token is cached locally?  A critical section
:  protecting the check of a flag so as to ensure an incoming IPI does
:  not preempt your check?

    This can be done by interlocking with a globaldata field, without
    using a locked bus cycle or a critical section.

    globaldata_t gd = mycpu;

    gd->gd_cachetok = tok;	/* prevent IPI from giving token away */
    if (tok->gd == gd)		/* THEN check if our cpu owns the token */
	return;			/* yes, done */
    /* no, request token via (expensive) IPI */

    The IPI code runs on the target cpu, so if we own the token already
    then we need only set gd_cachetok and any IPI sent to us will check it
    and refuse to give the token away to another cpu.

    For the multiple-token case the IPI code can scan the current thread's
    token array (td == current thread):

	for (i = 0; i < td->td_numtoks; ++i) {
		if (td->td_tokary[i] == tok)
			return;		/* cannot give token away */
	}

    This sort of interlock works because all the potentially conflicting
    operations run on the cpu owning the token, and ONLY on the cpu owning
    the token, so no locked bus cycles are required; only a simple
    interlock is needed to deal with an IPI interrupting our code just as
    we are getting or releasing a token.

    If the IPI request cannot be satisfied the IPI code can queue the request
    to the token:

	req->next = tok->reqbase;
	tok->reqbase = req;

    And then the release code can check for tok->reqbase != NULL and
    satisfy the pending request(s) (or next pending request).  Pulling
    a request off the list would have to be done inside a critical section
    since an IPI can add more requests to the list, but this is not a
    performance critical case.
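
    A hedged sketch of that release-side slow path (crit_enter()/crit_exit()
    stand for entering and leaving a critical section; the notification step
    and the exact tokreq_t layout are assumptions for illustration):

	if (tok->reqbase != NULL) {
		tokreq_t req;

		crit_enter();		/* an IPI may append to the list */
		req = tok->reqbase;
		tok->reqbase = req->next;
		tok->gd = req->cpu;	/* hand the token to the requester */
		crit_exit();
		/* ... notify the requesting cpu that it now owns the token ... */
	}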

:  costs for acquiring a token for reading and writing will be?  In most
:  cases, from what you said, I'm expecting a typical acquisition to
:  consist of a critical section at best (for a read or if the token is
:  cached on the local cpu), sometimes a bus-locked instruction, and
:  sometimes IPIs?  Pardon the ignorance, but not yet having looked at
:  your implementation, it is still unclear to me as to what a typical
:  token acquisition/deacquisition will look like in DragonFly.
:
:Cheers,
:-- 
:Bosko Milekic  *  bmilekic at xxxxxxxxxxxxxxxx  *  bmilekic at xxxxxxxxxxx

    Best case is as I outlined above... no critical section or bus-locked
    operation is required at all, just an interlock against an IPI on
    the current cpu, and that can be done by setting a field in
    the globaldata structure.

    In FreeBSD-5 this sort of interlock could very well be a bit more
    expensive, because in FreeBSD-5 you have no choice but to use a
    critical section to prevent your thread from being preemptively
    switched to another cpu.  Even if you could get away without a
    critical section you would still have no choice but to use integrated
    %fs:gd_XXX accesses to reach the fields in your globaldata (again
    because outside of a critical section your thread can be preemptively
    switched to another cpu).

    The preemptive switching in FreeBSD-5 is the single biggest problem
    it has.  It makes it impossible to implement optimal algorithms
    which take advantage of the per-cpu globaldata structure, because you
    are forced to obtain one or more mutexes, OR use a critical section
    (which is a lot more expensive in FreeBSD-5 than in DFly), no matter
    what you do.  It's a cascade effect... and circular reasoning too.
    We want to use mutexes; oh no, mutexes obtained by interrupts may
    conflict with mutexes obtained by mainline code; oh gee, that means
    we should do priority borrowing and, wow, that means that one mainline
    kernel thread can preempt another mainline kernel thread indirectly;
    oh fubar, that means that all of our globaldata accesses have to be
    atomic (no caching of 'mycpu'), so we can't easily access per-cpu
    globaldata-based caches without using a mutex.  See?  Circular
    reasoning.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>




