lkwt in DragonFly

Fri Feb 6 18:36:47 PST 2004

    First, Bosko, thanks for asking your question!  You really got me 
    thinking about our token implementation.  I've been whining about it
    being broken on our mailing lists for quite a while and as I was 
    formulating a reply to your email I had an epiphany.  So I hope you
    don't mind me having CC'd your email and my reply with the kernel@ list!

:Matt,
:
:    Reading this:
:
:Token Passing Primitives Not Mutexes
:...
:
:   Do you intend to have a single pcpu token on a given cpu protecting
:   all data structures (or a class of data structures)?  The tradeoff of
:   the lesser-granularity vs the admittedly much lower cost for a common
:   token acquisition may or may not be worth it in the end.  (The other
:   part to the tradeoff is that you can never go with full-preemption).
:   I did understand this correctly, right?
:
:Cheers,
:-- 
:Bosko Milekic  *  bmilekic at xxxxxxxxxxxxxxxx  *  bmilekic at xxxxxxxxxxx

    Not entirely.  First, please note that the current token implementation
    does not work very well... it's a mess to use because cpu #B can steal
    away cpu #A's token by sending cpu #A an IPI message requesting the token.
    This means cpu #A has to be in a critical section to 'protect' the token,
    and must re-acquire the token (in case it was lost) after it blocks.

    After thinking about this quite a bit I have come up with a solution
    that, in a moment of enlightment, would make our tokens operate like
    self-contained little RCU implementations (superior to Linux's 
    implementation, in fact!).

    I intend to fix our tokens by not allowing an IPI to steal a token until
    the owning cpu has gone through a thread switch, or by explicit
    release.  This means that, once fixed, our token implementation will
    allow us to abstract RCU-like operations, which I think is very 
    important.  It is not what I originally intended tokens to be
    but it is turning out to be what they should be.

    The nice thing about the revamp I intend to do on the token code is that
    the RCU abstraction will wind up being a lot better then Linux's RCU
    abstraction, because it will be compartmentalized.  Only cpus competing
    for the same token are effected rather then all cpus in the system.
    We would not have the serialization problem that Linux has.

    Tokens will be used in a manner similar to how mutex are used in
    FreeBSD-5, except that in our case (with the revamped token code)
    DFly will be able to hold any number of tokens across a blocking
    situation without creating deadlocks (but also not guarenteeing
    atomicy across the blocking situation).  I think this is exactly the
    sort of abstraction we need.  Mutexes in FBSD-5 and tokens in DFly
    are primarily used as interlocks on structures, for which no blocking
    occurs, or in situations where the blocking condition is well manageed
    (e.g. lockmgr).  I feel that my new token (RCU-like) implementation
    will wind up being far superior to FreeBSD-5's mutex implementation
    for 90% of these situations and it will not suffer from the deadlock
    potential that mutexes have because it will not guarentee atomicy
    across a blocking call.

    Also note that several major DragonFly subsystems (LWKT scheduler and
    SLAB allocator) are totally mutex & token free and things like the
    network stack will also be mostly mutex & token free because it will
    assign PCBs to particular threads (so operations on those PCBs are only
    done by a particular thread which in turn means no locking of the PCB is
    required at all).

    The performance trade off is basically the ability for the cpu owning a
    token to operate on the token without having to use bus-locked
    instructions, verses having to send an IPI message to the owning cpu
    when the requesting cpu does not own the token.  This is not the major
    reason for using tokens, though.  Still, judging by the benchmark
    comparisons we've done between 4.x and DFly, the overhead of a 'miss'
    appears to be very small.  It turns out that most locks (like vnode locks)
    hit the same-cpu case a *LOT* more then they hit the foreign-cpu case
    so the overall benefit is at least equivalent to FreeBSD-5's mutex
    overhead.  It certainly isn't worse.  We do lose when it comes to 
    IPC operations between two processes running on difference CPUs, but 
    if this is the only thing that winds up being a problem it is a simple 
    enough case to fix.

    However, the major reason for using tokens is that it creates an
    abstraction that allows us to execute many types of operations as if
    the system had only one cpu.  Since a token is owned
    by a cpu and not a thread, if thread #1 obtains a token and blocks, 
    thread #2 running on the same cpu can obtain that same token optimally.
    Likewise, thread #3 running on a different cpu can 'steal' the token
    (but it only gets transfered on a scheduler boundary of the first cpu
    once I fix the current token code).  We also have the advantage of being
    able to use this abstraction heavily without compromising UP performance,
    because on a UP box a blocking situations implies a scheduler switch
    and since tokens are per-cpu... well, the whole thing devolves into a NOP.

    I have some work to do to move my current token implementation into this 
    new paradigm, but I do not expect it to be difficult.  For example,
    if cpu #B steals a token from cpu #A then the threads that own the token
    on cpu #A can't be scheduled to run again until cpu #B is through with
    the token.  This means actively accounting for and structurally linking
    the dependancy information.

    Whew!  That may seem complex, and in a mutexed environment it would be
    very complex.  But remember in DFly the whole basis of using this 
    abstraction is to be able to operate almost as if one were on a single
    cpu even when one is on a 16-cpu system, so the actual structural
    accounting does not really have to worry about interrupts or stale data
    or other cpus.  A critical section is sufficient to operate on token meta
    data on the owning cpu, and we can actually defer structural linkages
    within tokens we own until a thread switch occurs (which won't happen in
    most cases since like a mutex a token is obtained and released without
    blocking most cases).

    I think I rambled on there.  Please feel free to ask additional 
    questions!

						-Matt