User/Kernel MPI [LONG, Newbie, Questions]
    Nathaniel W. Filardo 
    nwf at andrew.cmu.edu
       
    Mon Feb  7 02:14:26 PST 2005
    
    
  
Hello all.
	I'm writing mostly because I'm curious about the topic; I'm 
probably not sufficiently an ubercoder to actually write the 
implementation but I'd still like to understand the design and 
implementation.  If I'm jumping the gun and wondering about something 
that's not yet on the radar, let me know.  Also, apologies that this email 
is damn long.
First off, is there any documentation on the messaging and threading model 
beyond those on www.dragonflybsd.org/goals ?
Anyway, my friend and I spent a lot of time thinking about the proposed 
threading/messaging model and developed it more fully than the docs we'd 
read (he's looking for a project for a class; seems to have moved on to 
L4/HURD but I'm still curious about DFly).  So really, can somebody tell 
me if the below is correct or at least on the right track?
=== Scheduler changes
First off, we introduce a userspace thread scheduler to replace the 
current, entirely in-kernel userland scheduler (similar in concept to 
NetBSD's scheduler activation sa_* API).  This exposes to the kernel an 
entrypoint which may be called with arguments (such as which signals are 
pending, possibly with messages pending - if scatter/gather queues are 
used the head of each new chain must be passed in; if the buffer is 
static, it can be created once, registered with the kernel, and never 
passed back from the kernel - this is really an implementation detail, 
though I think the former is preferable as an initial opinion).  Details 
of when this is called are entangled below.
Secondly, we have to introduce some kind of shared state between the 
kernel and userspace.  If somebody has a way of avoiding this, do let me 
know [the NetBSD people seem to use lots of syscalls which seems worse, to 
me].  Firstly, the userland scheduler must inform the kernel of how many 
threads are currently runnable (RUNNABLE_COUNTER).  This implies that the 
userland scheduler needs to provide, as one would expect, thread_yield() 
and similar functions which manage this counter in a MP-safe way.  The 
kernel merely reads this number and the worst case of misreading is a 
delayed schedule() or a spurious schedule().  If the kernel additionally 
knows the locking mechanism employed it can delay reading until the lock 
is released.  Note that while this number is redundant against the data 
structures below, its presence allows the kernel not to keep persistent 
pointers to thread data structures.
In addition to the userland scheduler's per thread data structures, there 
must be a per-thread shared data structure as well - primarily it must 
contain the thread's runnable state and its context (stored registers). 
We introduced several values for this thread state to take on: ACTIVE, 
RUNNABLE_YIELDED, RUNNABLE_PREEMPTED, BLOCKED_USERLAND, and 
BLOCKED_KERNELSPACE.  Note well that despite the presence of these data 
structures, they are *NOT* persistently pointed to by kernelspace entities 
(excepting for BLOCKING_KERNELSPACE threads - see below), which means that 
userland is free to create or destroy threads at its whim without 
notifying the kernel (though failure to update the RUNNABLE_COUNTER may 
have adverse effects on system performance - both cases [spurious 
schedule() and missed schedule()] are probably traceable in the kernel and 
the process can be reprimanded appropriately).
 + ACTIVE threads are those that are currently running.  The userland 
scheduler NEVER jumps into an active thread; doing so would lead to 
undefined results, as one expects.  The kernel will only see threads of 
this type, excepting when it has to annotate one to BLOCKED_KERNELSPACE 
(see below).
 + RUNNABLE_YIELDED threads are those that voluntarily gave up their 
timeslice via yield(), while RUNNABLE_PREEMPTED are those that the kernel 
preempted at the end of their timeslice.  The distinction is drawn merely 
to enable different thread scheduling algorithms, in some sense.  Note 
that _YIELDED state is assertable by userland in thread_yield() (which 
must also save %eip and the other registers...) but that the kernel must 
WRITE TO the shared data-structure to transition the thread to _PREEMPTED 
(in addition to updating the state the kernel must write its register dump 
there).  The kernel will never see a thread in this state [though it will 
put one in _PREEMPTED... see below].  Preemptable threads requires 
kernel-side support as far as we can tell, but cooperative multitasking 
could do away with that little bit of instrumentation.
 + BLOCKED_USERLAND is the result of a mutex_lock() or msg_waitfor() on a 
purely userland construct.  In all probability the userland-only 
per-thread data-structure will contain additional data for the thread 
scheduler's decision process.  These processes will never be running, so 
the kernel will never see this state.
 + BLOCKED_KERNELSPACE is a very unfortunate state.  It means that the 
thread is in the process of page faulting or similar but that the kernel 
was unable to service the request immediately.  In the event of a 
fast-fault (e.g., mmap'd region with the backing data in page cache, some 
syscalls), this state is avoided as the ACTIVE state prevents the 
scheduler from running the thread and we'll soon be able to return 
(detrampoline) to it.  However, in slow-faults (e.g., swapstorm and we 
just asked for a page way off on disk, most syscalls), to avoid truly 
blocking, the essentially synchronous message that was the page fault 
should be downgraded to an ASYNC message and the kernel's userland 
scheduler reentered.  The exact details of this SYNC->ASYNC conversion are 
not exactly clear but the end result must be that a (set of) LWKT(s) is 
spawned to wait on the result with ultimate goal of reversing the thread's 
annotation to RUNNABLE_PREEMPTED state [in order to facilitate this, the 
kernel obviously needs to keep a pointer out to userland.  It will produce 
undefined results (probably crash) to delete the data structure while the 
kernel is in this state because then it will blindly annotate something 
else.  This (and RUNNABLE_PREEMPTED, really) smells like a rich area for 
bugs and exploits... if somebody has something cleaner, I'm all ears, of 
course.]
The reason that BLOCKED_KERNELSPACE must be introduced is to decouple the 
kernelspace from userland so that the kernel does not have to have 
contexts stored per thread, merely per blocking thread, as well as so the 
userland thread scheduler can, on the next reentry, note that a previously 
active thread has been blocked an can reduce the RUNNABLE_COUNTER 
appropriately.  [This update cannot be done from kernel due to 
MP-safeness].
So the biggest change, all in all, is that the kernel's userland scheduler 
no longer resumes a context - it merely jumps in to the process' thread 
scheduler and leaves the decision of which thread runs to that process. 
This allows for different threading models to match workloads rather than 
trying to tune the global scheduler to deal with it.
=== Message Passing
Now, given all of the above we can start using message passing system 
calls in a real way.  It's possible to use the above without MPI syscalls 
as well as to use MPI syscalls without the above, but neither of those 
seem to be desirable.
In the MPI syscall model, instead of faulting, we will call the 
targetPort->send_msg() function, which may fault or SYSENTER or as 
appropriate in the SYNC case, but we won't see that code or any of the 
instrumentation (debugging, IPC, userland drivers / VFS, etc) that the MPI 
allows - it's been nicely encapsulated away from us.  [Incidentally, the 
MPI can be used safely for IPC as long as the ports are used essentially 
as file descriptors and the ->send_msg() functions are provided by a 
trusted library, but that's another discussion.]
If we send an ASYNC message, our behavior is quite clear.  We go on our 
business and at some point inform the userland thread scheduler that we're 
willing to be blocked on the arrival of a message for us on targetPort 
(timeout optional).
In the event of a SYNC message, there are several cases:
 + The message is handled in full, synchronously, right now.
 + The message is immediately converted to ASYNC and somebody else blocks 
for us [immediate return in this case would be a bug].
 + The message is synchronously trampolined into the kernel and later 
would block.  In this case the kernel must downgrade the remainder of the 
message to ASYNC, spawn off a context (or several, again, depending on 
details of implementation) and sleep the context on the reception of an 
answer to the ASYNC message in flight.  Here the thread is annotated by 
the detrampoline code to BLOCKING_KERNELSPACE and the kernel userland 
scheduler is invoked.
Example of the latter case (possibly incorrect in terminology / details, 
but should give the idea).  This case is overly long but is robust against 
subsystems taking locks, which merely swinging the entire stack to the 
side (spawning LWKT at the innermost layer only) would not be:
  user calls read() in a way that triggers SYNC messages
  read() generates a SYNC message and sends it off
  ->send_msg() trampolines into the kernel synchronously
  the message is routed to the VFS and parsed
  the VFS calls (C style) into the filesystem
  the FS calls (C style) into the block device layer
  the block device layer realized that is has to block.  At this point
    there is a kernel context (stack) associated with everything up to
    this point.  Here the message will be downgraded to ASYNC and
    dispatched to the appropriate block device after the block layer
    creates or reserves a message port.  The block layer spawns a thread
    to wait on the message it sent.
  the block device layer returns a handle to that port back to the FS
    along with an error indicating that it had to block.
  the FS spawns a thread to wait on this port as well as generate a port
    and error return code to the VFS.
  the VFS does the same, returning to
  the detrampoline code recognizes the error and spawns off a thread to
    wait for reception of the message from the VFS to enqueue the result
    for the process' userland thread scheduler.  It annotates the
    currently running thread to BLOCKING_KERNELSPACE.  It then jumps over
    to the kernel process scheduler.
At the end, the kernel stack is fully unwound, having been transferred to 
LWKTs waiting for an answer.  This is probably not ideal, but it works and 
is robust against locking of any type inside the subsystems; I'm sure 
somebody has a better idea.
Why SYNC messages at all?  The short answer is performance.  If every 
inter-subsystem call is translated into an asynchronous message, the 
results can be disastrous for performance, as the kernel can end up 
spending more cycles in message queue management than in actually doing 
the task requested.
=== Implementation details & misc.
+ The mechanisms of segmentation that can be used to give rise to thread 
local storage can be used for all of this with the exception of 
RUNNABLE_COUNTER.  By merely reserving a chunk of memory at the top of the 
TLS area (alternatively, the thread's stack and use a segment for that), 
the kernel can always know the current thread's data structure location 
(It'd be in the segment register) without having to have any notion of 
thread IDs or similar.  This may slightly impact the ABI but I do not 
believe it must as TLS already exists.  This may also have impacts against 
such VMMs as Xen.
+ RUNNABLE_COUNTER can instead be implemented as a syscall-updated 
in-kernel quantity (rather than shared state in the memory area).  This 
has the advantage that the kernel can decrement it on transition to 
BLOCKED_KERNELSPACE as well as always ensure its validity, but the 
disadvantage is that the userland must trampoline up for every thread 
spawned or killed, which may be bad for performance of certain, 
thread-heavy workloads (MozartOZ programs leap to mind, ignoring that 
Mozart currently does its own threading - in an ideal world it could spawn 
real threads for it).
+ ps & friends now have no process-agnostic way of showing threads if 
RUNNABLE_COUNTER is managed in userland.  Every process must be asked for 
how many threads it has.  Perhaps this is another advantage of it being a 
syscall.  Note that it does not otherwise impact viewing process 
information.
+ Register store and restore must be done in userspace on occasion.  I 
assume this isn't much of a problem but I am unfamiliar with the exact 
details of all the permission bits present on x86.
OK, well, that's enough for now. I eagerly await responses.
--nwf;
    
    
More information about the Kernel
mailing list