You could do worse than Mach ports

dave leimy2k at mac.com
Fri Jul 18 11:49:40 PDT 2003


First off... thanks for answering all of these questions... you have
been getting the waterfall treatment the last few days :).

I have questions below :).
On Friday, July 18, 2003, at 01:12 PM, Matthew Dillon wrote:
:	I am REALLY interested in the user->user case...  In fact I've
:	had some ideas that have to do with Mach-like port permissions
:	and exposed regions of memory based on the MPI-2 One Sided [RDMA]
:	semantics.  However I haven't gotten very far with this idea on
:	paper to even decide if it's doable....  I want to expose a
:	region of memory for outside access to a known group of processes
:	or threads for a certain amount of time, at which point that
:	memory could be thought of as "don't touch" for the duration of
:	the epoch... All accesses to that memory could be considered
:	"remote" during the epoch, using only "put" and "get" requests,
:	relying on the VM layer to do the writes and reads to the memory
:	in the exposed buffer... even for the local process.
:
:	Does this sound feasible?  Like I said, I haven't gotten very
:	far.... but this API is present, more or less, on high-speed
:	Infiniband hardware drivers as well as Myrinet GM, where there
:	is hardware to do the DMA accesses needed to prevent interrupting
:	the CPUs of remote nodes so they can continue crunching data
:	while messages flow through a cluster.  It's quite beautiful and
:	elegant in that context.
:
:	In user<->user messaging it would just be a natural extension, I
:	think, of this idea.  However I have not accounted for context
:	switches and other things that may need to occur in a
:	BSD/Unix-like kernel and that may make this design horrible.

    Well, anytime you have to play with VM mappings you incur a
    horrible cost versus not having to, and just making a direct call.
    I guess it depends on how much data the user process winds up
    manipulating outside of the don't-touch periods.

    In an implementation of the above scheme it might be easier simply
    to make the memory don't-touch all the time, rather than just
    during the epoch, and rely on "put" and "get" to do the right
    thing.
I think I understand... this would be a special buffer that is
allocated for the purpose of exposure.  I was thinking about having
the option to expose a simple array of bytes that may already exist...
eliminating further copies.... perhaps I should crawl before I walk
though :).
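
For reference, this is roughly what the MPI-2 one-sided model I'm
borrowing from looks like in practice: an existing array is exposed
as a "window", the epoch is delimited by fences, and all access during
the epoch goes through put/get.  A minimal sketch using the standard
MPI-2 calls (compile with mpicc, run with at least two ranks; error
handling omitted):

/*
 * Expose an existing buffer as an MPI-2 one-sided "window" and touch
 * it only via put/get during the fence-delimited epoch.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int buf[1024] = { 0 };
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose a pre-existing array of ints; no extra copy is made. */
    MPI_Win_create(buf, sizeof(buf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Open the access epoch: the window is "don't touch" locally. */
    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* "put" into rank 1's window; its CPU is not interrupted. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    /* Close the epoch; local loads and stores are legal again. */
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("buf[0] = %d\n", buf[0]);    /* 42 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}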

    If the user process is manipulating a *LOT* of data then a
    double-buffered approach might be the way to go, where the user
    process always has access to a kernel-provided writable buffer
    and when it stages it into the kernel the kernel replaces the
    user VM page with a new buffer (making the old one the sole
    domain of the kernel).
So the kernel would swap out the buffer that was previously owned in
user space to its own ownership and replace that buffer with something
of equal size.

Are you saying that the puts and the gets would go to the kernel
buffer only... until the process that caused the buffer-swap tells
the kernel it wants its buffer back?
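
Just to convince myself I follow the hand-off, here is a little
userland analogue of that staging idea, with a pointer swap standing
in for the actual VM page replacement (purely illustrative, nothing
kernel-specific):

/*
 * Userland analogue of the double-buffered staging scheme: the
 * producer always holds a writable buffer, and staging hands the
 * whole buffer over while swapping in a fresh one of equal size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSZ 4096

struct stage {
    char *user_buf;     /* always writable by the producer */
    char *staged_buf;   /* sole domain of the consumer after staging */
};

/* Hand the current user buffer over and replace it with a new one. */
static void stage_buffer(struct stage *s)
{
    free(s->staged_buf);             /* retire the previous staging */
    s->staged_buf = s->user_buf;     /* consumer now owns these pages */
    s->user_buf = calloc(1, BUFSZ);  /* fresh writable buffer */
}

int main(void)
{
    struct stage s = { calloc(1, BUFSZ), NULL };

    strcpy(s.user_buf, "message #1");
    stage_buffer(&s);
    /* The producer keeps writing immediately; the staged data is
     * unaffected by anything it does from here on. */
    strcpy(s.user_buf, "message #2");

    printf("staged: %s\n", s.staged_buf);   /* message #1 */
    printf("user:   %s\n", s.user_buf);     /* message #2 */
    free(s.user_buf);
    free(s.staged_buf);
    return 0;
}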


    I'm not sure whether these schemes would apply to DragonFly.
    There are DMA issues in all UNIXes which I believe only NetBSD has
    solved with UVM so far.  In FreeBSD vfs_busy_pages() is
    responsible for preventing user writes to pages undergoing write
    I/O.  In DragonFly we will eventually replace that with a COW
    scheme like UVM has.
Well, I hadn't thought it out that deeply... but MPI doesn't say the
user "can't" write to the pages during an access epoch.  The standard
does say that if you do, you completely invalidate all guarantees for
the consistency of the buffer.  I would be comfortable with that in an
IPC system for local processes as well.
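
As an aside, the COW behavior you mention is the same principle one
can see from userland with a private mapping across fork(): the page
is shared until the first write, then the writer transparently gets
its own copy.  A toy demonstration, nothing DragonFly-specific:

/*
 * Copy-on-write in miniature: a MAP_PRIVATE anonymous page is shared
 * across fork() until the child writes to it, at which point the VM
 * layer gives the child a private copy and the parent's page survives.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    strcpy(page, "original");

    if (fork() == 0) {
        /* First write faults; the child gets a private copy. */
        strcpy(page, "scribbled");
        _exit(0);
    }
    wait(NULL);
    printf("parent still sees: %s\n", page);    /* "original" */
    return 0;
}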


:
:	Queueing occurs only in the synchronous case then?  I need to
:	see that AmigaOS :).

    Queueing only occurs in the asynchronous case.  In the synchronous
    case the port agent completes processing of the message and
    returns a synchronous error code (which is anything other than
    EASYNC).
That makes more sense :).

    Of course, the port agent is free to do whatever it wants... it
    could very well use queueing internally and spin on the result,
    then return a synchronous result.  It is 'opaque', though of
    course the intent is for it to queue and return EASYNC instead of
    blocking in that case.

Sure... the exposed behavior to the end user is all that must be
consistent.
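
So, if I have the contract right, the caller-visible rule is just
"EASYNC means queued, anything else is the final result".  A
hypothetical sketch; every name in it is made up for illustration and
is not the actual DragonFly API:

/*
 * Sketch of the port-agent contract: the agent either completes the
 * message in-line and returns the final error code, or queues it and
 * returns EASYNC so the caller waits for the reply message instead.
 */
#include <stdio.h>

#define EASYNC 0x1000   /* distinct from all normal error codes */

struct msg {
    int cmd;
    int queue_me;       /* demo knob: take the asynchronous path? */
};

struct port {
    /* The agent is the whole abstraction: callers never see whether
     * it queues, spins internally, or completes in-line. */
    int (*mp_putmsg)(struct port *self, struct msg *m);
};

/* One possible agent: complete some requests in-line, queue others. */
static int demo_putmsg(struct port *self, struct msg *m)
{
    (void)self;
    if (!m->queue_me)
        return 0;       /* processed synchronously; result is final */
    /* (enqueue m on the port's message queue here) */
    return EASYNC;      /* caller must wait for the reply */
}

int main(void)
{
    struct port p = { demo_putmsg };
    struct msg sync_msg  = { 1, 0 };
    struct msg async_msg = { 2, 1 };

    int r1 = p.mp_putmsg(&p, &sync_msg);
    int r2 = p.mp_putmsg(&p, &async_msg);
    printf("sync returned %d; async queued? %s\n",
           r1, r2 == EASYNC ? "yes" : "no");
    return 0;
}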

    I can see cases where a port agent might occasionally block...
    under exceptional circumstances that are outside the critical
    path, such as in critical low memory situations.
But if a port agent blocks, perhaps the whole process isn't active
anymore... to the user, a non-blocking appearance could be maintained
[as long as one doesn't hang the kernel trying to achieve it].

Thanks again Matt,

Dave

					-Matt
					Matthew Dillon
					<dillon at xxxxxxxxxxxxx>





