[netmp] socket accesses

Fri Aug 8 07:20:27 PDT 2008

OK, long mail, but if you have *any* familiarity with the network stack I
really want your opinion! Please try skimming through it at least and
point out anything you think might be a problem. You don't need to
do much research; a simple "I think there might be a problem in the
interaction with $FOO, please recheck" will do.

If you haven't been paying attention, see here too

http://wiki.dragonflybsd.org/index.cgi/NetMP

Short story: sockets are accessed both from the protocol threads
and processes running in the kernel. Currently, only the mp lock
protects the socket. This needs to change.

I've been going through the struct socket fields, and so_state stands
out as a potential source of trouble. Let's focus on the send/receive
paths at first:

flags used in	tcp_input: SS_CANTRCVMORE, SS_NOFDREF, SS_RCVATMARK
		soreceive: SS_ISCONFIRMING, SS_CANTRCVMORE,
			SS_ISCONNECTED|SS_ISCONNECTING, SS_RCVATMARK
		tcp_output: none
		sosend: SS_CANTSENDMORE, SS_ISCONNECTED, SS_ISCONFIRMING,
		udp_input: none
		udp_output: none

SS_ISCONFIRMING is never set and will be removed.

Taking a closer look, soreceive() tests SS_ISCONNECTED|SS_ISCONNECTING in

if ((so->so_state & (SS_ISCONNECTED|SS_ISCONNECTING)) == 0 &&
    (pr->pr_flags & PR_CONNREQUIRED)) {
	error = ENOTCONN;
	goto release;
}

Now, for TCP, if my understanding is correct, if either SS_ISCONNECTED or
SS_ISCONNECTING are 1, they will stay 1 while soreceive is running. This
is because the connection cannot be disconnected until we do a close() or
at least a shutdown() and we assume that only one thread/process can be
messing with the socket at a time. OTOH, if SS_ISCONNECTED and SS_ISCONNECTING
are both 0, they cannot go to 1 unless we issue a connect(), so again we're
safe. For UDP, PR_CONNREQUIRED is not set; other protocols will be under the
BGL.

sosend() uses SS_ISCONNECTED like this:

if ((so->so_state & SS_ISCONNECTED) == 0) {
	/*
	 * `sendto' and `sendmsg' is allowed on a connection-
	 * based socket if it supports implied connect.
	 * Return ENOTCONN if not connected and no address is
	 * supplied.
	 */
[S1]	if ((so->so_proto->pr_flags & PR_CONNREQUIRED) &&
	    (so->so_proto->pr_flags & PR_IMPLOPCL) == 0) {
		if ((so->so_state & SS_ISCONFIRMING) == 0 &&
		    !(resid == 0 && clen != 0))
			gotoerr(ENOTCONN);
	} else if (addr == 0)
[S2]	    gotoerr(so->so_proto->pr_flags & PR_CONNREQUIRED ?
		   ENOTCONN : EDESTADDRREQ);
}

Happily, PR_IMPLOPCL is set for TCP and PR_CONNREQUIRED is clear for UDP,
so the "if" in [S1] is always false for the cases I care about. In [S2],
if the socket is not connected then either

a) it's not in SS_ISCONNECTING: unless we issue a connect() (which as explained
above can't happen while sosend() is running), so_state can't go to
SS_ISCONNECTED so we're OK.

b) it is in SS_ISCONNECTING, so we're talking about TCP: the socket state CAN
change to SS_ISCONNECTED after our check. But, since we're the only ones
accessing the socket and we are not in a connect() system call (since we're in
sosend()), this means a connect() completed asynchronously. In that case, we
can just pretend that userspace tried to send while we were still trying to
connect; userspace has no way of knowing that is not the case.

BTW, you'd think the connection state would only be modified by the protocol
thread, but kern_connect() takes it upon itself to clear SS_ISCONNECTING if
there was an error. I'm inclined to change so->so_state &= ~SS_ISCONNECTING
to KKASSERT(!(so->so_state & SS_ISCONNECTING)) and fix any protocol connect
functions that hit the assertion to clean up after themselves...

Am I missing anything in the above analysis? It all seems a bit too neat; if
that's how things are, we needn't bother about the connection state changing
from under us... I'm worried I'm "seeing" what I'd like to be true here.

Turning to SS_CANTRCVMORE, tcp_input() references it in two ways:

if (so->so_state & SS_CANTRCVMORE) {
	m_freem(m);
} else {
	m_adj(m, drop_hdrlen); /* delayed header drop */
	ssb_appendstream(&so->so_rcv, m);
}

and

if (so->so_state & SS_CANTRCVMORE) {
	soisdisconnected(so);
	callout_reset(tp->tt_2msl, tcp_maxidle,
		      tcp_timer_2msl, tp);
}

this appears hopeless; one can maybe try to live with the race in the first
case (doing extra cleanup) but in the second case that would be very ugly. The
obvious "solution" is to require the protocol thread to obtain a socket lock
before messing with this flag (this requires "breaking" so_state in two
different fields). Same goes for SS_CANTSENDMORE. This isn't a "common" case,
but it has the disadvantage that it can force the protocol thread to spin for
a while... Great Ideas (tm) welcome.

Oh and if that's acceptable, requiring locking for so_oobmark and SS_RCVATMARK
is a no-brainer. I don't think it's worth my time to even investigate
lockless oob data delivery...

AFAICT SS_ASYNC needs no changes. It's only set/cleared by soo_ioctl and tested
in sowakeup(). It should be okay if some thread changes it asynchronously.

WRT SS_NOFDREF:

- it is set on error by socreate(), but at this point the socket is not
  reachable yet
- it is set by soclose() and I think that if the socket is reachable at this
  point it's a bug anyway (someone may try to access it after it's been
  free()ed). [edit: well, apparently that is not the case: soo_close() and
  some network filesystems remove the reference to the socket after calling
  soclose(). That should be easy to fix, right?]
- it is cleared by soaccept(); soaccept() is called by the old netgraph (do we
  care?) and kern_accept(). The socket passed to soaccept() is definitely not
  reachable from any other userspace thread (no fd pointing to it yet). From
  what I can tell the proto thread shouldn't be messing with SS_NOFDREF for
  a socket that is just being accepted (XXX: not 100% sure though).
- it is set by sonewconn(); the socket was just created and is not yet visible
- it is cleared by sys_sctp_peeloff(): don't care, don't want to know
- it is cleared in tcp_attach(). Two paths end up calling tcp_attach() (through
  ->pru_attach()): one starts from sonewconn() and the other from socreate().
  In both cases this is a new socket.

So, unless I'm missing something, SS_NOFDREF is only modified before the socket
becomes relevant or after the socket is no longer reachable.

Moving to so_incomp:

it's modified by sofree() (only called by protocol thread), sonewconn (same) and
soisconnected(). The latter is called from process context (fifo_open,
portal_connect at least, maybe sctp too) and from the bluetooth netisr.

->so_comp is modified by soclose (called from process context), soisconnected,
sonewconn (see above), soaccept_predicate, sctp_get_peeloff (of course!),
ng_ksocket_finish_accept().

Assuming we can move all list modification in protocol thread context, the
list traversal would still be tricky. Maybe a spinlock protecting the lists and
queue lengths is in order for now? The same lock could be reused for protecting
SS_INCOMP, SS_COMP. In the future we might try something more clever if this
becomes a performance issue. Opinions?

The other so_* fields seem to be easier; I'll send a separate mail for those.

TIA,
Aggelos