Turn on adaptive MPSAFE for network threads, MPSAFE for IP and ARP

Wed Mar 11 04:36:32 PDT 2009

On Wed, Mar 11, 2009 at 8:09 AM, Matthew Dillon
<dillon at apollo.backplane.com> wrote:
>
> :On my Phenom9550 (2GB memory) w/ dual port 82571EB, one direction
> :forwarding, packets were evenly spread across the cores.  INVARIANTS is turned
> :on in the kernel config (I don't think it makes much sense to run a
> :system without INVARIANTS).
> :
> :...
> :
> :For pure forwarding (i.e. no firewalling), the major bottleneck is on
> :the transmit path (as I have measured, if you use a spinlock on the ifq,
> :the whole forwarding could be choked).  Hopefully the ongoing multi queue
> :work could make the situation much better.
> :
> :Best Regards,
> :sephe
>
>    I wonder, would it make sense to make all ifq interactions within
>    the kernel multi-cpu (one per cpu) even if the particular hardware
>    does not have multiple queues?  It seems to me it would be fairly easy
>    to do using bsfl() on a cpumask_t to locate cpus with non-empty queues.

Yes, it is absolutely doable with the plain classic ifq.  IMHO, for
other types of ifq the situation is different.  Even if we could make
the internal mbuf queues per-cpu, most of the ifq's internal state
updating still needs some kind of protection; the quickest options
that come to mind are a spinlock or a serializer, but either one may
eliminate the benefit we would get from the per-cpu internal mbuf
queues.
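
For the plain classic ifq case, a minimal userspace sketch of the
per-cpu queue plus mask idea could look like the code below.  All the
names (pcpu_ifq, pcpu_enqueue, pcpu_dequeue, NCPUS, QDEPTH) are
hypothetical, packet ids stand in for mbufs, ffs() from <strings.h>
stands in for the kernel's bsfl(), and there is no locking at all, so
it only shows the bookkeeping, not the synchronization.

```c
/* Userspace sketch only; this is not DragonFly kernel code. */
#include <assert.h>
#include <stdint.h>
#include <strings.h>    /* ffs(), used here in place of the kernel's bsfl() */

#define NCPUS   4       /* assumed number of cpus */
#define QDEPTH  8       /* assumed depth of each per-cpu queue */

struct pcpu_q {
    int pkts[QDEPTH];   /* non-negative packet ids stand in for mbufs */
    int head;           /* index of the oldest packet */
    int count;          /* number of queued packets */
};

struct pcpu_ifq {
    struct pcpu_q q[NCPUS];
    uint32_t nonempty;  /* bit N set => q[N] holds at least one packet */
};

/* Queue a packet on the given cpu's queue; 0 on success, -1 if full. */
static int
pcpu_enqueue(struct pcpu_ifq *ifq, int cpu, int pkt)
{
    struct pcpu_q *q;

    assert(cpu >= 0 && cpu < NCPUS);
    q = &ifq->q[cpu];
    if (q->count == QDEPTH)
        return -1;
    q->pkts[(q->head + q->count) % QDEPTH] = pkt;
    q->count++;
    ifq->nonempty |= (uint32_t)1 << cpu;
    return 0;
}

/*
 * Take one packet from the lowest-numbered non-empty queue.
 * Returns the packet id, or -1 if every queue is empty.
 */
static int
pcpu_dequeue(struct pcpu_ifq *ifq)
{
    struct pcpu_q *q;
    int cpu, pkt;

    if (ifq->nonempty == 0)
        return -1;
    cpu = ffs((int)ifq->nonempty) - 1;  /* lowest set bit */
    q = &ifq->q[cpu];
    pkt = q->pkts[q->head];
    q->head = (q->head + 1) % QDEPTH;
    if (--q->count == 0)
        ifq->nonempty &= ~((uint32_t)1 << cpu);
    return pkt;
}
```

Always draining the lowest-numbered cpu first is just the simplest
policy for a sketch; a real implementation would probably want to
rotate the starting cpu so that no queue is starved.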

>
>    The kernel could then implement a protected entry point into the device
>    driver to run the queue(s).  On any given packet heading out the interface
>    the kernel would poll the spin lock and enter the device driver if it
>    is able to get it.  If the spin lock cannot be acquired the device
>    driver has already been entered into by another cpu and will pick up the
>    packet on the per-cpu queue.

Except for the per-cpu ifq, our current approach differs only slightly
from your description.  In the current implementation, when a CPU
enqueues a packet to the ifq, it first checks whether the NIC TX is
already started; if it is, then after enqueuing the packet the current
thread just keeps going.  If the NIC TX is not started, the current
CPU marks the NIC TX as started and tries to enter the NIC's
serializer.  If the try fails, which indicates contention with the
interrupt or polling code, the current CPU sends an IPI to the NIC's
interrupt handling CPU or the NIC's polling CPU.
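
To make the decision flow above concrete, here is a rough
single-threaded model of it.  The names (nic_model, tx_enqueue,
tx_ipi_handler) are made up for illustration and are not the kernel's,
the serializer is reduced to a flag, and the model assumes the
"started" mark is cleared once the start routine has drained the
queue, which the description above does not spell out.

```c
/* Single-threaded model of the enqueue/start decision; not kernel code. */
#include <assert.h>
#include <stdbool.h>

enum tx_action {
    TX_NONE,        /* TX already started, just enqueued */
    TX_RAN_START,   /* entered the serializer and ran the start routine */
    TX_SENT_IPI     /* serializer busy, IPI sent to the NIC's CPU */
};

struct nic_model {
    int  ifq_len;           /* packets waiting on the ifq */
    bool tx_started;        /* "NIC TX is started" mark */
    bool serializer_held;   /* held by interrupt or polling code */
    int  ipis_sent;
    int  start_calls;
};

/* Model of the start routine: assume it drains the whole ifq. */
static void
tx_start(struct nic_model *nic)
{
    nic->start_calls++;
    nic->ifq_len = 0;
    nic->tx_started = false;    /* assumption: cleared when drained */
}

/* What the enqueuing CPU does for one packet. */
static enum tx_action
tx_enqueue(struct nic_model *nic)
{
    bool was_started = nic->tx_started;

    nic->ifq_len++;
    if (was_started)
        return TX_NONE;         /* someone else will push it out */

    nic->tx_started = true;
    if (!nic->serializer_held) {    /* try-enter succeeded */
        tx_start(nic);
        return TX_RAN_START;
    }
    nic->ipis_sent++;           /* contention: hand off via IPI */
    return TX_SENT_IPI;
}

/* What the NIC's interrupt or polling CPU does when the IPI arrives. */
static void
tx_ipi_handler(struct nic_model *nic)
{
    assert(nic->tx_started);
    tx_start(nic);
}
```

Note that while the TX is marked as started, later enqueues on any CPU
return TX_NONE, so at most one IPI is outstanding per start.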

So, besides the cost of the ifq serialization, the IPI sending I
mentioned above also has some cost under certain workloads.  I
originally planned the following to avoid the ifq serializer cost and
amortize the IPI sending cost:
We add a small mbuf queue (say 32 or 64 mbufs at most) per-cpu
per-ifq; the transmit thread just enqueues packets to this queue, and
once the queue overflows or the current thread is about to sleep, an
IPI is sent to the NIC's interrupt CPU or its polling CPU.  The ifq
enqueuing happens there, as does the call to if_start().  This may
also help when the # of hardware TX queues < ncpus2.  However, it has
one requirement: all network output must happen in the network threads
(I have to say this is also one of the major reasons I put the main
parts of the TCP callouts into the TCP threads).  Well, it is a vague
idea currently ...
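
Just to illustrate the batching, a minimal sketch of such a staging
queue might look like this.  Everything in it is hypothetical
(stage_q, tx_target, stage_enqueue, stage_flush, and the choice of 32
for STAGE_MAX); the IPI and the work done on the NIC's CPU are modeled
as plain counters, and packet ids stand in for mbufs.

```c
/* Sketch of a per-cpu, per-ifq staging queue; not kernel code. */
#include <assert.h>

#define STAGE_MAX 32    /* assumed limit, picked from "32 or 64" */

struct stage_q {
    int pkts[STAGE_MAX];    /* packet ids stand in for mbufs */
    int count;
};

/* Stands in for the NIC's interrupt CPU or polling CPU. */
struct tx_target {
    int ifq_len;        /* packets moved onto the real ifq */
    int ipis;           /* IPIs received */
    int start_calls;    /* times if_start() would have been called */
};

/*
 * Hand the whole batch to the target CPU with a single IPI.  The
 * transmit thread would also call this right before going to sleep.
 */
static void
stage_flush(struct stage_q *sq, struct tx_target *t)
{
    if (sq->count == 0)
        return;                 /* nothing staged, no IPI needed */
    t->ipis++;                  /* one IPI per batch, not per packet */
    t->ifq_len += sq->count;    /* ifq enqueuing happens on the target */
    t->start_calls++;           /* followed by if_start() there */
    sq->count = 0;
}

/* Stage one packet; flush first if the staging queue would overflow. */
static void
stage_enqueue(struct stage_q *sq, struct tx_target *t, int pkt)
{
    if (sq->count == STAGE_MAX)
        stage_flush(sq, t);
    assert(sq->count < STAGE_MAX);
    sq->pkts[sq->count++] = pkt;
}
```

With STAGE_MAX at 32, a thread that sends 33 packets back to back and
then sleeps costs two IPIs in this model instead of up to 33.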

Best Regards,
sephe

--
Live Free or Die