altq spinlock and if_start dispatch

Sepherosa Ziehau sepherosa at gmail.com
Sat Apr 12 07:47:16 PDT 2008


Hi all,

In an experiment I conducted about a week ago, I found that ifnet
serializer contention (between if_output and the NIC's txeof) had a
negative effect on network forwarding performance, so I created the
following patch:
http://leaf.dragonflybsd.org/~sephe/ifq_spinlock.diff3

The ideas behind this patch are:
1) altq is protected by its own spinlock instead of ifnet's serializer.
2) ifnet's serializer is pushed down into each ifnet.if_output
implementation, i.e. if_output is called without ifnet's serializer
being held
3) ifnet.if_start is dispatched to the CPU where NIC's interrupt is
handled or where polling(4) is going to happen
4) ifnet.if_start's ipi sending is avoided as much as possible, using
the same mechanism the ipiq implementation uses to avoid sending real
ipis.  (A sketch of how '1)', '3)' and '4)' fit together follows this
list.)
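
Here is a minimal sketch of how '1)', '3)' and '4)' are intended to fit
together.  The ifaltq fields and helpers used here (ifq_spin, ifq_started,
ifq_cpuid, if_start_ipi, ifq_handoff_sketch) are made up for illustration
and are not the patch's actual names; only the serializer, spinlock, ipiq
and ifq_enqueue() calls are existing kernel interfaces:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/serialize.h>
#include <sys/spinlock2.h>
#include <sys/thread.h>
#include <sys/thread2.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/ifq_var.h>

/* Runs on the NIC's interrupt/polling CPU, either directly or via ipi. */
static void
if_start_ipi(void *arg)
{
	struct ifnet *ifp = arg;
	struct ifaltq *ifq = &ifp->if_snd;

	/* 2) the serializer is now taken around if_start itself */
	lwkt_serialize_enter(ifp->if_serializer);
	ifq->ifq_started = 0;	/* hypothetical flag: allow rescheduling */
	ifp->if_start(ifp);
	lwkt_serialize_exit(ifp->if_serializer);
}

static int
ifq_handoff_sketch(struct ifnet *ifp, struct mbuf *m, struct altq_pktattr *pa)
{
	struct ifaltq *ifq = &ifp->if_snd;
	int error, need_start = 0;

	/* 1) if_snd/altq is protected by its own spinlock, not the serializer */
	spin_lock_wr(&ifq->ifq_spin);	/* hypothetical spinlock field */
	error = ifq_enqueue(ifq, m, pa);
	if (error == 0 && !ifq->ifq_started) {
		/*
		 * 4) only the first enqueuer schedules if_start; later
		 * enqueuers see ifq_started set and skip the (possible) ipi.
		 */
		ifq->ifq_started = 1;
		need_start = 1;
	}
	spin_unlock_wr(&ifq->ifq_spin);

	if (need_start) {
		if (ifq->ifq_cpuid == mycpuid) {
			/* already on the NIC's interrupt/polling CPU */
			if_start_ipi(ifp);
		} else {
			/* 3) dispatch if_start to the NIC's CPU */
			lwkt_send_ipiq(globaldata_find(ifq->ifq_cpuid),
				       if_start_ipi, ifp);
		}
	}
	return error;
}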

I considered dispatching the outgoing mbuf to the NIC's interrupt/polling
CPU to do both the enqueue and if_start, thus avoiding the spinlock, but
upper layers, such as TCP, process the ifq_handoff() error (e.g. ENOBUFS);
dispatching the outgoing mbuf would break the original semantics.  However,
the only source of error in ifq_handoff() is ifq_enqueue(), i.e. only
ifq_enqueue() must be called directly on the output path.  A spinlock is
chosen to protect ifnet.if_snd, so ifq_enqueue() and ifnet.if_start() can
be decoupled.  One implication of '1)' is that the
ifq_poll->driver_encap->ifq_dequeue pattern no longer works, but I think
drivers can easily be converted to do ifq_dequeue+driver_encap without
packet loss (see the sketch below).  '1)' is the precondition for making
'2)' and '3)' work.  '2)' and '3)' together avoid ifnet serializer
contention.  '4)' is added because another experiment of mine showed that
serious ipiq overflow can have a very bad impact on overall performance.
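
Below is a sketch of that ifq_dequeue+driver_encap conversion for a
made-up driver "xx".  struct xx_softc, xx_encap() and the descriptor
accounting (xx_tx_free, XX_MAX_SCATTER) are hypothetical; only
ifq_is_empty()/ifq_dequeue(), if_snd and IFF_OACTIVE are existing
interfaces.  The idea is that reserving worst-case TX descriptors before
dequeueing keeps the packet on if_snd when the ring is full, so dropping
ifq_poll() does not lose packets:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/ifq_var.h>

#define XX_MAX_SCATTER	16	/* worst-case TX descriptors per packet */

struct xx_softc {
	struct ifnet	*xx_ifp;
	int		xx_tx_free;	/* free TX descriptors */
};

static int	xx_encap(struct xx_softc *, struct mbuf **);

static void
xx_start(struct ifnet *ifp)
{
	struct xx_softc *sc = ifp->if_softc;
	struct mbuf *m;

	while (!ifq_is_empty(&ifp->if_snd)) {
		/*
		 * Reserve worst-case descriptors *before* dequeueing; if
		 * the ring is too full, the packet simply stays on if_snd
		 * and is picked up by the next txeof -> if_start round.
		 */
		if (sc->xx_tx_free < XX_MAX_SCATTER) {
			ifp->if_flags |= IFF_OACTIVE;
			break;
		}

		m = ifq_dequeue(&ifp->if_snd, NULL);
		if (m == NULL)
			break;

		/*
		 * Encap failures (e.g. defrag failure) drop the packet,
		 * which is no different from the old poll/dequeue scheme.
		 */
		if (xx_encap(sc, &m) != 0)
			ifp->if_oerrors++;
	}
}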

I have gathered the following data before and after the patch:
http://leaf.dragonflybsd.org/~sephe/data.txt
http://leaf.dragonflybsd.org/~sephe/data_patched.txt

boxa --+----> em0 -Routing box- em1 ----> msk0
boxb --+

boxa (2 x PIII 1GHz) and boxb (AthlonXP 2100+) each have one 32-bit PCI
82540 em.  The fastest stream I could generate from these two boxes
using pktgen is at ~720Kpps (~370Kpps on boxb and ~350Kpps on boxa).
The Routing box (it has some problem with ppc, which generates
interrupts at ~4000/s :P) is an AthlonX2 3600+ with a 1000PT; it has no
problem outputting at 1400Kpps on a single interface using pktgen.
msk0 has monitor mode turned on and has no problem accepting a stream
at 1400Kpps.
FF -- fast forwarding
"stream target cpu1" -- stream generated to be dispatched to CPU1 on Routing box
"stream target cpu0" -- stream generated to be dispatched to CPU0 on Routing box
"stream target cpu0/cpu1" -- stream generated to be evenly dispatched
to CPU0 and CPU1 on Routing box
The stats are generated by:
netstat -w 1 -I msk0

Fast forwarding is improved a lot in the BGL case, probably because the
time consumed on the input path is greatly reduced by if_start
dispatching.
This patch does introduce a regression in the MP safe case when em0/em1
is on a different CPU than the packet's target CPU on the Routing box;
this may be caused by ipi bouncing between the two CPUs (I haven't
found the source of the problem yet).
Fast forwarding performance drops a lot in the MP safe case if em0 and
em1 are on different CPUs; reducing serializer contention does help
(~40Kpps improvement).  Something still needs to be figured out.

So please review the patch.  It is not finished yet, but the major
part has been done and I want to call for review before I stray too
far.

Best Regards,
sephe

-- 
Live Free or Die




