IP forwarding performance (git.2fb36fa, normal 4.40Mpps, fast 5.23Mpps)

Fri Feb 1 01:53:02 PST 2013

Hi all,

Multiple TX queue support is now finished in master (the generic layer
is done and igb(4) is converted).
Here is the performance I currently get as of git 2fb36fa.

Quick summary, the multiple TX queue support gives me:
+200Kpps for two sets of bidirectional normal IP forwarding (now 4.40Mpps)
+160Kpps for two sets of bidirectional fast IP forwarding (now 5.23Mpps)
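
As a quick cross-check of the summary, the gains can be recomputed from
the totals reported in this thread (4.12/4.20/4.40Mpps for normal
forwarding, 5.04/5.07/5.23Mpps for fast forwarding).  A minimal Python
sketch; the dictionary layout and the helper name are only for
illustration:

```python
# Totals reported in this thread, in Kpps, keyed by the git revision measured.
# Integers are used to avoid floating-point noise in the differences.
totals_kpps = {
    "normal": [("7e1fbcf", 4120), ("2aa7f7f", 4200), ("2fb36fa", 4400)],
    "fast":   [("7e1fbcf", 5040), ("2aa7f7f", 5070), ("2fb36fa", 5230)],
}

def gains(series):
    """Return (revision, gain in Kpps, gain in percent) for each step."""
    out = []
    for (_, prev), (rev, cur) in zip(series, series[1:]):
        out.append((rev, cur - prev, round(100.0 * (cur - prev) / prev, 2)))
    return out

for mode, series in totals_kpps.items():
    for rev, delta, pct in gains(series):
        print(f"{mode:6s} {rev}: +{delta}Kpps ({pct}%)")
```

The last step of each series is the multiple TX queue gain quoted above.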

During the performance measurement, the system is very responsive.

For details, please read the inline comments below.

On Thu, Dec 27, 2012 at 4:42 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
> Hi all,
>
> Before I move on to the next big ticket (multiple-tx queue support),
> here is the performance I currently got as of git 2aa7f7f.
>
> Quick summary, the IFQ packets staging mechanism gives me:
> +80Kpps for 2 bidirectional normal IP forwarding (now 4.20Mpps)
> +30Kpps for 2 bidirectional fast forwarding (now 5.07Mpps)
>
> For details, please read the inline comments below.
>
> On Thu, Dec 20, 2012 at 3:03 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
>> On Fri, Dec 14, 2012 at 5:47 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
>>> Hi all,
>>>
>>> This email serves as the base performance measurement for further
>>> network stack optimization (as of git 107282b).
>>
>> Since bidirectional fast IP forwarding already maxes out the GigE
>> limit, I increased the measurement load a bit.  The new measurement
>> is against git 7e1fbcf.
>>
>>>
>>>
>>> The hardware:
>>> mobo ASUS P867H-M
>>> 4x4G DDR3 memory
>>> CPU i7-2600 (w/ HT and Turbo Boost enabled, 4C/8T)
>>> Forwarding NIC Intel 82576EB dual copper
>>
>> The forwarding NIC is now changed to 82580EB quad copper.
>>
>>> Packet generator NICs Intel 82571EB dual copper
>>>
>>>
>>> A emx1 <---> igb0 forwarder igb1 <---> emx1 B
>>
>> The testing topology is changed to the following configuration:
>> +---+                 +-----------+                 +---+
>> |   | emx1 <---> igb0 |           | igb1 <---> emx1 |   |
>> | A |                 | forwarder |                 | B |
>> |   | emx2 <---> igb2 |           | igb3 <---> emx2 |   |
>> +---+                 +-----------+                 +---+
>>
>> Streams:
>> A.emx1 <---> B.emx1 (bidirectional)
>> A.emx2 <---> B.emx2 (bidirectional)
>>
>>>
>>> A and "forwarder", B and "forwarder" are directly connected using CAT6 cables.
>>> Polling(4) is enabled on igb1 and igb0 on "forwarder".  The following
>>> tunables are in /boot/loader.conf:
>>> kern.ipc.nmbclusters="524288"
>>> net.ifpoll.user_frac="10"
>>> net.ifpoll.status_frac="1000"
>
> net.link.ifq_stage_cntmax="8"
>
>>> The following sysctl is changed before putting igb1 into polling mode:
>>> sysctl hw.igb1.npoll_txoff=4
>>
>> sysctl hw.igb1.npoll_txoff=1
>> sysctl hw.igb2.npoll_txoff=2
>> sysctl hw.igb3.npoll_txoff=3

The npoll_txoff sysctls above are no longer needed, since all 8 hardware
TX queues are enabled.  The CPUID offset is always 0 (the i7-2600 has 8
hardware threads).
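
To illustrate why the offsets used to matter, here is a simplified model
in Python.  It is an assumption made for illustration, not the kernel's
actual code: TX queue q of a NIC is taken to be polled on CPU
(txoff + q) % ncpus.  igb0's offset of 0 in the old setup is also
assumed, since it was never set explicitly in this thread.

```python
NCPUS = 8  # i7-2600: 4 cores / 8 hardware threads

def tx_cpus(txoff, ntxq, ncpus=NCPUS):
    """CPUs that poll a NIC's TX queues under the simplified model."""
    return [(txoff + q) % ncpus for q in range(ntxq)]

# Old setup: one TX queue per NIC, spread out by hand via npoll_txoff.
old = {"igb0": tx_cpus(0, 1), "igb1": tx_cpus(1, 1),
       "igb2": tx_cpus(2, 1), "igb3": tx_cpus(3, 1)}

# New setup: 8 TX queues per NIC, offset 0 everywhere.
new = {nic: tx_cpus(0, 8) for nic in old}

old_tx = sorted({c for cpus in old.values() for c in cpus})
new_tx = sorted({c for cpus in new.values() for c in cpus})
print("old: CPUs doing TX:", old_tx)
print("new: CPUs doing TX:", new_tx)
```

Under this model the old setup leaves half of the CPUs doing RX only,
while the new setup has every CPU doing both RX and TX.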

>
> sysctl hw.igb0.tx_wreg_nsegs=16
> sysctl hw.igb1.tx_wreg_nsegs=16
> sysctl hw.igb2.tx_wreg_nsegs=16
> sysctl hw.igb3.tx_wreg_nsegs=16
>
>>
>>>
>>>
>>> First for the users that are only interested in the bulk forwarding
>>> performance:  The 32 netperf TCP_STREAMs running on A could do
>>> 941Mbps.
>>>
>>>
>>> Now the tiny-packet forwarding performance:
>>>
>>> A and B generate 18-byte UDP datagrams using
>>> tools/tools/netrate/pktgen.  The destination addresses of the UDP
>>> datagrams are selected so that the generated datagrams are evenly
>>> distributed to the 8 RX queues, which should be common in a
>>> production environment.
>>>
>>> Bidirectional normal IP forwarding:
>>> 1.42Mpps in each direction, so total 2.84Mpps are forwarded.
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 85% ~ 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 40% ~ 50%
>>
>> Two sets of bidirectional normal IP forwarding:
>> 1.03Mpps in each direction, so total 4.12Mpps are forwarded.
>
> 1.05+Mpps in each direction, so total 4.20Mpps are forwarded.

1.10+Mpps in each direction, so total 4.40Mpps are forwarded.

>
>> CPU usage:
>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>> polling's user_frac)
>> On CPUs that are only doing RX: 70% ~ 80%
>
> Not much improvement on CPU usage.

All CPUs now do both RX and TX; the CPU usage is 90% (max allowed by
polling's user_frac).

>
>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>
> IPI rate on CPUs that are doing TX in addition to RX: ~4.5K/s

No more cross-CPU IPIs; packet processing is now completely CPU-localized.

>
>>
>>>
>>> Bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>> 1.48Mpps in each direction, so total 2.96Mpps are forwarded.
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 65% ~ 70%
>>> On CPUs that are only doing RX: 30% ~ 40%
>>
>> Two sets of bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>> 1.26Mpps in each direction, so total 5.04Mpps are forwarded.
>
> ~1.27Mpps in each direction, so total 5.07Mpps are forwarded.

~1.31Mpps in each direction, so total 5.23Mpps are forwarded.
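
For context on how close this is to wire speed, the GigE ceiling for
these packets can be worked out by hand.  The Python sketch below
assumes that "18 bytes" is the UDP payload size, which makes each frame
exactly the 64-byte Ethernet minimum; the header, preamble and
inter-frame gap sizes are the standard Ethernet/IPv4/UDP values.

```python
# Bytes on the wire for one 18-byte UDP datagram (assumed to be payload size).
UDP_PAYLOAD = 18
UDP_HDR, IP_HDR = 8, 20
ETH_HDR, ETH_FCS = 14, 4
PREAMBLE_SFD, IFG = 8, 12

frame = UDP_PAYLOAD + UDP_HDR + IP_HDR + ETH_HDR + ETH_FCS  # Ethernet frame size
wire_bits = (frame + PREAMBLE_SFD + IFG) * 8                # bits per packet on the wire

line_rate_pps = 1_000_000_000 // wire_bits  # per GigE link, per direction
directions = 4                              # two bidirectional streams
ceiling_pps = line_rate_pps * directions

measured_pps = 5_230_000                    # fast forwarding total from this mail
print(f"frame size: {frame} bytes")
print(f"line rate:  {line_rate_pps} pps per direction")
print(f"ceiling:    {ceiling_pps} pps, "
      f"measured is {100 * measured_pps / ceiling_pps:.1f}% of it")
```

The per-direction line rate is also why the original single-stream fast
forwarding test, at 1.48Mpps in each direction, had already hit the
GigE limit.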

>
>> CPU usage:
>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>> polling's user_frac)
>> On CPUs that are only doing RX: 60% ~ 70%
>
> Not much improvement on CPU usage.

All CPUs now do both RX and TX; the CPU usage is 90% (max allowed by
polling's user_frac).

>
>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>
> IPI rate on CPUs that are doing TX in addition to RX: ~5K/s

No more cross-CPU IPIs; packet processing is now completely CPU-localized.

Best Regards,
sephe

--
Tomorrow Will Never Die