IP forwarding performance, more improvement (git.8eb1b0, normal 4.61Mpps, fast 5.67Mpps)

Sat Feb 16 22:52:37 PST 2013

Hi all,

After the ifnet/ifaddr per-CPU stats work (i.e. cache pollution is
avoided), IP forwarding performance improves again!
Here is the performance I currently get as of git 8eb1b0.

Per-CPU stats give me:
+210Kpps for 2 bidirectional normal IP forwarding (now 4.61Mpps)
+440Kpps for 2 bidirectional fast IP forwarding (now 5.67Mpps)

For fast IP forwarding, we are _not_ that far away from maxing out the 4
GigE interfaces (which is 5.95Mpps).
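
The 5.95Mpps ceiling can be reproduced with a short sketch. It assumes
the 18-byte UDP datagrams travel over IPv4 without options as
minimum-size 64-byte Ethernet frames, and that the standard preamble and
inter-frame gap are counted on the wire. The constant and function names
are illustrative only and are not part of pktgen or any other tool used
in the measurement.

```python
# Theoretical packet rate for the tiny-packet test on GigE ports.
# Assumption: IPv4 without options, standard Ethernet framing overhead.
LINK_BPS = 1_000_000_000   # 1 Gbit/s per port
UDP_PAYLOAD = 18           # bytes, the datagram size used in the test
UDP_HDR, IP_HDR = 8, 20    # UDP header, IPv4 header (no options)
ETH_HDR, ETH_FCS = 14, 4   # Ethernet header and frame check sequence
PREAMBLE_SFD, IFG = 8, 12  # preamble + start delimiter, inter-frame gap


def line_rate_pps(ports=1):
    """Return the maximum packets per second across `ports` GigE ports."""
    frame = ETH_HDR + IP_HDR + UDP_HDR + UDP_PAYLOAD + ETH_FCS  # 64 bytes
    wire_bits = (frame + PREAMBLE_SFD + IFG) * 8                # 672 bits
    return ports * LINK_BPS / wire_bits


max_pps = line_rate_pps(ports=4)
print(f"{max_pps / 1e6:.2f}Mpps")   # 5.95Mpps
print(f"{5.67e6 / max_pps:.1%}")    # 95.3%
```

Under these assumptions, the measured 5.67Mpps is roughly 95% of what the
4 ports can carry.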

The detailed information is the same as in the last measurement.

Best Regards,
sephe

On Fri, Feb 1, 2013 at 5:53 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
> Hi all,
>
> Now the multiple TX queue support is finished in master (generic layer
> is done, igb(4) is converted).
> Here is the performance I currently get as of git 2fb36fa.
>
> Quick summary, the multiple TX queue support gives me:
> +200Kpps for 2 bidirectional normal IP forwarding (now 4.40Mpps)
> +160Kpps for 2 bidirectional fast IP forwarding (now 5.23Mpps)
>
> During the performance measurement, the system is very responsive.
>
> For detailed information, please read the inline comments below.
>
> On Thu, Dec 27, 2012 at 4:42 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
>> Hi all,
>>
>> Before I move on to the next big ticket (multiple-tx queue support),
>> here is the performance I currently get as of git 2aa7f7f.
>>
>> Quick summary, the IFQ packet staging mechanism gives me:
>> +80Kpps for 2 bidirectional normal IP forwarding (now 4.20Mpps)
>> +30Kpps for 2 bidirectional fast forwarding (now 5.07Mpps)
>>
>> For detailed information, please read the inline comments below.
>>
>> On Thu, Dec 20, 2012 at 3:03 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
>>> On Fri, Dec 14, 2012 at 5:47 PM, Sepherosa Ziehau <sepherosa at gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> This email serves as the base performance measurement for further
>>>> network stack optimization (as of git 107282b).
>>>
>>> Since bidirectional fast IP forwarding already maxes out the GigE
>>> limit, I increased the measurement load a bit.  The new measurement
>>> is against git 7e1fbcf.
>>>
>>>>
>>>>
>>>> The hardware:
>>>> mobo ASUS P867H-M
>>>> 4x4G DDR3 memory
>>>> CPU i7-2600 (w/ HT and Turbo Boost enabled, 4C/8T)
>>>> Forwarding NIC Intel 82576EB dual copper
>>>
>>> The forwarding NIC is now changed to 82580EB quad copper.
>>>
>>>> Packet generator NICs Intel 82571EB dual copper
>>>>
>>>>
>>>> A emx1 <---> igb0 forwarder igb1 <---> emx1 B
>>>
>>> The testing topology is changed to the following configuration:
>>> +---+                 +-----------+                 +---+
>>> |   | emx1 <---> igb0 |           | igb1 <---> emx1 |   |
>>> | A |                 | forwarder |                 | B |
>>> |   | emx2 <---> igb2 |           | igb3 <---> emx2 |   |
>>> +---+                 +-----------+                 +---+
>>>
>>> Streams:
>>> A.emx1 <---> B.emx1 (bidirectional)
>>> A.emx2 <---> B.emx2 (bidirectional)
>>>
>>>>
>>>> A and "forwarder", B and "forwarder" are directly connected using CAT6 cables.
>>>> Polling(4) is enabled on igb1 and igb0 on "forwarder".  The following
>>>> tunables are in /boot/loader.conf:
>>>> kern.ipc.nmbclusters="524288"
>>>> net.ifpoll.user_frac="10"
>>>> net.ifpoll.status_frac="1000"
>>
>> net.link.ifq_stage_cntmax="8"
>>
>>>> The following sysctl is changed before putting igb1 into polling mode:
>>>> sysctl hw.igb1.npoll_txoff=4
>>>
>>> sysctl hw.igb1.npoll_txoff=1
>>> sysctl hw.igb2.npoll_txoff=2
>>> sysctl hw.igb3.npoll_txoff=3
>
> The above sysctls are no longer needed, since all 8 hardware TX queues
> are enabled.  The CPUID offset is always 0 (the i7-2600 has 8 hardware
> threads).
>
>>
>> sysctl hw.igb0.tx_wreg_nsegs=16
>> sysctl hw.igb1.tx_wreg_nsegs=16
>> sysctl hw.igb2.tx_wreg_nsegs=16
>> sysctl hw.igb3.tx_wreg_nsegs=16
>>
>>>
>>>>
>>>>
>>>> First, for users who are only interested in bulk forwarding
>>>> performance:  32 netperf TCP_STREAMs running on A could do
>>>> 941Mbps.
>>>>
>>>>
>>>> Now the tiny packet forwarding performance:
>>>>
>>>> A and B generate 18-byte UDP datagrams using
>>>> tools/tools/netrate/pktgen.  The destination addresses of the UDP
>>>> datagrams are selected so that the generated UDP datagrams will be
>>>> evenly distributed to the 8 RX queues, which should be common in a
>>>> production environment.
>>>>
>>>> Bidirectional normal IP forwarding:
>>>> 1.42Mpps in each direction, so total 2.84Mpps are forwarded.
>>>> CPU usage:
>>>> On CPUs that are doing TX in addition to RX: 85% ~ 90% (max allowed by
>>>> polling's user_frac)
>>>> On CPUs that are only doing RX: 40% ~ 50%
>>>
>>> Two sets of bidirectional normal IP forwarding:
>>> 1.03Mpps in each direction, so total 4.12Mpps are forwarded.
>>
>> 1.05+Mpps in each direction, so total 4.20Mpps are forwarded.
>
> 1.10+Mpps in each direction, so total 4.40Mpps are forwarded.
>
>>
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 70% ~ 80%
>>
>> Not much improvement on CPU usage.
>
> All CPUs now do RX and TX; the CPU usage is 90% (max allowed by
> polling's user_frac).
>
>>
>>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>>
>> IPI rate on CPUs that are doing TX in addition to RX: ~4.5K/s
>
> No more cross-CPU IPIs; packet processing is now completely CPU localized.
>
>>
>>>
>>>>
>>>> Bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>>> 1.48Mpps in each direction, so total 2.96Mpps are forwarded.
>>>> CPU usage:
>>>> On CPUs that are doing TX in addition to RX: 65% ~ 70%
>>>> On CPUs that are doing RX: 30% ~ 40%
>>>
>>> Two sets of bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>> 1.26Mpps in each direction, so total 5.04Mpps are forwarded.
>>
>> ~1.27Mpps in each direction, so total 5.07Mpps are forwarded.
>
> ~1.31Mpps in each direction, so total 5.23Mpps are forwarded.
>
>>
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 60% ~ 70%
>>
>> Not much improvement on CPU usage.
>
> All CPUs now do RX and TX; the CPU usage is 90% (max allowed by
> polling's user_frac).
>
>>
>>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>>
>> IPI rate on CPUs that are doing TX in addition to RX: ~5K/s
>
> No more cross-CPU IPIs; packet processing is now completely CPU localized.
>
>
>
> Best Regards,
> sephe
>
> --
> Tomorrow Will Never Die

--
Tomorrow Will Never Die