SO_REUSEPORT and accept(2) performance
Sepherosa Ziehau
sepherosa at gmail.com
Tue Jul 23 23:38:37 PDT 2013
Hi all,
A brief summary (as of commit 9a1272865c3cb6e079e1554031dcc712a881598b):
With nonblocking accept(2)+kqueue(2) w/ SO_REUSEPORT, we are doing 335Kconns/s.
With nonblocking accept(2)+kqueue(2) w/o SO_REUSEPORT, we are doing 100Kconns/s.
On DragonFly, in addition to better load balancing, SO_REUSEPORT also
gives us a ~200% performance boost!
Since the testing is on a 1000Mbps network, nonblocking accept(2) w/
SO_REUSEPORT has maxed out the network device's input path: ~1.35Mpps
according to netstat -I output, which matches the theoretical value for
a 1000Mbps network (the input consists of a 78B SYN, a 66B ACK, a 66B
FIN and a 66B ACK packet for each connection).
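Back-of-the-envelope check (assuming the sizes above exclude the 4B FCS):
    78B + 66B + 66B + 66B                 =  276B of frames per connection
    276B + 4 * 24B (FCS + preamble + IFG) =  372B ~= 2980 bits on the wire
    1Gbps / 2980 bits                     ~= 336Kconns/s, i.e. ~1.34Mpps
which is right about where the measured 335Kconns/s / ~1.35Mpps sits.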
The testing server hardware:
CPU: Intel i7-2600 w/ Hyper-Threading enabled (8 HTs)
NIC: Broadcom 5719 (4 RX queues and 4 TX queues, using MSI-X)
The testing server software is:
tools/tools/netrate/accept_connect/kq_accept_server
kq_accept_server -p 5000 -i 8 [-r]
(8 user space processes accept connections, -r turns on SO_REUSEPORT)
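For reference, the usage model of the -r case is roughly the following
(a minimal sketch, not the actual kq_accept_server source; error
handling omitted): each process creates its own listening socket, sets
SO_REUSEPORT, binds to the same port, and then drains its own socket
with nonblocking accept(2) driven by kqueue(2).

/*
 * Minimal sketch of the SO_REUSEPORT + kqueue(2) accept model
 * (per accepting process).  Not the real kq_accept_server.
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void
serve(int port)
{
	struct sockaddr_in in;
	struct kevent ev;
	int s, kq, on = 1;

	s = socket(AF_INET, SOCK_STREAM, 0);

	/* -r: every process owns its own listen socket on the same port. */
	setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

	memset(&in, 0, sizeof(in));
	in.sin_family = AF_INET;
	in.sin_addr.s_addr = INADDR_ANY;
	in.sin_port = htons(port);
	bind(s, (struct sockaddr *)&in, sizeof(in));
	listen(s, 1024);

	/* Nonblocking accept(2), readiness reported by kqueue(2). */
	fcntl(s, F_SETFL, fcntl(s, F_GETFL, 0) | O_NONBLOCK);
	kq = kqueue();
	EV_SET(&ev, s, EVFILT_READ, EV_ADD, 0, 0, NULL);
	kevent(kq, &ev, 1, NULL, 0, NULL);

	for (;;) {
		struct kevent kev;
		int fd;

		if (kevent(kq, NULL, 0, &kev, 1, NULL) <= 0)
			continue;
		/* Drain this listen socket's completion queue. */
		while ((fd = accept(s, NULL, NULL)) >= 0)
			close(fd);
	}
}

int
main(void)
{
	int i;

	/* -i 8: fork 8 accepting processes. */
	for (i = 0; i < 8; ++i) {
		if (fork() == 0) {
			serve(5000);
			_exit(0);
		}
	}
	for (;;)
		pause();
	/* NOTREACHED */
}

The key point is that the listen sockets are independent, so each
process drains its own completion queue instead of all 8 processes
fighting over one.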
The testing client software is:
tools/tools/netrate/accept_connect/connect_client
connect_client -p 5000 -4 10.0.0.49 -i 64
(64 user space processes do the connect)
route change -net 10.0.0.0/24 -msl 10
sysctl net.inet.ip.portrange.last=40000
(these two settings make sure that the clients won't run out of local
ports: lowering the per-route MSL lets TIME_WAIT connections expire
quickly, and raising net.inet.ip.portrange.last enlarges the ephemeral
port range)
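The client side is conceptually just a connect(2)/close(2) loop; each of
the 64 processes does something like the following (again only a rough
sketch, not the actual connect_client source):

/* Rough sketch of the per-process connect/close loop on the clients. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

static void
connect_loop(const char *host, int port)
{
	struct sockaddr_in in;

	memset(&in, 0, sizeof(in));
	in.sin_family = AF_INET;
	in.sin_port = htons(port);
	in.sin_addr.s_addr = inet_addr(host);

	for (;;) {
		int s = socket(AF_INET, SOCK_STREAM, 0);

		/*
		 * Each connection holds a local port in TIME_WAIT after
		 * close(2); the -msl/portrange settings above keep the
		 * client from running out of ports.
		 */
		connect(s, (struct sockaddr *)&in, sizeof(in));
		close(s);
	}
}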
The network configuration:
                              +---------+
                  |+--- emx   | client1 |
                  ||          +---------+
                  ||
                  ||          +---------+
                  |+--- emx   | client2 |
+--------+        ||          +---------+
| server | bnx ---+|
+--------+        ||          +---------+
                  |+--- emx   | client3 |
                  ||          +---------+
                  ||
                  ||          +---------+
                  |+--- bce   | client4 |
                              +---------+
"client1"~"client4" run the testing client software simultaneously as
shown above, mainly to generate enough traffic. "server" runs the
testing server software.
Statistics:
w/ SO_REUSEPORT
nonblocking accept(2) rate: 335Kconns/s
NIC interrupt rate: 6000/s on the first 4 HT
CPU idle time on HTs processing interrupt: ~15%
CPU idle time on HTs not processing interrupt: ~20%
Token contention rate: ~500/s (mostly TCP listen completion queue pool
token and TCP porthash token)
This shows w/ SO_REUSEPORT:
- We still have CPU time to process more connections.
- There is only minor TCP listen completion queue contention, and it
probably could be further reduced by binding each process to a specific
CPU (see the sketch below).
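For instance (a sketch only, from memory of DragonFly's usched_set(2)
interface, so treat the exact call as an assumption), each accepting
process could pin itself to one HT before creating its listen socket:

/*
 * Sketch: pin the calling process to one CPU via usched_set(2) before
 * it creates its SO_REUSEPORT listen socket, so the process accepting
 * from a given listen completion queue stays on one CPU.
 */
#include <sys/types.h>
#include <sys/usched.h>

static int
pin_to_cpu(int cpuid)
{
	/* pid 0 means the calling process. */
	return usched_set(0, USCHED_SET_CPU, &cpuid, sizeof(cpuid));
}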
w/o SO_REUSEPORT
nonblocking accept(2) rate: 100Kconns/s
NIC interrupt rate: 6000/s on the first 4 HT
CPU idle time on HTs processing interrupt: ~5% - 70%
CPU idle time on HTs not processing interrupt: ~10% - 80%
Token contention rate: ~20K/s - 600K/s (mostly TCP listen completion
queue pool token)
This shows w/o SO_REUSEPORT:
- TCP listen completion queue contention is obviously too high, i.e.
we are facing a scaling problem with this TCP listen socket usage model.
Best Regards,
sephe
--
Tomorrow Will Never Die