git: socket: Extend SO_REUSEPORT to distribute workload to available sockets
Sepherosa Ziehau
sephe at crater.dragonflybsd.org
Tue May 21 23:09:44 PDT 2013
commit 740d1d9f7b7bf9c9c021abb8197718d7a2d441c9
Author: Sepherosa Ziehau <sephe at dragonflybsd.org>
Date: Mon May 13 21:48:10 2013 +0800
socket: Extend SO_REUSEPORT to distribute workload to available sockets
The idea is from Linux's recently added SO_REUSEPORT support from Google:
https://lwn.net/Articles/542629/
(thanks to aggelos@ for pointing it out to me)
In DragonFly, SO_REUSEPORT is already supported. However, the original
support only allows the first wildcard-address-bound socket or the last
non-wildcard-address-bound socket to receive input, e.g. accept(2) on a
TCP socket or datagram reception on a UDP socket; the rest of the sockets
bound to the same port will _not_ get any input.
In this commit, we extend SO_REUSEPORT to allow all sockets bound to the
same address and same port to receive input based on the input packet's
hash, so the workload, e.g. accept(2) or datagram reception, could be
evenly distributed among different sockets (imagine each socket is
handled by one process/thread). This extension can also reduce the
user-space contention on a TCP listen socket's so_comp or a UDP socket's
so_rcv, compared with the traditional and commonly used single-socket
model.
The implementation details:
- Introduce inp_localgroup, which groups inpcbs bound to the same address
and same port.
- Add inp_localgroup hash table to inpcbinfo. This hash table is
allocated only for protocols supporting SO_REUSEPORT extension.
Currently only TCP and UDP support SO_REUSEPORT extension.
- When inpcb is inserted into inpcbinfo wildcard hash table, it is also
inserted into the corresponding inp_localgroup.
- Before locating inpcb from inpcbinfo wildcard hash table, we check
inpcbinfo's inp_localgroup hash table first. If there is a matching
inp_localgroup, packet hash will be used to pick one of the inpcbs from
the inp_localgroup, and this inpcb will be used for further processing
on this packet. The packet hash's low bits (ncpus2_shift of them), which
are used to dispatch the packet to the proper netisr, are ignored, since
they may introduce unfairness between inpcbs in the same inp_localgroup.
Hash-threshold instead of modulo-N is used to pick the inpcb from the
inpcbs in the same inp_localgroup (see http://tools.ietf.org/html/rfc2992
for hash-threshold and modulo-N).
  inp_localgroup
    hash table
   |    :     |
   +----------+      +--------------+      +--------------+
   |    79    |      |inp_localgroup|      |inp_localgroup|
   +----------+      +--------------+      +--------------+
   |    80    |----->|     *:80     |----->|192.168.2.1:80|
   +----------+      +--------------+      +--------------+
   |    81    |      |    inpcb1    |      |    inpcb4    |
   +----------+      +--------------+      +--------------+
   |    :     |      |    inpcb2    |<--+
                     +--------------+   |
                     |    inpcb3    |   |
                     +--------------+   |
                                        | input SYN dst 10.0.0.1:80
                                        |
                                        |  15           3 2 0
                                        |  +-------------+---+
                                        |  |      hash       |
                                        |  +-------------+---+
                                        +--|<-- used  -->|     (ncpus == 8)
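The selection step described above can be sketched in user-space C. This is an illustration only, not the kernel code from this commit: struct local_group, group_add(), group_pick_index() and group_select() are hypothetical names, plain ints stand in for inpcb pointers, the hash is assumed to be 16 bits wide as drawn in the diagram, and the multiply-and-shift mapping is just one way to carve the hash space into contiguous, nearly equal regions in the spirit of RFC 2992's hash-threshold. The arithmetic actually used in in_pcb.c may differ.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_MAX	256	/* per-group cap stated in the commit message */

/* Hypothetical stand-in for inp_localgroup; the fields are illustrative. */
struct local_group {
	uint32_t	laddr;			/* bound local address */
	uint16_t	lport;			/* bound local port */
	unsigned	count;			/* members currently in use */
	int		members[GROUP_MAX];	/* stand-ins for inpcb pointers */
};

/* Append a member; returns -1 once the group is full. */
int
group_add(struct local_group *g, int member)
{
	if (g->count >= GROUP_MAX)
		return -1;
	g->members[g->count++] = member;
	return 0;
}

/*
 * Map a 16-bit packet hash to a member index in [0, count).
 * The low netisr_shift bits (the role ncpus2_shift plays in the text)
 * are discarded first, because they already decided which netisr
 * handles the packet. The remaining value is scaled so that each
 * member owns one contiguous region of the hash space.
 */
unsigned
group_pick_index(uint16_t pkt_hash, unsigned netisr_shift, unsigned count)
{
	uint32_t usable;
	unsigned bits;

	assert(count > 0 && count <= GROUP_MAX && netisr_shift < 16);
	usable = (uint32_t)pkt_hash >> netisr_shift;
	bits = 16 - netisr_shift;
	/* usable < 2^bits, so the result is always below count. */
	return (unsigned)((usable * count) >> bits);
}

/* Pick the member that should handle a packet with the given hash. */
int
group_select(const struct local_group *g, uint16_t pkt_hash,
    unsigned netisr_shift)
{
	return g->members[group_pick_index(pkt_hash, netisr_shift, g->count)];
}
```

With contiguous regions, adding or removing one member only moves the region boundaries, which is the property RFC 2992 cites in favor of hash-threshold over modulo-N.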
Limitations:
- Each inp_localgroup can hold at most 256 inpcbs, which should probably
be enough.
- Jailed sockets will not be entered into inp_localgroup, since the
original inpcb preference of in_pcblookup_hash() must be kept.
- Wildcard IPv4 mapped INET6 sockets will not be entered into
inp_localgroup, since the original inpcb preference of
in_pcblookup_hash() must be kept.
- If one of the sockets in the inp_localgroup is closed, e.g. the process
handling the socket crashes: For TCP, a certain number of TCP syncache
entries may be dropped prematurely by the syncache timeout, and the
sockets on the closed socket's so_comp are all closed. For UDP, all of
the datagrams on the closed socket's so_rcv are dropped. However, these
already happened before this commit.
Sysctl nodes net.inet.tcp.reuseport_ext and net.inet.udp.reuseport_ext
are added to enable/disable this SO_REUSEPORT extension on TCP and UDP.
They are enabled by default.
Summary of changes:
sys/netinet/in_pcb.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++-
sys/netinet/in_pcb.h | 48 ++++++---
sys/netinet/tcp_input.c | 13 ++-
sys/netinet/tcp_subr.c | 2 +
sys/netinet/udp_usrreq.c | 11 ++-
5 files changed, 299 insertions(+), 23 deletions(-)
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/740d1d9f7b7bf9c9c021abb8197718d7a2d441c9
--
DragonFly BSD source repository