git: socket: Extend SO_REUSEPORT to distribute workload to available sockets

Sepherosa Ziehau sephe at crater.dragonflybsd.org
Tue May 21 23:09:44 PDT 2013


commit 740d1d9f7b7bf9c9c021abb8197718d7a2d441c9
Author: Sepherosa Ziehau <sephe at dragonflybsd.org>
Date:   Mon May 13 21:48:10 2013 +0800

    socket: Extend SO_REUSEPORT to distribute workload to available sockets
    
    The idea is from Linux's recently added SO_REUSEPORT support from Google:
    https://lwn.net/Articles/542629/
    (thanks to aggelos@ for pointing it out to me)
    
    In DragonFly, SO_REUSEPORT is already supported.  However, the original
    support only allows the first wildcard address bound socket or the last
    non-wildcard address bound socket to receive input, e.g. accept(2) on TCP
    socket or receive datagrams on UDP socket; the rest of the sockets bound
    to the same port will _not_ get any input.
    
    In this commit, we extend SO_REUSEPORT to allow all sockets bound to the
    same address and same port to receive input based on the input packet's
    hash, so the workload, e.g. accept(2) or datagram reception, could be
    evenly distributed among different sockets (imagine each socket is
    handled by one process/thread).  This extension could also reduce
    user-space contention on a TCP listen socket's so_comp or a UDP socket's
    so_rcv, compared with the traditional and commonly used single-socket
    model.
    
    The implementation details:
    - Introduce inp_localgroup, which groups inpcbs bound to the same address
      and same port.
    - Add inp_localgroup hash table to inpcbinfo.  This hash table is
      allocated only for protocols supporting SO_REUSEPORT extension.
      Currently only TCP and UDP support SO_REUSEPORT extension.
    - When an inpcb is inserted into the inpcbinfo wildcard hash table, it is
      also inserted into the corresponding inp_localgroup.
    - Before locating inpcb from inpcbinfo wildcard hash table, we check
      inpcbinfo's inp_localgroup hash table first.  If there is a matching
      inp_localgroup, packet hash will be used to pick one of the inpcbs from
      the inp_localgroup, and this inpcb will be used for further processing
      on this packet.  The low ncpus2_shift bits of the packet hash, which
      are used to dispatch the packet to the proper netisr, are ignored,
      since they may introduce unfairness between inpcbs in the same
      inp_localgroup.  Hash-threshold instead of modulo-N is used to pick
      the inpcb from the inpcbs in the same inp_localgroup (see
      http://tools.ietf.org/html/rfc2992 for hash-threshold and modulo-N).
    
     inp_localgroup
       hash table
    
      |    :     |
      +----------+      +--------------+      +--------------+
      |    79    |      |inp_localgroup|      |inp_localgroup|
      +----------+      +--------------+      +--------------+
      |    80    |----->|     *:80     |----->|192.168.2.1:80|
      +----------+      +--------------+      +--------------+
      |    81    |      |    inpcb1    |      |    inpcb4    |
      +----------+      +--------------+      +--------------+
      |    :     |      |    inpcb2    |<--+
                        +--------------+   |
                        |    inpcb3    |   |
                        +--------------+   |
                                           |  input SYN dst 10.0.0.1:80
                                           |
                                           |  15           3 2  0
                                           |  +-------------+---+
                                           |  |       hash      |
                                           |  +-------------+---+
                                           +--|<--  used -->| (ncpus == 8)
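
The selection step drawn above can be sketched in plain C as follows. This is a minimal illustration of hash-threshold selection in the sense of RFC 2992, not the code in in_pcb.c: the function name pick_index(), the HASH_BITS constant, and the rounding of the region size are assumptions made for the example, and the 16-bit hash width is taken from the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 16-bit packet hash, as drawn in the diagram. */
#define HASH_BITS	16

/*
 * Hypothetical helper: pick an index in [0, ninp) for a packet hash.
 *
 * The low `cpu_shift` bits (ncpus2_shift in the commit text) are dropped
 * because they already decided which netisr handles the packet.  The
 * remaining key space is split into `ninp` contiguous regions of equal
 * size (hash-threshold); the region containing the key is the index.
 */
static unsigned
pick_index(uint16_t pkt_hash, unsigned cpu_shift, unsigned ninp)
{
	uint32_t used, keyspace, region;

	assert(ninp >= 1);
	assert(cpu_shift < HASH_BITS);

	used = (uint32_t)pkt_hash >> cpu_shift;
	keyspace = (uint32_t)1 << (HASH_BITS - cpu_shift);

	/* Round the region size up so that used / region < ninp always. */
	region = (keyspace + ninp - 1) / ninp;
	return used / region;
}
```

With ncpus == 8 the shift is 3, so pick_index(hash, 3, 3) would choose among inpcb1, inpcb2 and inpcb3 in the "*:80" group. Compared with modulo-N, contiguous regions mean that adding or removing one inpcb moves fewer flows to a different socket, which is the property RFC 2992 analyzes.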
    
    Limitations:
    - Each inp_localgroup can hold at most 256 inpcbs, which should
      probably be enough.
    - Jailed sockets will not be entered into inp_localgroup, since the
      original inpcb preference of in_pcblookup_hash() must be kept.
    - Wildcard IPv4 mapped INET6 sockets will not be entered into
      inp_localgroup, since the original inpcb preference of
      in_pcblookup_hash() must be kept.
    - If one of the sockets in the inp_localgroup is closed, e.g. because
      the process handling the socket crashes: For TCP, a certain number of
      TCP syncache entries may be dropped prematurely by the syncache
      timeout, and the sockets on the closed socket's so_comp are all
      closed.  For UDP, all of the datagrams on the closed socket's so_rcv
      are dropped.  However, this behavior already existed before this
      commit.
    
    Sysctl nodes net.inet.tcp.reuseport_ext and net.inet.udp.reuseport_ext
    are added to enable/disable this SO_REUSEPORT extension on TCP and UDP.
    They are enabled by default.

Summary of changes:
 sys/netinet/in_pcb.c     | 248 ++++++++++++++++++++++++++++++++++++++++++++++-
 sys/netinet/in_pcb.h     |  48 ++++++---
 sys/netinet/tcp_input.c  |  13 ++-
 sys/netinet/tcp_subr.c   |   2 +
 sys/netinet/udp_usrreq.c |  11 ++-
 5 files changed, 299 insertions(+), 23 deletions(-)

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/740d1d9f7b7bf9c9c021abb8197718d7a2d441c9


-- 
DragonFly BSD source repository

