Study of nginx-1.9.12 performance/latency on DragonFlyBSD-g67a73.

Sepherosa Ziehau sepherosa at gmail.com
Mon May 9 06:26:51 PDT 2016


Study of nginx-1.9.12 performance/latency on DragonFlyBSD-g67a73.

The performance and latency are measured using a modified version of wrk:
https://github.com/sepherosa/wrk.git (sephe/wrk branch).

It mainly adds a requests/connection setting and avoids several
unnecessary syscalls.
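
Building the modified wrk is the same as building stock wrk; roughly
(only a sketch, the checkout/build steps are nothing special):

git clone -b sephe/wrk https://github.com/sepherosa/wrk.git
cd wrk
make
./wrk --connreqs 4 -c 100 -d 30s -t 4 --latency http://192.168.3.254/1K.bin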

Hardware configuration:
Server: 2-way E5-2620v2 (24 logical cpus), 32GB DDR3 1600 (4GBx8).
Client: i7-3770, 16GB DDR3 1600 (8GBx2).
NICs: Intel 82599 10Ge connected through DAC.

Network configuration:
+--------+                           +--------+
|        |192.168.3.254   192.168.3.1|        |
| Server +---------------------------+ Client |
|        |10Ge        DAC        10Ge|        |
+--------+                           +--------+

The MSL of the testing network is changed to 10ms by:
route change -net 192.168.3.0/24 -msl 10
(This shortens the TIME_WAIT period, which is 2*MSL, so the huge number
of short-lived connections created by these measurements can be
recycled quickly.)

DragonFlyBSD settings:
/boot/loader.conf:
kern.ipc.nmbclusters="524288"
kern.ipc.nmbjclusters="262144"
/etc/sysctl.conf:
kern.ipc.somaxconn=256
machdep.mwait.CX.idle=AUTODEEP
net.inet.ip.portrange.last=40000

And powerd(8) is enabled on both sides during the measurements.

NOTE:
Unlike other nginx performance measurements, which use nginx's default
number of requests/connection (100) or even intentionally use an
infinite number of requests/connection, we use three values for
requests/connection throughout these measurements: 1 request/connection,
4 requests/connection and 14 requests/connection.  These are much
closer to real-world usage: as noted in RFC 6928, 35% of HTTP requests
are made on new connections, and according to the data from
httparchive.org around 2014:
https://discuss.httparchive.org/t/distribution-of-http-requests-per-tcp-connection/365

NOTE:
Unless otherwise noted: polling(4) @1000hz and IW4 are used.  32 workers
are used and the 'reuseport' option is enabled in nginx-1.9.12.
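
The corresponding part of the nginx configuration looks roughly like
this (the listen port, document root and worker_connections value are
only illustrative, not copied from the test box):

worker_processes  32;

events {
    worker_connections  65536;
}

http {
    server {
        listen  80  reuseport;
        root    /var/www;    # the 1K.bin etc. objects live here
    }
}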

==========================

The effect of DragonFlyBSD polling(4).

The results of the following command, with interrupt and different
polling frequency settings:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay
http://192.168.3.254/1K.bin

(15000 concurrent connections, 1 request/connection, 1KB web object, 600
seconds average).

          intr (7700/s) | poll (7000hz) | poll (4000hz) | poll (1000hz)
         ---------------+---------------+---------------+---------------
Reqs/s       116961     |    140580     |    142862     |    144807
LatAvg       64.14ms    |    54.25ms    |    52.87ms    |    51.20ms
LatStdev     150.30ms   |    21.68ms    |    19.16ms    |    13.96ms

So in addition to greatly improving performance (~20%, even when the
polling frequency is set close to the interrupt rate), polling also
reduces the average latency and the latency stdev.  And the lower the
polling frequency, the better the performance and the latency.

==========================

The effect of the 'reuseport' option in nginx-1.9.12 on DragonFlyBSD.

The results of the following command, with 'reuseport' option on and
off on nginx:
./wrk -c 15000 --connreqs X -d 600s -t 8 --latency --delay
http://192.168.3.254/1K.bin

(15000 concurrent connections, X requests/connection, X={1,4,14}, 1KB
web object, 600 seconds average).

           1 request/connection
         no reuseport | reuseport
        --------------+-----------
Reqs/s       45589    |   144807
Contention   1200K/s  |   30K/s

           4 requests/connection
         no reuseport | reuseport
        --------------+-----------
Reqs/s       158603   |   227856
Contention   1300K/s  |   100K/s

          14 requests/connection
         no reuseport | reuseport
        --------------+-----------
Reqs/s       246833   |   250335
Contention   500K/s   |   150K/s

So the 'reuseport' option drastically improves performance when the
requests/connection count is low (~210% for 1 request/connection, and
~40% for 4 requests/connection).  And obviously the 'reuseport' option
greatly reduces the contention rate.
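
To make it concrete: with 'reuseport' every worker ends up with its own
listen socket on the same port, each with SO_REUSEPORT set, instead of
all workers accepting from one shared listen socket.  In plain socket
code it boils down to something like the following (a simplified
sketch, not nginx's actual code; the port and backlog are
placeholders):

#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/* What each worker effectively gets when 'reuseport' is enabled. */
int
open_reuseport_listener(int port)
{
    struct sockaddr_in sin;
    int fd, on = 1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Allow multiple sockets to bind the same address/port pair. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(fd, 256) < 0) {
        close(fd);
        return -1;
    }

    /*
     * The kernel now spreads incoming connections over all listen
     * sockets bound this way, so each worker only accepts its share.
     */
    return fd;
}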

==========================

The number of workers in nginx-1.9.12 on DragonFlyBSD (the interaction
between the non-power-of-2 number of cpus and the power-of-2 number of
netisrs when nginx's 'reuseport' option is used).

The results of the following command, with different number of workers:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay
http://192.168.3.254/1K.bin

(15000 concurrent connections, 1 request/connection, 1KB web object, 600
seconds average).

          16 workers | 24 workers | 32 workers
         ------------+------------+------------
Reqs/s      132645   |   143276   |   144807
LatAvg      46.48ms  |   54.14ms  |   51.20ms
LatStdev    27.88ms  |   18.29ms  |   13.96ms
Contention  20K/s    |   33K/s    |   30K/s

Since the server has 24 logical cpus, 16 workers give lower performance
than 24/32 workers, even though 16 matches the number of netisrs.  The
latency and contention rate are also lower simply because fewer
requests are handled.

24 workers give slightly lower performance, and higher latency and
contention rate, than 32 workers.  Why? ;).  It's mainly because of how
DragonFlyBSD implements SO_REUSEPORT: incoming TCP connections are
dispatched to a netisr (the number of netisrs is a power of 2) based on
the SYN's RSS hash value, and from there the listen socket's inpcb is
looked up based on the same RSS hash value.  If the number of listen
sockets is not a power of 2, i.e. not aligned with the number of
netisrs, a certain amount of extra contention happens, which reduces
performance and increases latency.  That's why 32 workers (aligned with
the 16 netisrs on the server) perform better than 24 workers on
DragonFlyBSD.
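
If you want to convince yourself of the alignment part, here is a toy
model.  It assumes both the netisr and the listen socket are picked by
a simple modulo of the same RSS hash value; that modulo selection is an
assumption for the sake of the example, not the kernel's exact lookup
code, but it shows why 24 listen sockets get touched from more than one
netisr while 32 do not:

#include <stdio.h>

/*
 * Toy model: count how many listen sockets can be reached from more
 * than one netisr, assuming netisr = hash % nnetisr and listen
 * socket = hash % nlisten for the same hash value.
 */
static void
check(int nnetisr, int nlisten)
{
    int sock, shared = 0;

    for (sock = 0; sock < nlisten; ++sock) {
        int seen[64] = { 0 }, nseen = 0, hash;

        /* Walk the hash values that select this listen socket. */
        for (hash = sock; hash < 4096; hash += nlisten) {
            int isr = hash % nnetisr;

            if (!seen[isr]) {
                seen[isr] = 1;
                nseen++;
            }
        }
        if (nseen > 1)
            shared++;
    }
    printf("%2d netisrs, %2d listen sockets: %2d sockets hit from "
        "more than one netisr\n", nnetisr, nlisten, shared);
}

int
main(void)
{
    check(16, 24);    /* 24 workers: every listen socket is shared */
    check(16, 32);    /* 32 workers: no listen socket is shared */
    return 0;
}

In this model, with 24 listen sockets each inpcb gets accessed from two
different netisrs; with 32, every listen socket stays on exactly one
netisr.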

==========================

Web object size, performance and interface bit rate on DragonFlyBSD.

The results of the following command, with different web object sizes:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay
http://192.168.3.254/_X_K.bin

(15000 concurrent connections, 1 request/connection, _X_KB web
object, _X_={1,8,16}, 600 seconds average).
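
The _X_K.bin objects themselves are trivial to generate, e.g. with dd
(the document root path is only a placeholder):

dd if=/dev/zero of=/var/www/1K.bin bs=1k count=1
dd if=/dev/zero of=/var/www/8K.bin bs=1k count=8
dd if=/dev/zero of=/var/www/16K.bin bs=1k count=16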

         1KB object | 8KB object | 16KB object
        ------------+------------+-------------
Reqs/s     144807   |   105100   |   68909
LatAvg     51.20ms  |   70.53ms  |   195.49ms
BitRate    1.7Gbps  |   7.5Gbps  |   9.5Gbps
CPU idle   0%       |   34%      |   54%

DragonFlyBSD maxes out the 10Ge link with the 16KB web object (or
somewhere between 8KB and 16KB :).

And as far as I have tested, IW10 does not help either performance
or latency in these measurements.

Thanks,
sephe

-- 
Tomorrow Will Never Die


