load balancing (was: Re: re(4) update)
Matthew Dillon
dillon at apollo.backplane.com
Mon Oct 30 11:29:11 PST 2006
:How did you dispatch between those boxes? Use a round-robin DNS, or
:something more sophisticated?
It was basically just round robin DNS. There was never a need to
do anything more. The boxes stayed fairly well load balanced simply
due to the number of connections per second they were handling.
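    For reference, round-robin DNS needs nothing more than multiple A
    records for the same name and letting the name server rotate through
    them.  Something like this in the zone file (addresses made up):

	www	IN	A	192.0.2.10
	www	IN	A	192.0.2.11
	www	IN	A	192.0.2.12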
:How did you relay the data on those boxes? Just simple poll/read/write
:cycles? I realize that you just said that it just added 2ms delay, but
:doesn't the transferred data get copied several times around in the box?
: What's the fuss about zero-copy then, if this copying doesn't matter?
:
:cheers
: simon
Pretty much just select()-based non-blocking I/O.
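    A minimal sketch of that kind of relay loop (not the actual code;
    setup, error handling, and the reverse direction are omitted, and the
    descriptors are assumed to already be connected and non-blocking):

	#include <sys/select.h>
	#include <unistd.h>

	/*
	 * Shovel bytes from sock_in to sock_out.  A real proxy would
	 * relay both directions and buffer partial writes; this only
	 * shows the basic select()/read()/write() cycle.
	 */
	static void
	relay(int sock_in, int sock_out)
	{
		char buf[4096];
		fd_set rfds;
		ssize_t n;

		for (;;) {
			FD_ZERO(&rfds);
			FD_SET(sock_in, &rfds);
			if (select(sock_in + 1, &rfds, NULL, NULL, NULL) <= 0)
				break;
			n = read(sock_in, buf, sizeof(buf));
			if (n <= 0)
				break;		/* EOF or error */
			if (write(sock_out, buf, n) != n)
				break;		/* short write; real code buffers */
		}
	}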
You can only consider zero-copy in the context of situations where
the cpu doesn't actually have to access the data. The moment you
have to do *ANY* processing of the data, whether it be scanning
the data or calling strlen() or whatever, most of the benefits of
zero-copy go out the window.
The first access to any DMA'd data almost always requires a cache load,
no matter how little data has been DMA'd. Further accesses (or copies)
of the data after it has been loaded into the L1/L2 cache are for all
intents and purposes free. This is easily demonstrated:
# cd /usr/src/sys/test/sysperf
# make /tmp/mem1
# /tmp/mem1 2048
docopy1 1.650s 6087104 loops = 0.271uS/loop
docopy1 2048 7554.55 MBytes/sec <<<<< 7.5 GBytes/sec
(L1 cache)
# /tmp/mem1 16384
docopy1 1.947s 943680 loops = 2.064uS/loop
docopy1 16384 7939.53 MBytes/sec <<<<< 7.9 GBytes/sec
(L1 cache)
# /tmp/mem1 65536
docopy1 1.993s 59904 loops = 33.275uS/loop
docopy1 65536 1969.48 MBytes/sec <<<<< 1.9 GBytes/sec
(L2 cache)
# /tmp/mem1 1048576
docopy1 2.011s 2048 loops = 981.848uS/loop
docopy1 1048576 1067.94 MBytes/sec <<<<< 1.0 GBytes/sec
(caches blown)
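    (What the timed inner loop amounts to is just copying a buffer of the
    given size over and over; roughly the sketch below, though see the
    actual sysperf source for the real test:)

	#include <stddef.h>
	#include <strings.h>

	/*
	 * Approximation of the timed inner loop.  Small buffers stay
	 * hot in L1/L2 after the first pass; large buffers miss the
	 * caches on every pass, which is where the bandwidth falls off
	 * above.  (Sketch only, not the sysperf code.)
	 */
	static void
	copy_loop(char *src, char *dst, size_t bytes, long loops)
	{
		while (loops-- > 0)
			bcopy(src, dst, bytes);
	}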
Now with sufficient hardware support, in particular hardware IP AND
TCP checksum generation, AND careful alignment of the data payload
during DMA, it *IS* possible with a zero-copy mechanism
to read() and write() data without the cpu ever having to touch it.
If you consider this mechanism:
	while ((n = read(socket1, buf, PAGE_SIZE)) > 0)
		write(socket2, buf, n);
The cpu doesn't actually have to touch the data at all if the networking
hardware is capable of verifying and generating the TCP and IP
checksums AND a zero-copy mechanism has been implemented to use VM tricks
to map the buffer.
I personally believe that this was one of the original reasons why
zero-copy was implemented. It was largely superseded by sendfile(),
which is zero-copy by design, to the point where I do not personally
believe that there is any good argument for a zero-copy VM solution
for userland OTHER than sendfile().
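    (For the curious, the userland side of that path is just the
    sendfile(2) call; something like this on the BSDs, minus error
    handling -- check the man page for the exact semantics on your
    system:)

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/*
	 * Push nbytes of the open file fd down the connected socket s,
	 * starting at offset.  The kernel moves the file data straight
	 * from the buffer cache to the network; with hardware checksum
	 * support the cpu never has to read the payload.
	 */
	static int
	send_static_file(int fd, int s, off_t offset, size_t nbytes)
	{
		off_t sbytes = 0;

		return (sendfile(fd, s, offset, nbytes, NULL, &sbytes, 0));
	}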
If you actually had to access the data in the buffer, even for a simple
lookup, that forces the cpu to read the data into its caches. Once
in the cache, further accesses or copies are pretty much in the noise.
This means that zero-copy is only truly beneficial if the data NEVER
has to hit the cpu caches.
This also means that zero-copy will not significantly benefit most
practical networking applications. Packet routing or bridging... yes
(at least for large packets). sendfile()? Already implemented; it's
just a matter of the network hardware supporting TCP and IP checksumming
to get zero-copy operation. Just about everything else? Not really.
Unfortunately, benchmarks tend to exaggerate the benefits of a zero-copy
mechanism, because benchmarks basically measure a system using 100%
of its cpu doing nothing BUT the networking operations being tested.
In real life, though, applications usually do a lot more than just copy
data without looking at it. For that matter, no programmer in their
right mind runs a machine at 100% cpu (where all 100% of the cpu is
doing something critical to the operation of the machine; we're not
talking about eating idle cycles with setiathome here!). The cost
structure of machines versus bandwidth, or machines versus revenue, is
such that it is almost always going to be a better idea to add a few
more machines to your network, or to spend a little time redesigning
your topology or algorithm, versus depending on some ultra-optimized
piece of code which then locks you into either a proprietary solution or
a single very specialized piece of freeware with no room to wiggle.
This leaves us with only sendfile() as being a significant beneficiary
of a zero-copy mechanism. When a web server uses sendfile() in
combination with networking hardware that does the TCP/IP checksumming
itself (so the kernel doesn't have to access the data at all), the
data path winds up being DISKDMA->MEMORY + MEMORY->NETWORKDMA, and the
data never appears in the cpu's L1/L2 caches at all. If you then have
other things using those caches (such as other aspects of the web
server that require data processing), you get a real benefit.
That is the only worthwhile case I can think of, and we already do it.
-Matt