load balancing (was: Re: re(4) update)
Matthew Dillon
dillon at apollo.backplane.com
Mon Oct 30 11:29:11 PST 2006
:How did you dispatch between those boxes? Use a round-robin DNS, or
:something more sophisticated?
It was basically just round robin DNS. There was never a need to
do anything more. The boxes stayed fairly well load balanced simply
due to the number of connections per second they were handling.
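    For reference, round-robin DNS needs nothing more than multiple A
    records for the same name and letting the name server rotate through
    them.  Something like this in the zone file (addresses made up):

	www	IN	A	192.0.2.10
	www	IN	A	192.0.2.11
	www	IN	A	192.0.2.12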
:How did you relay the data on those boxes? Just simple poll/read/write
:cycles? I realize that you just said that it just added 2ms delay, but
:doesn't the transferred data get copied several times around in the box?
: What's the fuss about zero-copy then, if this copying doesn't matter?
:
:cheers
: simon
Pretty much just select()-based non-blocking I/O.
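    A minimal sketch of that kind of relay loop (not the actual code;
    setup, error handling, and the reverse direction are omitted, and the
    descriptors are assumed to already be connected and non-blocking):

	#include <sys/select.h>
	#include <unistd.h>

	/*
	 * Shovel bytes from sock_in to sock_out.  A real proxy would
	 * relay both directions and buffer partial writes; this only
	 * shows the basic select()/read()/write() cycle.
	 */
	static void
	relay(int sock_in, int sock_out)
	{
		char buf[4096];
		fd_set rfds;
		ssize_t n;

		for (;;) {
			FD_ZERO(&rfds);
			FD_SET(sock_in, &rfds);
			if (select(sock_in + 1, &rfds, NULL, NULL, NULL) <= 0)
				break;
			n = read(sock_in, buf, sizeof(buf));
			if (n <= 0)
				break;		/* EOF or error */
			if (write(sock_out, buf, n) != n)
				break;		/* short write; real code buffers */
		}
	}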
You can only consider zero-copy in the context of situations where
the cpu doesn't actually have to access the data. The moment you
have to do *ANY* processing of the data, whether it be scanning
the data or calling strlen() or whatever, most of the benefits of
zero-copy go out the window.
The first access to any DMA'd data almost always requires a cache load,
no matter how little data has been DMA'd. Further accesses (or copies)
of the data after it has been loaded into the L1/L2 cache are for all
intents and purposes free. This is easily demonstrated:
# cd /usr/src/sys/test/sysperf
# make /tmp/mem1
# /tmp/mem1 2048
docopy1 1.650s 6087104 loops = 0.271uS/loop
docopy1 2048 7554.55 MBytes/sec <<<<< 7.5 GBytes/sec
(L1 cache)
# /tmp/mem1 16384
docopy1 1.947s 943680 loops = 2.064uS/loop
docopy1 16384 7939.53 MBytes/sec <<<<< 7.9 GBytes/sec
(L1 cache)
# /tmp/mem1 65536
docopy1 1.993s 59904 loops = 33.275uS/loop
docopy1 65536 1969.48 MBytes/sec <<<<< 1.9 GBytes/sec
(L2 cache)
# /tmp/mem1 1048576
docopy1 2.011s 2048 loops = 981.848uS/loop
docopy1 1048576 1067.94 MBytes/sec <<<<< 1.0 GBytes/sec
(caches blown)
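    (What the timed inner loop amounts to is just copying a buffer of the
    given size over and over; roughly the sketch below, though see the
    actual sysperf source for the real test:)

	#include <stddef.h>
	#include <strings.h>

	/*
	 * Approximation of the timed inner loop.  Small buffers stay
	 * hot in L1/L2 after the first pass; large buffers miss the
	 * caches on every pass, which is where the bandwidth falls off
	 * above.  (Sketch only, not the sysperf code.)
	 */
	static void
	copy_loop(char *src, char *dst, size_t bytes, long loops)
	{
		while (loops-- > 0)
			bcopy(src, dst, bytes);
	}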
Now with sufficient hardware support, in particular hardware IP AND
TCP checksum generation, AND careful alignment of the data payload
during DMA, it *IS* possible with a zero-copy mechanism
to read() and write() data without the cpu ever having to touch it.
If you consider this mechanism:
	while ((n = read(socket1, buf, PAGE_SIZE)) > 0)
		write(socket2, buf, n);
The cpu doesn't actually have to touch the data at all if the networking
hardware is capable of verifying and generating the TCP and IP
checksums AND a zero-copy mechanism has been implemented to use VM tricks
to map the buffer.
I personally believe that this was one of the original reasons why
zero-copy was implemented. It was largely superseded by sendfile(),
which is zero-copy by design, to the point where I do not personally
believe that there is any good argument for a zero-copy VM solution
for userland OTHER than sendfile().
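    (For the curious, the userland side of that path is just the
    sendfile(2) call; something like this on the BSDs, minus error
    handling -- check the man page for the exact semantics on your
    system:)

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/*
	 * Push nbytes of the open file fd down the connected socket s,
	 * starting at offset.  The kernel moves the file data straight
	 * from the buffer cache to the network; with hardware checksum
	 * support the cpu never has to read the payload.
	 */
	static int
	send_static_file(int fd, int s, off_t offset, size_t nbytes)
	{
		off_t sbytes = 0;

		return (sendfile(fd, s, offset, nbytes, NULL, &sbytes, 0));
	}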
If you actually had to access the data in the buffer, even for a simple
lookup, that forces the cpu to read the data into its caches. Once
in the cache, further accesses or copies are pretty much in the noise.
This means that zero-copy is only truly beneficial if the data NEVER
has to hit the cpu caches.
This also means that zero-copy will not significantly benefit most
practical networking applications. Packet routing or bridging... yes
(at least for large packets). sendfile()? Already implemented; it's
just a matter of the network hardware supporting TCP and IP checksumming
to get zero-copy operation. Just about everything else? Not really.
Unfortunately, benchmarks tend to exaggerate the benefits of a zero-copy
mechanism, because benchmarks basically measure a system using 100%
of its cpu doing nothing BUT the networking operations being tested.
In real life, though, applications usually do a lot more than just copy
data without looking at it. For that matter, no programmer in their
right mind runs a machine at 100% cpu (where all 100% of the cpu is
doing something critical to the operation of the machine; we're not
talking about eating idle cycles with setiathome here!). The cost
structure of machines versus bandwidth, or machines versus revenue, is
such that it is almost always going to be a better idea to add a few
more machines to your network, or to spend a little time redesigning
your topology or algorithm, versus depending on some ultra-optimized
piece of code which then locks you into either a proprietary solution or
a single very specialized piece of freeware with no room to wiggle.
This leaves us with only sendfile() as being a significant beneficiary
of a zero-copy mechanism. When a web server uses sendfile() in
combination with networking hardware that does the TCP/IP checksumming
itself (so the kernel doesn't have to access the data at all), the
data path winds up being DISKDMA->MEMORY + MEMORY->NETWORKDMA, and the
data never appears in the cpu's L1/L2 caches at all. If you then have
other things using those caches (such as other aspects of the web
server that require data processing), you get a real benefit.
That is the only worthwhile case I can think of, and we already do it.
-Matt