re(4) update

Mon Oct 30 09:07:27 PST 2006

:Matthew Dillon wrote:
:>     [If] I actually HAD a critical application that required that kind of
:>     routing or bridging I would buy a dedicated piece of hardware to handle
:>     it, not try to use a general purpose operating system running on 
:>     commodity hardware.
:
:Ironically, at [big e-commerce company], where I work, we're moving away 
:from the dedicated load balancing hardware, though not for any of the 
:reasons you're discussing.  Our traffic volume is high and heterogeneous 
:enough to trigger subtle bugs in the LB firmware.  Getting debug 
:information is next to impossible when they just start dropping packets 
:on the floor.  They're also really expensive, so the vendors' solution 
:-- "You need to buy more of our hardware" -- makes our finance guys cringe.
:
:Rather than replace them with dedicated boxes, though, we'll probably 
:just remove that layer entirely -- clients will negotiate leases onto a 
:single host for a given period of time, with initial discovery through a 
:broadcast mechanism.

    I completely agree with that sentiment, especially for load balancing
    hardware.   I personally believe that it is actually better to have
    a real machine on the front-end accepting connections and manipulating
    the data, then reconnecting to the 'real' machine on the backend and
    relaying the rest of the data stream in both directions.

    I implemented such a solution at BEST Internet to handle
    www.best.com/~user web accesses.  As people may know, we had something
    like 25+ separate user machines, with about 2000 accounts on each.
    But we wanted to have a common URL to front-end all of the personal web
    pages.  I don't quite remember, but I think we also used the same
    scheme to shift dedicated WWW domains around.

    The solution was to have a couple of cookie-cutter boxes front-ending
    all the WWW connections, doing the first few protocol interactions
    (e.g. processing the WWW command and the Host: header), then
    looking the info up in a DBM and reconnecting to the actual machine.
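
    As an illustration of the lookup step described above, here is a
    minimal Python sketch.  It is not the original BEST Internet code:
    the function names, the "user:" / "host:" key layout, and the backend
    names are assumptions made up for the example, and any mapping with
    bytes keys stands in for the DBM file.

```python
def parse_head(head: bytes):
    """Return (path, host) from the request line and the Host: header."""
    lines = head.split(b"\r\n")
    parts = lines[0].split()
    # Fall back to "/" if the request line is too short to hold a path.
    path = parts[1] if len(parts) >= 2 else b"/"
    host = b""
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            # Normalize case and drop an optional ":port" suffix.
            host = value.strip().lower().split(b":", 1)[0]
    return path, host


def pick_backend(path: bytes, host: bytes, table):
    """Look up which backend machine should receive this request.

    The key layout is hypothetical: "user:<name>" for /~name URLs,
    "host:<domain>" for everything else.
    """
    if path.startswith(b"/~"):
        user = path[2:].split(b"/", 1)[0]
        key = b"user:" + user
    else:
        key = b"host:" + host
    return table.get(key)  # None when nothing matches
```

    Objects returned by dbm.open() in the Python standard library support
    get() with bytes keys, so one could be passed as `table` in place of
    a dict.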

    This scheme had many, many advantages.

    * As almost pure networking applications (with only a couple of DBM
      lookups occurring, all easily cacheable in memory)... these machines
      never crashed.  And I mean never.  We found a networking card that
      worked perfectly (fxp I think) and never had a single problem.

    * It didn't matter which machine handled a connection.  Each machine
      was an exact cookie cutter copy.  A DBM update was pushed out to 
      the boxes once an hour via cron.

    * I could add or remove machines at will.  Bandwidth and loading were
      never an issue.  We never needed more than 3 boxes, though.

    * We could dip into the data stream at will to diagnose problems.
      All the data was running through userland.  It wasn't using bridging
      or packet manipulation.  I also had the programs keep statistics.

    * The machines could serve as buffers against load spikes and since
      they processed the first few protocol commands and headers they also
      had a tendency to be able to intercept DoS attacks made against our
      web servers.
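
    Several of the points above rest on the data passing through an
    ordinary userland program.  A rough Python sketch of such a relay,
    with per-direction byte counters, might look like the following.  The
    function name and the statistics keys are assumptions for
    illustration, and the blocking sendall() keeps the example short at
    the cost of ignoring back-pressure.

```python
import select
import socket


def relay(client, backend, stats, bufsize=65536):
    """Copy bytes both ways until both sides close; count them in stats."""
    peer = {client: backend, backend: client}
    label = {client: "client_to_backend", backend: "backend_to_client"}
    open_socks = [client, backend]
    while open_socks:
        readable, _, _ = select.select(open_socks, [], [])
        for sock in readable:
            data = sock.recv(bufsize)
            if not data:
                # EOF from this side: stop reading it and pass the
                # half-close along to the other side.
                open_socks.remove(sock)
                try:
                    peer[sock].shutdown(socket.SHUT_WR)
                except OSError:
                    pass
                continue
            # The data is in hand here, so it could be logged or
            # inspected before being forwarded.
            peer[sock].sendall(data)
            stats[label[sock]] = stats.get(label[sock], 0) + len(data)
```

    A front end would call relay() once per connection, after it has
    chosen the backend and forwarded the request bytes it already read.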

    And so on and so forth.  I did a similar protocol intercept for POP3
    and used MX records and MX forwarding to buffer SMTP (mainly to
    offload the crazy DNS load so the shell boxes wouldn't have to cache
    tens of thousands of domain names in their local DNS servers).

    What's really interesting is that these forwarding boxes added something
    like 5ms of delay to the data streams, and nobody ever noticed or cared.

    These days accessing any major web site can take multiples of seconds
    due to the complexity of the site, all the separate DNS domains that
    the client machine has to look up to process the page, and backend
    latency (servlet startup, etc).  If I bring up any major web site,
    like www.sfgate.com or slashdot or the JPL or ANY major news site
    it takes no LESS than 5 seconds for the web page to load, and sometimes
    upwards of 10 seconds.  My bank's web site is just as bad.  And it isn't
    because of network bandwidth issues.  I think one could introduce upwards
    of 20ms of networking latency on the web server side and not notice
    any difference.
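
    To put those numbers side by side, a small Python calculation using
    only the figures quoted above (5ms and 20ms of added delay, 5 and 10
    second page loads) shows how small a share the proxy latency is.
    The function name is made up for the example.

```python
def added_share_percent(extra_ms, page_load_s):
    """Extra latency as a percentage of the total page load time."""
    return extra_ms / (page_load_s * 1000.0) * 100.0


# Delays and page load times taken from the figures quoted in the text.
for extra_ms, page_load_s in [(5, 5), (20, 5), (20, 10)]:
    share = added_share_percent(extra_ms, page_load_s)
    print(f"{extra_ms} ms on a {page_load_s} s page load: {share:.1f}%")
```

    In each case the added delay comes to well under one percent of the
    time the user spends waiting for the page.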

    What is crazy is that nobody bothers to benchmark this problem.  All the
    benchmarks you see published are basically measuring how many 
    milliseconds it takes to load a simple HTML page or a few graphics, and
    they pat themselves on the back for being 2ms faster than the 
    competition.  Nobody gives a $#@% about 2ms, or 10ms.  That isn't the
    problem any more.  

    Similarly, if running a GigE link into a server requires it to operate
    at 100% capacity, the problem is with the design of the service, not
    with the fact that the network driver eats up a lot of cpu when running
    at 100% capacity.  If I have a big web server and I am shoving out 
    500 MBits of data a second, then my main worry is going to be the cost
    of transporting that data over the internet relative to which the cost
    of the server is pretty much zip.  Except for very, very rare cases I
    am not really going to give a rat's ass when a single TCP connection is
    unable to saturate a GigE link.

						-Matt