dillon at apollo.backplane.com
Mon Oct 30 09:07:27 PST 2006
:Matthew Dillon wrote:
:> [If] I actually HAD a critical application that required that kind of
:> routing or bridging I would buy a dedicated piece of hardware to handle
:> it, not try to use a general purpose operating system running on
:> commodity hardware.
:Ironically, at [big e-commerce company], where I work, we're moving away
:from the dedicated load balancing hardware, though not for any of the
:reasons you're discussing. Our traffic volume is high and heterogeneous
:enough to trigger subtle bugs in the LB firmware. Getting debug
:information is next to impossible when they just start dropping packets
:on the floor. They're also really expensive, so the vendors' solution
:-- "You need to buy more of our hardware" -- makes our finance guys cringe.
:Rather than replace them with dedicated boxes, though, we'll probably
:just remove that layer entirely -- clients will negotiate leases onto a
:single host for a given period of time, with initial discovery through a
I completely agree with that sentiment, especially for load balancing
hardware. I personally believe that it is actually better to have
a real machine on the front-end accepting connections and manipulating
the data, then reconnecting to the 'real' machine on the backend and
doing <--> <--> with the rest of the data stream.
I implemented such a solution at BEST Internet to handle
www.best.com/~user web accesses. As people may know, we had something
like 25+ separate user machines, with about 2000 accounts on each.
But we wanted to have a common URL to front-end all of the personal web
pages. I don't quite remember, but I think we also used the same
scheme to shift dedicated WWW domains around.
The solution was to have a couple of cookie-cutter boxes front-ending
all the WWW connections, doing the first few protocol interactions
(e.g. processing the WWW command and the Host: header), then
looking the info up in a DBM and reconnecting to the actual machine.
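A minimal sketch of that front-end logic, not the code BEST actually ran: read the request head, pull out the path and the Host: header, look up a backend, then reconnect and relay bytes both ways. The BACKENDS dict stands in for the DBM file, and every name here (parse_head, pick_backend, pump, handle, the example.com hosts) is hypothetical. It ignores ports in Host: values and does no caching.

```python
import socket
import threading

# Hypothetical stand-in for the DBM file: maps "~user" or a Host: value
# to the backend machine that actually serves it.
BACKENDS = {
    "~alice": "shell1.example.com",
    "www.example.org": "shell7.example.com",
}

def parse_head(head: bytes):
    """Return (path, host) from the request line and the Host: header."""
    lines = head.decode("latin-1").split("\r\n")
    parts = lines[0].split()
    path = parts[1] if len(parts) >= 2 else "/"
    host = ""
    for line in lines[1:]:
        if line == "":
            break  # end of headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower()
            break
    return path, host

def pick_backend(path, host, table=BACKENDS):
    """Look up the backend by /~user first, then by Host: value."""
    if path.startswith("/~"):
        user = path[1:].split("/", 1)[0]  # e.g. "~alice"
        if user in table:
            return table[user]
    return table.get(host)

def pump(src, dst):
    """Copy bytes one way until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client, port=80):
    """Serve one accepted client socket: route it, then relay in userland."""
    head = b""
    while b"\r\n\r\n" not in head and len(head) < 8192:
        chunk = client.recv(4096)
        if not chunk:
            break
        head += chunk
    backend = pick_backend(*parse_head(head))
    if backend is None:
        client.close()
        return
    upstream = socket.create_connection((backend, port))
    upstream.sendall(head)  # replay the bytes already consumed
    t = threading.Thread(target=pump, args=(upstream, client), daemon=True)
    t.start()
    pump(client, upstream)
    t.join()
    client.close()
    upstream.close()
```

Because all the data passes through pump() in userland, it is also the natural place to hang byte counters or a debug dump of a misbehaving connection.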
This scheme had many, many advantages.
* As almost pure networking applications (with only a couple of DBM
lookups occurring, all easily cacheable in memory)... these machines
never crashed. And I mean never. We found a networking card that
worked perfectly (fxp I think) and never had a single problem.
* It didn't matter which machine handled a connection. Each machine
was an exact cookie cutter copy. A DBM update was pushed out to
the boxes once an hour via cron.
* I could add or remove machines at will. Bandwidth and loading were
never an issue. We never needed more than 3 boxes, though.
* We could dip into the data stream at will to diagnose problems.
All the data was running through userland. It wasn't using bridging
or packet manipulation. I also had the programs keep statistics.
* The machines could serve as buffers against load spikes and since
they processed the first few protocol commands and headers they also
had a tendency to be able to intercept DOS attacks made against our
And so on and so forth. I did a similar protocol intercept for POP3
and used MX records and MX forwarding to buffer SMTP (mainly to
offload the crazy DNS load so the shell boxes wouldn't have to cache
tens of thousands of domain names in their local DNS servers).
What's really interesting is that these forwarding boxes added something
like 5ms of delay to the data streams, and nobody ever noticed or cared.
These days accessing any major web site can take multiples of seconds
due to the complexity of the site, all the separate DNS domains that
the client machine has to look up to process the page, and backend
latency (servlet startup, etc). If I bring up any major web site,
like www.sfgate.com or slashdot or the JPL or ANY major news site
it takes no LESS than 5 seconds for the web page to load, and sometimes
upwards of 10 seconds. My bank's web site is just as bad. And it isn't
because of network bandwidth issues. I think one could introduce upwards
of 20ms of networking latency on the web server side and not notice
What is crazy is that nobody bothers to benchmark this problem. All the
benchmarks you see published are basically measuring how many
milliseconds it takes to load a simple HTML page or a few graphics, and
they pat themselves on the back for being 2ms faster than the
competition. Nobody gives a $#@% about 2ms, or 10ms. That isn't the
problem any more.
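To put rough numbers on that, here is a small sketch using only the figures quoted in this post (2ms, 5ms and 20ms of added latency against 5 to 10 second page loads). The function name is made up for illustration.

```python
def added_latency_share(added_ms, page_load_s):
    """Added latency as a percentage of the total page load time."""
    return 100.0 * added_ms / (page_load_s * 1000.0)

for added_ms in (2, 5, 20):
    for page_load_s in (5, 10):
        share = added_latency_share(added_ms, page_load_s)
        print(f"{added_ms} ms on a {page_load_s} s load: {share:.2f}%")
```

Even the 20ms case comes out to well under half a percent of a 5 second page load.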
Similarly, if running a GigE link into a server requires it to operate
at 100% capacity, the problem is with the design of the service, not
with the fact that the network driver eats up a lot of cpu when running
at 100% capacity. If I have a big web server and I am shoving out
500 MBits of data a second, then my main worry is going to be the cost
of transporting that data over the internet relative to which the cost
of the server is pretty much zip. Except for very, very rare cases I
am not really going to give a rat's ass when a single tcp connection is
unable to saturate a GigE link.