Of webservers and datacenters...

Mon Oct 30 21:33:40 PST 2006

Matthew Dillon wrote:
    These days accessing any major web site can take multiples of seconds
    due to the complexity of the site, all the separate DNS domains that
    the client machine has to look up to process the page, and backend
    latency (servlet startup, etc). [...] And it isn't
    because of network bandwidth issues.  I think one could introduce upwards
    of 20ms of networking latency on the web server side and not notice
    any difference.
Yep.  We're heavily into service-oriented architecture (SOA), and we 
probably have one of the better implementations out there -- heck, we're 
sometimes mentioned in the trade rags as the poster child for SOA.  It's 
still agonizingly slow, and I believe it's the main bottleneck in our 
systems.

I view SOA not as a way of organizing software -- it's really a way of 
organizing developers.  Or, to put it another way, SOA is what you move 
to when you can't or won't impose discipline upon your development teams 
yet you still need to ship features.  The downside is it makes your site 
slow (especially as a page starts using more services).  In response, 
you keep shipping more gee-whiz features so folks (hopefully) won't notice.
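
To put rough numbers on the "more services, slower page" point, here is a minimal sketch.  The service names and latencies are made up for illustration and are not measurements from our site.  It only contrasts a page that calls its services one after another with the best case where every call is issued at once.

```python
# Hypothetical per-service latencies in milliseconds (made-up numbers,
# not measurements from any real system).
SERVICE_LATENCY_MS = {
    "catalog": 40,
    "pricing": 25,
    "reviews": 60,
    "recommendations": 80,
}


def serial_page_latency(services):
    """Latency when each service call waits for the previous one to finish."""
    return sum(SERVICE_LATENCY_MS[name] for name in services)


def parallel_page_latency(services):
    """Best case when all calls are issued at once: the slowest call dominates."""
    return max((SERVICE_LATENCY_MS[name] for name in services), default=0)


def report():
    """Return (service count, serial ms, parallel ms) as services are added."""
    names = list(SERVICE_LATENCY_MS)
    rows = []
    for count in range(1, len(names) + 1):
        used = names[:count]
        rows.append((count, serial_page_latency(used), parallel_page_latency(used)))
    return rows


if __name__ == "__main__":
    for count, serial, parallel in report():
        print(f"{count} services: serial {serial} ms, parallel {parallel} ms")
```

Under these assumptions the serial page grows with every service added, while the parallel page is only as slow as its slowest dependency.  Real pages usually sit somewhere in between, since some calls depend on the results of others.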

    What is crazy is that nobody bothers to benchmark this problem.
Actually, we do.  We measure almost everything and have bandwidth
dedicated to log pulling and metrics publishing.  I'm on a team that
owns a set of services fronting a cluster of core databases, and our 
pagers go wild whenever latencies jump up.

Alas, improving performance is not a business priority.  It's more about 
keeping the status quo as more gee-whiz features pop into existence.

    If I have a big web server and I am shoving out 
    500 MBits of data a second, then my main worry is going to be the cost
    of transporting that data over the internet relative to which the cost
    of the server is pretty much zip.
For us, external bandwidth is a mostly-fixed cost.  (Changing this
requires a large effort -- and, yes, we've done it twice, and both times
it took a significant part of our engineering capacity.)  We did spend
a lot of effort minimizing hardware cost recently
-- we were getting crushed by teams whose approach to scaling was to 
throw more machines at it instead of rewriting O(n^2) algorithms into 
O(n log n).  (Actually, my team recently discovered a client who was 
hitting us with O(n^2) requests when they could've been doing O(1)... 
<sigh>)
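
I won't describe what that client was actually doing, but a hypothetical sketch shows how the request count can blow up.  Everything here is an assumption for illustration: `CountingService`, `get`, and `get_many` are invented names, not one of our real APIs.  The naive client fetches both records for every pair of keys; the batched client makes one request and compares locally.

```python
from itertools import combinations


class CountingService:
    """Stand-in for a remote service; counts how many requests it receives."""

    def __init__(self, records):
        self._records = dict(records)
        self.requests = 0

    def get(self, key):
        """Fetch a single record (one request)."""
        self.requests += 1
        return self._records[key]

    def get_many(self, keys):
        """Fetch several records in a single batched request."""
        self.requests += 1
        return {key: self._records[key] for key in keys}


def matching_pairs_naive(service, keys):
    """Fetch both records for every pair: n*(n-1) requests for n keys."""
    matches = []
    for a, b in combinations(keys, 2):
        if service.get(a) == service.get(b):
            matches.append((a, b))
    return matches


def matching_pairs_batched(service, keys):
    """One batched request, then compare the records locally."""
    records = service.get_many(keys)
    return [(a, b) for a, b in combinations(keys, 2) if records[a] == records[b]]
```

The local comparison in the batched version is still quadratic in CPU time on the client, but the service sees a single request instead of n*(n-1) of them, which is the part that sets off our pagers.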

Part of the problem with hardware is the ongoing cost of supplying it
with electricity (don't forget generators and UPSes for power outages),
cool air, and datatechs just to build and repair the dang things.