rc and smf

Matthew Dillon dillon at apollo.backplane.com
Thu Feb 24 13:48:15 PST 2005


:Matthew Dillon wrote:
:>     Hmm.  Well, I have to say that in my opinion a service failure is a 
:>     critical bug in the application.   I usually go in and fix the application
:
:Nobody argues this. Again, this is one of the reasons why people
:supervise in the first place. There's nothing stopping you from adding
:an alert feature to a supervisor.
:
:>     software rather than write monitoring programs for it (other than to 
:>     tell me if it has failed).   Most service oriented applications fork()
:>     on connect (a DNS cache being an exception), and those that have the
:
:Nothing stops the parent process of your forked children from being killed or
:crashing, obviously for some of the reasons already discussed.

    Be killed ... by what?   Crashing ... due to what?   The problem here
    is that you are just throwing out examples without paying any attention
    to the likelihood that the issue might actually occur under normal 
    (or even exceptional) system operation.  It's like you don't trust that
    a for(i = 0; i < 10; ++i) loop will actually count properly and you want
    to protect against it possibly not counting properly.

    You are saying "what if" instead of "how often".  Just because something
    might POTENTIALLY happen doesn't mean that it WILL happen or that it will
    happen often enough to warrant protection or that it will EVER happen
    in the particular environment you are trying to protect.  People get hit
    by lightning all the time but that doesn't mean we wear a Faraday cage
    jacket every time we go outside!  Hard drives fail all the time, but 
    most consumer systems still ship with just one.  And, frankly, it's far
    more likely that your RAID storage system will fail than many of the
    things you are pulling out as examples.

    I don't bother putting a crash monitor on sendmail and apache because,
    well, sendmail hasn't actually crashed on me for at least 20 years,
    and apache hasn't crashed on me since I started using it.  Slow down, yes.
    Get behind on the queues, yes.  Have a CGI/backend database failure, 
    absolutely.  But the primary connection accepting server actually
    crash?  Hasn't happened.

    If I want my apache server to be robust I write a monitoring program
    that runs on an entirely DIFFERENT machine, and doesn't just test
    whether the connection works, but actually goes in and issues a real
    query that exercises the most complex CGI/database path I can find,
    and screams bloody hell if that fails.
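
    A minimal sketch of that kind of end-to-end probe, meant to run from
    a different machine than the server.  The URL, the marker string and
    the names http_fetch and check_deep_path are hypothetical and chosen
    only for illustration; the point is that the probe fetches a page
    that goes through the CGI/database path and checks the content, not
    just that the port accepts a connection.

```python
import urllib.request


def http_fetch(url, timeout=10.0):
    """Fetch url and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def check_deep_path(url, expected_marker, fetch=http_fetch):
    """Return (ok, reason) for one end-to-end probe of the service.

    expected_marker is a string that only appears in the response when
    the backend query actually succeeded (an assumption about the page).
    """
    try:
        status, body = fetch(url)
    except Exception as exc:  # refused, timed out, HTTP error status, ...
        return False, "request failed: %s" % exc
    if status != 200:
        return False, "unexpected status %d" % status
    if expected_marker not in body:
        return False, "expected marker missing from response"
    return True, "ok"
```

    Run it periodically (cron is enough) against something like
    check_deep_path("http://www.example.com/cgi-bin/report", "rows:")
    and page someone whenever ok comes back False.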

    Dan, we could argue what-ifs all day long, because there are an 
    infinite number of what-if scenarios.  It's like pulling a rabbit out
    of your hat.  The problem is that just throwing out these scenarios
    doesn't actually help anyone running a REAL production server.  You
    are trying to solve problems that you don't have rather than trying
    to solve the problems that you do have.  That's the real issue here.

    Now, a lot of people on these lists, including me, have tried to explain
    this to you, but you don't seem to be getting it.  You are still focusing
    on what-if scenarios that might occur once a decade or not at all instead
    of solving the REAL problem facing you, which in the case of that
    mail proxy service is simply configuring the program to limit the
    number of simultaneous connections it can handle.  And if it doesn't
    have such a configuration option, then it's broken and you should either
    fix it or replace it with something better.  It's that simple.  You
    don't need overcommit, you don't necessarily need service monitoring.
    If the program is otherwise reliable you just need a simple 
    configuration variable.
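
    To illustrate how little is involved, here is a rough sketch of a
    connection cap driven by one configuration value.  This is not how
    any particular mail proxy does it; MAX_CONNECTIONS, ConnectionLimiter
    and handle_connection are made-up names, and a real server would
    read the limit from its own config file.

```python
import threading

MAX_CONNECTIONS = 64  # hypothetical configuration variable


class ConnectionLimiter:
    """Caps the number of connections being served at the same time."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False immediately when every slot is in use.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle_connection(limiter, conn, serve):
    """Serve conn if a slot is free; otherwise refuse it right away.

    Returns True if the connection was served, False if it was refused.
    """
    if not limiter.try_acquire():
        conn.close()  # over the limit: drop it instead of piling up work
        return False
    try:
        serve(conn)
        return True
    finally:
        limiter.release()
        conn.close()
```

    The accept loop would create one ConnectionLimiter(MAX_CONNECTIONS)
    at startup and pass each accepted socket through handle_connection,
    so memory use is bounded by the limit rather than by however many
    clients show up at once.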

						-Matt





