rc and smf

Thu Feb 24 11:41:35 PST 2005

    Hmm.  Well, I have to say that in my opinion a service failure is a 
    critical bug in the application.   I usually go in and fix the application
    software rather than write monitoring programs for it (other than to 
    tell me if it has failed).   Most service-oriented applications fork()
    on connect (a DNS cache being an exception), and those that have the
    option of forking or running threaded I usually tell to fork.  This
    greatly narrows the amount of code that actually has to run in the
    parent's connection-accepting loop and works as well as or better than a
    service monitor.  The proxies I wrote at BEST Internet a long time ago
    all did that, and those applications never failed.  Not once.  Ever.
    They handled millions of emails a day.
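
    To make the fork-on-connect idea concrete, here is a minimal sketch
    in Python.  It is an illustration only and not the BEST Internet
    proxy code; the names handle_client(), serve() and max_connections,
    and the echo behaviour, are all made up for the example.  The point
    is how little the parent does: accept, fork, close, repeat.

```python
import os
import signal
import socket


def handle_client(conn):
    """Per-connection work; runs only in the forked child."""
    data = conn.recv(1024)
    conn.sendall(data)  # hypothetical echo service, stands in for real work


def serve(listener, max_connections=None):
    """Accept loop: the parent only accepts, forks, and closes.

    max_connections is a hypothetical knob so the loop can be stopped
    in a test; a real server would pass None and loop forever.
    """
    # Ignore SIGCHLD so the kernel reaps exited children (no zombies).
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    served = 0
    while max_connections is None or served < max_connections:
        conn, _addr = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: a crash in here takes down this one connection only.
            listener.close()
            try:
                handle_client(conn)
            finally:
                conn.close()
                os._exit(0)
        # Parent: drop its copy of the socket and go back to accept().
        conn.close()
        served += 1
```

    Anything that can go wrong while talking to a client goes wrong in
    the child.  The parent never touches client data, so there is very
    little left in it that can fail.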

    Insofar as the remaining applications go, I have seen occasional
    failures and certainly failures can occur, but it isn't a 'random'
    occurrence.  Some applications are prone to problems, some never die.
    I have lost older BIND daemons to corruption (not actual segfaults),
    but I don't think I've had a DNS failure for over two years now, and
    that is plenty long enough for me to prefer having the system yell and
    scream at me if it dies rather than restart and forget.

    The only time a service has failed on crater.dragonflybsd.org has been
    when I screwed it up myself, accidentally, or when the hard drive physically
    crashed.  That's it.  I certainly don't spend my time worrying at night
    that random services might not be working!

    But anyhow, back to service failures... service failures do not always
    end in a crash.  Take BIND for example.  It is far more likely that
    BIND's cache will become corrupted than for BIND to actually crash.  A
    simple 'detect that it died and restart' monitor doesn't help you there.
    What you have to do is have a program which actually goes in and uses
    the service for real.  For a web server, that means a program which
    connects to it every minute and retrieves the most complex CGI'd page
    it serves out.  That's the sort of monitoring we need... not this simple
    it-dies-and-we-restart stuff.  Service corruption is the far more likely
    scenario these days.
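
    A monitor of that kind can be quite small.  The following is a
    rough sketch in Python, assuming an HTTP service; the function name
    probe(), its parameters, and the idea of checking for a marker
    string in the page are assumptions made for the example, not an
    existing tool.  It fetches a page the way a real client would and
    checks what comes back, instead of only checking that the process
    exists.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, must_contain, timeout=10.0):
    """Fetch url like a real client and verify the body, not just liveness.

    must_contain is a bytes marker the operator expects in a healthy
    response.  Returns (ok, reason).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read()
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Refused connections, timeouts, HTTP error statuses, bad replies.
        return False, f"request failed: {exc}"
    if status != 200:
        return False, f"unexpected status {status}"
    if must_contain not in body:
        # The server answered, but with the wrong content: the
        # alive-but-corrupted case a restart-on-exit monitor never sees.
        return False, "response body missing expected marker"
    return True, "ok"
```

    Run something like this from cron once a minute against the most
    complicated page the server generates, and have it mail or page
    someone when ok comes back False.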

    And please, Dan, stop trying to compare generic UNIX systems to RTOSes 
    and dedicated custom turnkey systems.  Those systems run dedicated,
    heavily maintained software, whereas you are running run-of-the-mill
    third party software (as are we).  You can hardly expect the same level
    of reliability from a pot-luck dinner as you can from a carefully
    prepared meal.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>