rc and smf
Matthew Dillon
dillon at apollo.backplane.com
Thu Feb 24 11:41:35 PST 2005
Hmm. Well, I have to say that in my opinion a service failure is a
critical bug in the application. I usually go in and fix the application
software rather than write monitoring programs for it (other than to
tell me if it has failed). Most service-oriented applications fork()
on connect (a DNS cache being an exception), and those that have the
option of forking or running threaded I usually tell to fork. This
greatly narrows the amount of code that actually has to run in the
parent's connection-accepting loop and works as well as or better than a
service monitor. The proxies I wrote at BEST Internet a long time ago
all did that, and those applications never failed. Not once. Ever.
They handled millions of emails a day.
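To make the fork-on-connect idea concrete, here is a minimal sketch in Python. It is not the code of the BEST Internet proxies, which the post does not show. The names handle_client, reap_children and serve_forever, the echo behaviour, and the max_connections parameter are all assumptions made for illustration. The point is only how little code runs in the parent: accept, fork, close, repeat.

```python
import os
import socket


def handle_client(conn):
    """Per-connection work; runs only in the child (hypothetical echo handler)."""
    data = conn.recv(1024)
    conn.sendall(data)  # stands in for real protocol handling


def reap_children():
    """Collect exit statuses of finished children so zombies do not pile up."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none has exited yet


def serve_forever(listener, max_connections=None):
    """Accept loop: the parent only accepts, forks, and closes its copy."""
    served = 0
    while max_connections is None or served < max_connections:
        reap_children()
        conn, _addr = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: a bug here takes down one connection, not the service.
            listener.close()
            status = 0
            try:
                handle_client(conn)
            except Exception:
                status = 1
            finally:
                conn.close()
            os._exit(status)
        # Parent: drop its copy of the socket and go back to accept().
        conn.close()
        served += 1
```

Everything that parses untrusted input lives in handle_client, so a crash there costs a single child process while the listening parent keeps running.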
Insofar as the remaining applications go, I have seen occasional
failures and certainly failures can occur, but it isn't a 'random'
occurrence. Some applications are prone to problems, some never die.
I have lost older BIND daemons to corruption (not actual segfaults),
but I don't think I've had a DNS failure for over two years now, and
that is plenty long enough for me to prefer having the system yell and
scream at me if it dies rather than restart and forget.
The only time a service has failed on crater.dragonflybsd.org has been
when I screwed it up myself, accidentally, or when the hard drive physically
crashed. That's it. I certainly don't spend my time worrying at night
that random services might not be working!
But anyhow, back to service failures... service failures do not always
end in a crash. Take BIND for example. It is far more likely that
BIND's cache will become corrupted than for BIND to actually crash. A
simple 'detect that it died and restart' monitor doesn't help you there.
What you have to do is have a program which actually goes in and uses
the service for real, e.g. for a web server, a program which connects
to it every minute and retrieves the most complex CGI'd page it
serves out. That's the sort of monitoring we need... not this simple
it-dies-and-we-restart stuff. Service corruption is the far more likely
scenario these days.
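As a rough sketch of what such a "use the service for real" check might look like for a web server, here is a small Python probe. The names probe and monitor, the expected-content check, and the 60-second interval are assumptions for illustration, not an existing tool. It fetches a page and reports failure if the server is down, slow, or answering with the wrong content, and it only complains; it does not restart anything.

```python
import http.client
import sys
import time
import urllib.request

# Bypass any configured proxy so the probe talks to the server itself.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, expected, timeout=10.0):
    """Return True only if the page loads and contains the expected bytes."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status == 200 and expected in body
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, HTTP error status, or a mangled reply.
        return False


def monitor(url, expected, interval=60.0):
    """Probe forever and complain on stderr whenever a check fails."""
    while True:
        if not probe(url, expected):
            print(f"ALERT: {url} is down or serving bad content",
                  file=sys.stderr)
        time.sleep(interval)
```

A call such as monitor("http://www.example.com/cgi-bin/report", b"Total:") would check once a minute; the URL and marker text here are placeholders. Pointing it at the most complex page the server produces exercises the most code per check.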
And please, Dan, stop trying to compare generic UNIX systems to RTOSes
and dedicated custom turnkey systems. Those systems run dedicated,
heavily maintained software, whereas you are running run-of-the-mill
third-party software (as are we). You can hardly expect the same level
of reliability from a pot-luck dinner as you can from a carefully
prepared meal.
-Matt