rc and smf
Matthew Dillon
dillon at apollo.backplane.com
Thu Feb 24 12:51:10 PST 2005
I think what Bill is trying to say, not very diplomatically, is
that the truly important pieces of software out there in the world
don't rely on simple-stupid little monitoring programs to deal
with failures. They do far more sophisticated tests, and the response
to a failure is far more robust than a worker coming in at 8:00 a.m.
and finding that the system restarted service X at 4:00 a.m. With
these systems if a failure occurs, alarm bells ring, people get
paged, and the system goes into a failsafe mode. Sophisticated systems
have a lot more going on than an easily restartable web server.
I have my own example. I designed the hardware and software for the
telemetry system that Tahoe Donner PUD uses. This is a medium-sized
water district serving the Truckee, California area. The system monitors
tanks, controls pumps, and records 20-40 data points every two minutes
across 35 sites 24x7. It has done so for the last 17 years without
a software-caused failure.
The base stations are running FreeBSD. They handle the UI, data
collection, and reporting only. The field units run a completely
autonomous, custom-designed RTOS with memory protection and a hardware
watchdog. They are responsible for monitoring tanks and other things,
controlling pumps, buffering data, and sending alarm pages. The system
still works 100% if a base station goes down. The RTOS abstracts the
hardware watchdog out to the processes running on the boards. If any
process fails to hit its virtualized watchdog, the OS stops hitting the
real hardware watchdog, logs the failure, turns off the pumps, and the
entire board goes through a hard reset. There are multiple layers of
redundancy and failsafes,
everything from handling a blown transducer to turning off the pumps
if a tank level gets too low (or too high) to making sure that failure
modes from lightning strikes do not report false readings.
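The virtualized watchdog scheme described above can be sketched roughly
as follows. This is only an illustrative simulation in Python, not the
actual RTOS code: the class name, method names, timeout value, and
process names are all hypothetical, and time is passed in explicitly so
the logic is easy to follow. The idea it shows is that the supervisor
strobes the hardware watchdog only while every registered process has
hit its own virtual watchdog recently enough.

```python
class VirtualWatchdogSupervisor:
    """Hypothetical stand-in for the OS side of a virtualized watchdog."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = {}         # process name -> time of last virtual kick
        self.hw_kicks = 0           # stands in for strobing the real watchdog
        self.pumps_on = True
        self.log = []               # (time, [late process names])
        self.reset_pending = False  # hardware watchdog will expire and reset

    def register(self, name, now):
        # A newly registered process starts with a fresh deadline.
        self.last_kick[name] = now

    def kick(self, name, now):
        # Called by a process to hit its own virtual watchdog.
        self.last_kick[name] = now

    def tick(self, now):
        """Periodic supervisor check. Returns True if the hardware
        watchdog was hit, False if it was deliberately left alone."""
        if self.reset_pending:
            return False
        late = sorted(name for name, t in self.last_kick.items()
                      if now - t > self.timeout_s)
        if late:
            # Fail safe: log the failure, turn off the pumps, and stop
            # hitting the hardware watchdog so the board hard-resets.
            self.log.append((now, late))
            self.pumps_on = False
            self.reset_pending = True
            return False
        self.hw_kicks += 1
        return True


# Example run: pump_control stops kicking after t=4.0.
wd = VirtualWatchdogSupervisor(timeout_s=5.0)
wd.register("tank_monitor", now=0.0)
wd.register("pump_control", now=0.0)
wd.kick("tank_monitor", now=4.0)
wd.kick("pump_control", now=4.0)
first = wd.tick(now=4.5)    # everyone is on time
wd.kick("tank_monitor", now=8.0)
second = wd.tick(now=10.0)  # pump_control is 6.0s overdue
```

The point of the design is that a single hung process is enough to
withhold the hardware kick, so the failure path never depends on the
failed software noticing its own problem.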
What I am saying here is that when one is building a highly reliable
system, there's a lot more to it than writing a little service restarter.
I get the feeling, Dan, that you are trying to find a magic bullet to
solve these problems. No such bullet exists, believe me. It certainly
isn't this 'overcommit' stuff. It isn't an auto-restarter, not alone
anyway. What it is, ultimately, is running reliable software AND
hardware and screaming bloody hell if something goes wrong, and then
taking further action depending on the situation (e.g., hard reset,
failsafe, fallback, etc).
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>