rc and smf

Thu Feb 24 12:51:10 PST 2005

    I think what Bill is trying to say, not very diplomatically, is
    that the truly important pieces of software out there in the world
    don't rely on simple-stupid little monitoring programs to deal 
    with failures.  They do far more sophisticated tests, and the response
    to a failure is far more robust than a worker coming in at 8:00 a.m.
    and finding that the system restarted service X at 4:00 a.m.  With
    these systems if a failure occurs, alarm bells ring, people get
    paged, and the system goes into a failsafe mode.  Sophisticated systems
    have a lot more going on than an easily restartable web server.

    I have my own example.  I designed the hardware and software for the
    telemetry system that Tahoe Donner PUD uses.  This is a medium-sized
    water district serving the Truckee, California area.  It monitors tanks,
    controls pumps, and records 20-40 data points on a two-minute basis
    across 35 sites 24x7.  And has done so for the last 17 years without
    a software-caused failure.

    The base stations are running FreeBSD.  They handle the UI, data
    collection, and reporting only.  The field units are running a completely
    autonomous custom-designed RTOS with memory protection and a hardware
    watchdog.  They are responsible for monitoring tanks and other things,
    controlling pumps, buffering data, and sending alarm pages.  The system
    still works 100% if a base station goes down.  The RTOS abstracts the
    hardware watchdog out to the processes running on the boards.  If any
    process fails to hit its virtualized watchdog, the OS stops hitting
    the actual hardware watchdog, logs the failure, turns off the pumps,
    and the entire board goes through a hard reset.
    There are multiple layers of redundancy and failsafes,
    everything from handling a blown transducer to turning off the pumps
    if a tank level gets too low (or too high) to making sure that failure
    modes from lightning strikes do not report false readings.
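
    To make the virtualized watchdog idea concrete, here is a rough
    sketch in Python.  It is not the code running on the boards (the
    real thing is part of the RTOS and is not shown here); the class,
    method, and process names are all made up for illustration, and
    the timeouts are arbitrary.  It only shows the shape of the logic:
    each process has its own deadline, and the one real watchdog gets
    hit only while every process is on time.

```python
import time


class VirtualWatchdog:
    """Multiplexes one hardware watchdog across several processes.

    Hypothetical illustration only: kick_hardware, pumps_off and log are
    callbacks standing in for whatever the real OS would do.
    """

    def __init__(self, kick_hardware, pumps_off, log, clock=time.monotonic):
        self._kick_hardware = kick_hardware
        self._pumps_off = pumps_off
        self._log = log
        self._clock = clock
        self._timeout = {}    # process name -> allowed seconds between kicks
        self._deadline = {}   # process name -> time by which it must kick
        self.failed = False

    def register(self, name, timeout):
        """Give a process its own virtual watchdog."""
        self._timeout[name] = timeout
        self._deadline[name] = self._clock() + timeout

    def kick(self, name):
        """Called by a process to hit its virtual watchdog."""
        self._deadline[name] = self._clock() + self._timeout[name]

    def tick(self):
        """Called periodically by the OS; True if the hardware was kicked."""
        if self.failed:
            return False
        now = self._clock()
        late = sorted(n for n, d in self._deadline.items() if now > d)
        if late:
            # Stop feeding the hardware watchdog, log, and fail safe.
            # The hardware timer then expires and hard-resets the board.
            self.failed = True
            self._log("watchdog: missed by " + ", ".join(late))
            self._pumps_off()
            return False
        self._kick_hardware()
        return True


class FakeClock:
    """Manually advanced clock so the demo is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def demo():
    clock = FakeClock()
    events = []
    wd = VirtualWatchdog(
        kick_hardware=lambda: events.append("kick"),
        pumps_off=lambda: events.append("pumps off"),
        log=events.append,
        clock=clock,
    )
    wd.register("tank_monitor", 5.0)
    wd.register("pump_control", 5.0)
    for t in (2.0, 4.0):            # both processes healthy
        clock.now = t
        wd.kick("tank_monitor")
        wd.kick("pump_control")
        wd.tick()
    for t in (6.0, 8.0, 10.0):      # pump_control hangs after t=4
        clock.now = t
        wd.kick("tank_monitor")
        wd.tick()
    return events


if __name__ == "__main__":
    print(demo())
```

    In the demo the hardware watchdog is hit four times while both
    processes are inside their deadlines.  Once pump_control has been
    silent for more than its 5 seconds, the failure is logged, the
    pumps are turned off, and no further kicks are issued, which on
    real hardware is what lets the watchdog timer expire and reset
    the board.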

    What I am saying here is that when one is building a highly reliable
    system, there's a lot more to it than writing a little service restarter.

    I get the feeling, Dan, that you are trying to find a magic bullet to
    solve these problems.  No such bullet exists, believe me.  It certainly
    isn't this 'overcommit' stuff.  It isn't an auto-restarter, not alone
    anyway.  What it is, ultimately, is running reliable software AND
    hardware and screaming bloody hell if something goes wrong, and then
    taking further action depending on the situation (e.g. hard reset,
    failsafe, fallback, etc).

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>