"The future of NetBSD" by Charles M. Hannum

Bill Hacker wbh at conducive.org
Sat Sep 2 04:10:35 PDT 2006


Matthew Dillon wrote:

:On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
::     that 75% of the interest in our project has nothing to do with my
::     project goals but instead is directly associated with work being done
::     by our relatively small community.  I truly appreciate that effort
::     because it allows me to focus on the part that is most near and dear
::     to my own heart.
:
:Big question: after all the work that will go into the clustering, other than
:scientific research, what will the average user be able to use such advanced
:capability for?
:
:Jonathon McKitrick
    I held off answering because I became quite interested in what others
    thought the clustering would be used for.

    Let's take a big, big step back and look at what the clustering means
    from a practical standpoint.

    There are really two situations involved here.  First, we certainly
    can allow you to say 'hey, I am going to take down machine A for
    maintenance', giving the kernel the time to migrate all
    resources off of machine A.

    But being able to flip the power switch on machine A without warning,
    or otherwise have a machine fail unexpectedly, is another ball of wax
    entirely.  There are only a few ways to cope with such an event:
    (1) Processes with inaccessible data are killed.  High level programs
	such as 'make' would have to be made aware of this possibility,
	process the correct error code, and restart the killed children
	(e.g. compiles and such).
	In this scenario, only a few programs would have to be made aware
	of this type of failure in order to reap large benefits from a
	big cluster, such as the ability to do massively parallel 
	compiles or graphics or other restartable things (a minimal sketch
	of this restart pattern follows the list below).

    (2) You take a snapshot every once in a while and if a process fails
	on one machine you recover an earlier version of it on another
	(including rolling back any file modifications that were made).
    (3) You run the cpu context in tandem on multiple machines so if one
	machine fails another can take over without a break.  This is
	really an extension of the rollback mechanism, but with additional
	requirements and it is particularly difficult to accomplish with
	a threaded program where there may be direct memory interactions
	between threads.
	Tandem operation is possible with non-threaded programs but all 
	I/O interactions would have to be synchronization points (and thus
	performance would suffer).  Threaded programs would have to be
	aware of the tandem operation, or else we make writing to memory
	a synchronization point too (and even then I am not convinced it
	is possible to keep two wholly duplicate copies of the program
	operating in tandem).
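
    A minimal sketch of the option (1) pattern, in C.  A supervisor forks
    a job; if the child is killed or exits non-zero (say, because the node
    holding its resources went away), the job is simply restarted.  The
    'cc' command line and the retry limit below are placeholders, not
    anything cluster-specific:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define MAX_RETRIES 3

    /* Run one restartable job, retrying if it dies abnormally. */
    static int run_job(char *const argv[])
    {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            pid_t pid = fork();
            if (pid < 0)
                return -1;                      /* could not fork at all */
            if (pid == 0) {
                execvp(argv[0], argv);
                _exit(127);                     /* exec itself failed */
            }

            int status;
            if (waitpid(pid, &status, 0) < 0)
                return -1;
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                return 0;                       /* job finished normally */

            /* Killed by a signal or failed: assume it is restartable. */
            fprintf(stderr, "job died (attempt %d), restarting\n", attempt);
        }
        return -1;
    }

    int main(void)
    {
        /* Hypothetical compile step standing in for one 'make' child. */
        char *job[] = { "cc", "-c", "part.c", "-o", "part.o", NULL };
        return run_job(job) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }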

    Needless to say, a fully redundant system is very, very complex.   My
    2-year goal is NOT to achieve #3.  It is to achieve #1 and also have the
    ability to say 'hey, I'm taking machine BLAH down for maintenance,
    migrate all the running contexts and related resources off of it please'.
    Achieving #2 or #3 in a fully transparent fashion is more like a
    5-year project, and you would take a very large performance hit in
    order to achieve it.

    But let's consider #1... consider the things you actually might want to
    accomplish with a cluster.  Large simulations, huge builds, or simply
    providing resources to other projects that want to do large simulations
    or huge builds.
    Only a few programs like 'make' or the window manager have to actually
    be aware of the failure case in order to be able to restart the killed
    programs and make a cluster useful to a very large class of work product.
    Even programs like sendmail and other services can operate fairly well
    in such an environment.

    So what can the average user do?

    * The average user can support a third party project by providing
      cpu, memory, and storage resources to that project.
      (clearly there are security issues involved, but even so there is
      a large class of problems that can be addressed).
    * The average user wants to leverage the cpu and memory resources 
      of all his networked machines for things like builds (buildworld,
      pkg builds, etc)... batch operations which can be restarted if a
      failure occurs.

      So, consider, the average user has his desktop, and most processes
      are running locally, but he also has other machines and they tie
      into a named cluster based on the desktop.   The cluster would
      'see' the desktop's filesystems but otherwise operate as a separate
      system.  The average user would then be able to log in to the
      'cluster' and run things that take advantage of all the machines'
      resources.
    * The average user might be part of a large project that has access to
      a cluster.  

      The average user would then be able to tie into the cluster, see the
      cluster's resources, and do things using the cluster that he could
      otherwise not do on his personal box.
      Clearly there are security issues here as well, but there is nothing
      preventing us from having a trusted cluster allow untrusted tie-ins
      which are locked to a particular user id... where the goal is not
      necessarily to prevent the cluster from being DOSed, but to prevent
      trusted data from being compromised by an untrusted source.
    * The average user might want to tie into storage reserved for him in
      a third party cluster, for the purposes of doing backups or other
      things.
    I'm sure I can think of other things, but that's the gist from the
    point of view of the 'average user'.
					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>
Any of the above comes close to impossible to do *directly* in software alone.

The best one can do is 'virtualize' it by making things 'transactionally
aware' from the program counter outward, so there are rollback-and-try-again
points (snapshots not only of the fs, but of the program counter, registers,
stack, pipeline, cache, etc.).
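
As a user-level illustration only (nothing like the real thing, which would
also have to snapshot filesystem and device state), setjmp() and longjmp()
show the shape of such a rollback-and-try-again point: the saved jmp_buf
holds the program counter, stack pointer, and callee-saved registers, and a
failure rolls execution back to that saved point:

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf checkpoint;      /* saved PC, stack pointer, registers */
static int attempts;

/* Hypothetical unit of work that fails transiently a couple of times. */
static int do_work(void)
{
    return attempts < 3 ? -1 : 0;
}

int main(void)
{
    /* Establish the rollback point: setjmp() returns 0 on the first
     * pass and 1 each time longjmp() rolls execution back here. */
    if (setjmp(checkpoint) != 0)
        fprintf(stderr, "rolled back, retrying\n");

    attempts++;
    if (do_work() != 0) {
        if (attempts >= 5) {
            fprintf(stderr, "giving up\n");
            return EXIT_FAILURE;
        }
        longjmp(checkpoint, 1);         /* back to the saved context */
    }

    printf("work completed on attempt %d\n", attempts);
    return EXIT_SUCCESS;
}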

However, ALL of the above can be, and *have been*, done in hardware alone.

But there was both a cost and a performance hit that we neither want to
accept nor, given the relative reliability of modern hardware, *need* to
accept.

'back in the days...' when even a half-adder or register was a board full of 
discrete components, even the program counter and registers were either in NV 
RAM, or could be quickly copied to it by a firmware watchdog.

When said storage (delay lines, drum, mag core) was dual-ported (or more..)
a second or subsequent 'CPU' could pick up the PC, registers, stack, etc. - and 
carry on from the exact point of the halt.

This was complex and expensive on Whirlwind II (NORAD SAGE, AN/FSQ-7, using 
vacuum tubes) - requiring massive manual action.

It was automatic (8 'bi-cycles' ~ 2 seconds) with the advent of Burroughs D825 
(Atlas Missile, NORAD BUIC, AN/GSA-51, using transistors).

It became 'civilianized' in the likes of the General Automation SPC-12
power-fail-restart circuit (~ 1/15th to 1/30th of a second).  Even without
dual-ported core, the box would come back online when power was restored and
continue program execution from exactly where it had been when stopped.

DEC, IBM, HP, Tandem, and others have long offered commercial systems with high 
fault tolerance - both interactive (complex), and batch-mode (easier).

None of these, however, came anywhere near as close to what Matt has specified
as those old military and FAA telecom (Harris) systems did.

But the reason was not lack of a means.  It was, and remains, lack of economic 
justification.

Sir Harry Ricardo was famous for a maxim that had to do with internal combustion 
engine design:

"Any mechanism should be as complex as it needs to be to do its job well."

Hacker adds: "..and not one damn bit more!"

Fly-by-wire aircraft, civilian or military, for example, take an entirely 
different approach.

Multiple parallel systems are doing the same task, some 'for a grade', the
others for 'practice' - live, not awaiting start-up - waiting to be handed the
baton so that the programs they are already running simply now have actual
control of the mechanicals.  Synchronization is far easier that way, and the
redundant equipment is a practical necessity in any case.  Hard to 'hot swap'
a module at 38,000 feet - or do so fast enough to avoid a missile or an
encounter with cumulonimbus granitum.

But when a 'commodity' PC MB can run OS/2 or *BSD for months or years w/o a 
failure, when an off-the-shelf PowerBook can make more than one round-the-world 
trip a year for years - yet be powered-down ONLY when airline security rules 
require it, there is just not a huge demand for clustering at the consumer level.

Nor was there ever such demand, even for banking, POS, and the like.  The
systems are simply designed to be both fault-tolerant and manually
recoverable.

Note that Beowulf clusters have 'worked' for ages - but have hardly taken over 
the planet. Too little justification.

Likewise Google's approach: (Tens of?) thousands of commodity 'modules'.

Failure? Back off and try another server.

Fast?  Fast enough. And no more.

Near-real-time?  Not even close by 'puter timekeeping standards!

Complex?  Can't be *too* damn complex. They haven't waited on DragonFly!

;-)
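
Still, the pattern itself is simple enough to sketch.  Assuming nothing more
than a couple of hypothetical replica names, "back off and try another
server" is just a loop over hosts until one answers:

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try each server in turn; return a connected socket, or -1 if none answer. */
static int connect_any(const char *hosts[], int nhosts, const char *port)
{
    for (int i = 0; i < nhosts; i++) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo(hosts[i], port, &hints, &res) != 0)
            continue;                   /* name lookup failed: next server */

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
            freeaddrinfo(res);
            return fd;                  /* first reachable replica wins */
        }
        if (fd >= 0)
            close(fd);
        freeaddrinfo(res);
        fprintf(stderr, "%s unreachable, trying next\n", hosts[i]);
    }
    return -1;
}

int main(void)
{
    /* Placeholder names; a real client would also rank or shuffle them. */
    const char *replicas[] = { "node1.example.org", "node2.example.org" };
    int fd = connect_any(replicas, 2, "80");
    if (fd < 0) {
        fprintf(stderr, "no server available\n");
        return 1;
    }
    close(fd);
    return 0;
}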

All that said, there is a value in what DFLY is attempting.

I just don't see end-users thinking of it as 'clustering'.

Rather, like the original multiple-parallel-route ArpaNet, or a GSM mobile
migrating across 'cells' as one drives across town, the user should be able to
attach and have his 'environment' simply go off, discover, and negotiate for
the best currently available supplier of whatever resources are needed at the
moment.

'On-the-fly', automatic, and totally transparent to the user; 'visible', but
not 'limiting', to the sysop.

More like 'roaming' than clustering.

We expect it of cellphones. We expect it of browsers - particularly search engines.

DFLY could be a better platform for many of these things than what is out there 
now. And it is a huge and growing market.

That's what it is good for, IMNSHO.

YOMD,

Bill Hacker


