"The future of NetBSD" by Charles M. Hannum
Bill Hacker
wbh at conducive.org
Sat Sep 2 04:10:35 PDT 2006
Matthew Dillon wrote:
:On Thu, Aug 31, 2006 at 09:58:59AM -0700, Matthew Dillon wrote:
:: that 75% of the interest in our project has nothing to do with my
:: project goals but instead are directly associated with work being done
:: by our relatively small community. I truly appreciate that effort
:: because it allows me to focus on the part that is most near and dear
:: to my own heart.
:
:Big question: after all the work that will go into the clustering, other than
:scientific research, what will the average user be able to use such advanced
:capability for?
:
:Jonathon McKitrick
I held off answering because I became quite interested in what others
thought the clustering would be used for.
Let's take a big, big step back and look at what the clustering means
from a practical standpoint.
There are really two situations involved here. First, we certainly
can allow you to say 'hey, I am going to take down machine A for
maintenance', giving the kernel the time to migrate all
resources off of machine A.
But being able to flip the power switch on machine A without warning,
or otherwise have a machine fail unexpectedly, is another ball of wax
entirely. There are only a few ways to cope with such an event:
(1) Processes with inaccessible data are killed. High level programs
such as 'make' would have to be made aware of this possibility,
process the correct error code, and restart the killed children
(e.g. compiles and such).
In this scenario, only a few programs would have to be made aware
of this type of failure in order to reap large benefits from a
big cluster, such as the ability to do massively parallel
compiles or graphics or other restartable things.
(2) You take a snapshot every once in a while and if a process fails
on one machine you recover an earlier version of it on another
(including rolling back any file modifications that were made).
(3) You run the cpu context in tandem on multiple machines so if one
machine fails another can take over without a break. This is
really an extension of the rollback mechanism, but with additional
requirements and it is particularly difficult to accomplish with
a threaded program where there may be direct memory interactions
between threads.
Tandem operation is possible with non-threaded programs but all
I/O interactions would have to be synchronization points (and thus
performance would suffer). Threaded programs would have to be
aware of the tandem operation, or else we make writing to memory
a synchronization point too (and even then I am not convinced it
is possible to keep two wholly duplicate copies of the program
operating in tandem).
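Approach (1) is the only one that asks anything of user-level code, and what it asks is small. A toy sketch of the supervisor side, in Python for brevity (the function name and the retry policy are invented for illustration, not any DragonFly interface):

```python
import subprocess
import sys

def run_with_restart(cmd, max_retries=3):
    """Re-run a child that dies abnormally, e.g. because its node was
    lost.  In the subprocess module a child killed by a signal reports
    a negative returncode; a make-like supervisor can treat that as
    'job interrupted, reschedule it' rather than as a compile error."""
    rc = None
    for _ in range(max_retries + 1):
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return 0          # job completed normally
        if rc < 0:
            continue          # killed by a signal: restart the job
        return rc             # ordinary nonzero exit: a real error, give up
    return rc                 # retries exhausted
```

The point is that the distinction between "the job failed" and "the node running the job failed" is already visible to a supervisor, so only programs like 'make' need to learn it.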
Needless to say, a fully redundant system is very, very complex. My
2-year goal is NOT to achieve #3. It is to achieve #1 and also have the
ability to say 'hey, I'm taking machine BLAH down for maintenance,
migrate all the running contexts and related resources off of it please'.
Achieving #2 or #3 in a fully transparent fashion is more like a
5-year project, and you would take a very large performance hit in
order to achieve it.
But let's consider #1... consider the things you actually might want to
accomplish with a cluster. Large simulations, huge builds, or simply
providing resources to other projects that want to do large simulations
or huge builds.
Only a few programs like 'make' or the window manager have to actually
be aware of the failure case in order to be able to restart the killed
programs and make a cluster useful to a very large class of work product.
Even programs like sendmail and other services can operate fairly well
in such an environment.
So what can the average user do ?
* The average user can support a third party project by providing
cpu, memory, and storage resources to that project.
(clearly there are security issues involved, but even so there is
a large class of problems that can be addressed).
* The average user wants to leverage the cpu and memory resources
of all his networked machines for things like builds (buildworld,
pkg builds, etc)... batch operations which can be restarted if a
failure occurs.
So consider: the average user has his desktop, and most processes
are running locally, but he also has other machines and they tie
into a named cluster based on the desktop. The cluster would
'see' the desktop's filesystems but otherwise operate as a separate
system. The average user would then be able to login to the
'cluster' and run things that then take advantage of all the machine's
resources.
* The average user might be part of a large project that has access to
a cluster.
The average user would then be able to tie into the cluster, see the
cluster's resources, and do things using the cluster that he could
otherwise not do on his personal box.
Clearly there are security issues here as well, but there is nothing
preventing us from having a trusted cluster allow untrusted tie-ins
which are locked to a particular user id... where the goal is not
necessarily to prevent the cluster from being DOSed but to prevent
trusted data from being compromisable by an untrusted source.
* The average user might want to tie into storage reserved for him in
a third party cluster, for the purposes of doing backups or other
things.
I'm sure I can think of other things, but that's the gist from the
point of view of the 'average user'.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>
Any of the above verges on the impossible to do *directly* in software alone.
The best one can do is 'virtualize' it by making things 'transactionally aware'
from program-counter outward, so there are rollback-try-again-from points
(snapshots not only of the fs, but of the program counter, registers, stack,
pipeline, cache, etc...).
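The 'rollback-try-again-from point' idea can at least be mocked up at the application level, where the only state to snapshot is ordinary data. A toy Python sketch (names invented for illustration) of the transactional pattern:

```python
import copy

def run_transactional(step, state, max_retries=3):
    """Run 'step' against a private snapshot of 'state'; commit the
    snapshot only if the step completes, otherwise roll back and try
    again.  A truly transparent version would also have to snapshot
    the program counter, registers, and stack, which is exactly what
    makes it so hard to do in software alone."""
    for _ in range(max_retries):
        snapshot = copy.deepcopy(state)   # the rollback point
        try:
            step(snapshot)                # mutations hit the snapshot only
        except Exception:
            continue                      # failure: discard and retry
        state.clear()
        state.update(snapshot)            # success: commit
        return True
    return False                          # gave up; 'state' is untouched
```

Here 'state' is just a dict; the hard part Matt describes is that the kernel's equivalent of 'state' includes the CPU context itself.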
However, ALL of the above can be and *have been*, done in hardware alone.
- But there was both a cost and a performance hit we neither want to accept, nor -
given the relative reliability of modern hardware - *need* to accept.
'back in the days...' when even a half-adder or register was a board full of
discrete components, even the program counter and registers were either in NV
RAM, or could be quickly copied to it by a firmware watchdog.
When said storage (delay lines, drum, mag core) was dual-ported (or more..)
a second or subsequent 'CPU' could pick up the PC, registers, stack, etc. - and
carry on from the exact point of the halt.
This was complex and expensive on Whirlwind II (NORAD SAGE, AN/FSQ-7, using
vacuum tubes) - requiring massive manual action.
It was automatic (8 'bi-cycles' ~ 2 seconds) with the advent of Burroughs D825
(Atlas Missile, NORAD BUIC, AN/GSA-51, using transistors).
It became 'civilianized' in the likes of the General Automation SPC-12
power-fail-restart circuit (~ 1/15th to 1/30th of a second). Even without
dual-ported core, the box would come back online when power was restored and
continue program execution from exactly where it had been when stopped.
DEC, IBM, HP, Tandem, and others have long offered commercial systems with high
fault tolerance - both interactive (complex), and batch-mode (easier).
None of these, however, came anywhere near as close to what Matt has specified
as those old Military and FAA telecoms (Harris) systems did.
But the reason was not lack of a means. It was, and remains, lack of economic
justification.
Sir Harry Ricardo was famous for a maxim that had to do with internal combustion
engine design:
"Any mechanism should be as complex as it needs to be to do its job well."
Hacker adds: "..and not one damn bit more!"
Fly-by-wire aircraft, civilian or military, for example, take an entirely
different approach.
Multiple parallel systems are doing the same task, some 'for a grade' the others
for 'practice' - live, not awaiting start-up - waiting to be handed the baton so
that the programs they are already running simply now have actual control of the
mechanicals. Synchronization is far easier that way, and the redundant equipment
is a practical necessity in any case. Hard to 'hot swap' a module at 38,000 feet
- or do so fast enough to avoid a missile or an encounter with cumulonimbus
granitum.
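That baton-passing scheme can be sketched in miniature: every channel computes every frame, so a handover needs no state transfer or cold start. (The class and failure model below are invented for illustration.)

```python
class HotStandby:
    """Fly-by-wire style redundancy in miniature: every channel
    computes every frame ('for practice'), so handing control to a
    survivor is just moving the baton, not starting anything up."""

    def __init__(self, channels):
        self.channels = channels   # interchangeable control functions
        self.active = 0            # index of the channel 'for a grade'

    def step(self, sensors):
        outputs = {}
        for i, channel in enumerate(self.channels):
            try:
                outputs[i] = channel(sensors)   # all channels run, always
            except Exception:
                pass                            # channel failed this frame
        if self.active not in outputs:          # active channel died:
            self.active = min(outputs)          # hand the baton over
        return outputs[self.active]             # only this output 'counts'
```

Synchronization is cheap precisely because the standbys were never idle, which is the point Matt's tandem case (#3) has to pay for in software.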
But when a 'commodity' PC MB can run OS/2 or *BSD for months or years w/o a
failure, when an off-the-shelf PowerBook can make more than one round-the-world
trip a year for years - yet be powered-down ONLY when airline security rules
require it, there is just not a huge demand for clustering at the consumer level.
Nor was there ever much such demand even for banking, POS, and the like. The systems are
simply designed to be both fault-tolerant, and manually recoverable.
Note that Beowulf clusters have 'worked' for ages - but have hardly taken over
the planet. Too little justification.
Likewise Google's approach: (Tens of?) thousands of commodity 'modules'.
Failure? Back off and try another server.
Fast? Fast enough. And no more.
Near-real-time? Not even close by 'puter timekeeping standards!
Complex? Can't be *too* damn complex. They haven't waited on DragonFly!
;-)
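In code, that recovery policy is almost trivially simple, which is the point (the server names and the request callable below are hypothetical):

```python
def fetch_with_failover(request, replicas):
    """'Back off and try another server': do not repair or wait for a
    dead replica; just move down a list of interchangeable commodity
    servers until one answers."""
    last_error = None
    for server in replicas:
        try:
            return request(server)      # first healthy replica wins
        except ConnectionError as err:
            last_error = err            # dead replica: try the next
    raise last_error or ConnectionError("no replicas configured")
```

No rollback, no tandem execution, no migration: the economics favor cheap redundancy plus a retry loop over complex transparency.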
All that said, there is a value in what DFLY is attempting.
I just don't see end-users thinking of it as 'clustering'.
Rather, like the original multiple-parallel-route ArpaNet, or a GSM mobile
migrating across 'cells' as one drives across town - the user should be able to
attach, and his 'environment' should simply go off, discover, and negotiate for
the best currently available supplier of whatever resources are needed at the moment.
'On-the-fly', automatic, totally transparent to the user, 'visible', but not
'limiting' to the sysop.
More like 'roaming' than clustering.
We expect it of cellphones. We expect it of browsers - particularly search engines.
DFLY could be a better platform for many of these things than what is out there
now. And it is a huge and growing market.
That's what it is good for, IMNSHO.
YOMD,
Bill Hacker