VFS ROADMAP (and vfs01.patch stage 1 available for testing)
Martin P. Hellwig
mhellwig at xs4all.nl
Tue Aug 17 06:33:55 PDT 2004
Janet Sullivan wrote:
How would such a system gracefully deal with hardware failures? What
happens when one of the nodes in the cluster dies?
In my opinion hardware failure is always problematic.
If you want high availability you always have to deal with fail-over
systems such as duplicated hardware (RAID, dual power and an identical
stand-by system for fail-over).
The problem with this is that when nothing fails you have a lot of CPU
power doing nothing, a huge waste of resources.
So this is quite an interesting problem: on one side you wish to use all
available performance and on the other you want fail-over.
The problem is also organizational: you cannot have more than 1 leader,
and if you have more you have to agree on a leader among those leaders.
That is a problem because the leader is a single point of failure just
waiting to happen. So there is no easy way to prevent performance loss
there; the best imaginable way would be:
One DVM master, which only controls resource spreading and dynamically
appoints DVM backups.
2 DVM backups, which are in live sync with the master and each other;
they perform normal distributed tasks too.
When failure occurs:
Scenario 1, a DVM backup is lost
View from Master & Backup:
The Master detects that a backup DVM is no longer accessible. It asks
the other backup DVM whether the missing one is accessible via it. If
not, the node is considered lost, another node is taken from the list
and appointed as a DVM backup, and normal operation resumes.
If the Master can reach the missing backup via the other backup node, it
waits one time-out (5 minutes or so) and checks whether it has a direct
connection again. If it has, it resumes normal operation; if not, it
declares the backup node lost and another one is taken in its place.
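A minimal sketch of the Master's decision above, in Python. The names
(Verdict, classify_backup, after_timeout) are hypothetical and only
illustrate the rule; how reachability is actually probed is left out.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"                    # direct connection works
    WAIT_AND_RECHECK = "wait_and_recheck"  # only reachable via the other backup
    LOST = "lost"                          # replace it from the list


def classify_backup(direct_ok: bool, via_peer_ok: bool) -> Verdict:
    """First check, made when the Master notices a problem."""
    if direct_ok:
        return Verdict.HEALTHY
    if via_peer_ok:
        # Reachable through the other backup: wait one time-out first.
        return Verdict.WAIT_AND_RECHECK
    return Verdict.LOST


def after_timeout(direct_ok_now: bool) -> Verdict:
    """Second check, made after the time-out (5 minutes or so)."""
    return Verdict.HEALTHY if direct_ok_now else Verdict.LOST
```

Keeping the two checks as separate functions mirrors the two steps in
the text: an immediate verdict, and a final one after the time-out.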
On a regular basis, say every 5 minutes, both DVM backups try to connect
to the lost DVM backup. If a connection is made (this can also be
another list node proxying the message from the inaccessible node), the
lost DVM backup is told to lose its status as DVM Management node and is
put back on the list if it is fully accessible and functional again.
If it is not fully functional the machine will be on hold until it is
fully functional again (a message is sent to the administrator stating
the problem).
<side note> The DVM Managing hardware priority list is a pool of
_servers_ which are meant to be on permanently, and so have the
capability to be a cluster manager. Priority is given to hardware with a
good combination of available resources (network and CPU being more
important than disk space) and high uptime with little downtime
</side note>
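One possible way to order the priority list from the side note, as a
sketch. The weights and the 0..1 normalisation are my assumptions; the
only thing taken from the text is that network and CPU count for more
than disk space and that uptime matters.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    network: float       # available network capacity, normalised to 0..1
    cpu: float           # available CPU, normalised to 0..1
    disk: float          # available disk space, normalised to 0..1
    availability: float  # fraction of time the server was up, 0..1


# Assumed weights: network and CPU outweigh disk, uptime counts heavily.
WEIGHTS = {"network": 3.0, "cpu": 3.0, "disk": 1.0, "availability": 3.0}


def score(c: Candidate) -> float:
    """Weighted sum of the candidate's properties."""
    return sum(weight * getattr(c, field) for field, weight in WEIGHTS.items())


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Best candidate for a DVM Management role first."""
    return sorted(candidates, key=score, reverse=True)
```

A node that comes back after a long outage would simply carry a lower
availability value and sink down the list.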
If after 24 hours the node is still unavailable, it will no longer be
contacted and a mail is sent to the administrator to find out what the
problem is. If the node appears again within that time and contact is
established, the node is told to lose its status as DVM Management node
and is put back on the list (if fully functional), of course quite low
because it has a bad record of long inaccessibility. If it is not fully
functional the same procedure is followed as previously described.
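The retry timing above can be sketched as follows. The function name
and the returned labels are hypothetical; the 5 minute interval and the
24 hour limit are the figures from the text.

```python
RETRY_INTERVAL_S = 5 * 60        # "every 5 minutes"
GIVE_UP_AFTER_S = 24 * 60 * 60   # "after 24 hours"


def action_at(elapsed_s: int) -> str:
    """What to do elapsed_s seconds after the backup was declared lost."""
    if elapsed_s >= GIVE_UP_AFTER_S:
        # Stop contacting the node and mail the administrator.
        return "mail_admin_and_stop"
    if elapsed_s % RETRY_INTERVAL_S == 0:
        return "retry"
    return "idle"
```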
View from lost Backup:
The node, if still on, finds that it cannot reach the Master and the
other Backup. It asks the list nodes to contact a DVM Management node on
its behalf. If these list nodes can contact another Master DVM
Management node, it receives the message telling it to drop its
Management status, and it stays on hold until it is fully accessible
and functional.
When that requirement is met it applies to the list again.
If it can't access any other list nodes, it holds its state until there
is a contactable list node to confirm its status. It could be advisable
that after a longer period of solitude (say 3 days) the lost Backup DVM
node writes its changes locally and shuts down.
If the Backup has contact with other list nodes which cannot confirm its
status, and those list nodes can't connect to the original Master and
the other Backup, the remaining Backup follows Scenario 2.
Scenario 2, the DVM Master and one DVM Backup are lost:
The backup detects that the other DVM Management nodes are unavailable.
It queries the list nodes (again) to contact another DVM Management node
for it. If the list nodes have access to another DVM Management node
besides itself, the DVM Backup follows scenario 1. Otherwise it
proclaims itself semi-Master and waits until there are enough nodes on
the DVM Managing hardware priority list (3 nodes: one spare, two for
backup) to enlist two backups. Then it promotes itself from semi-Master
to Master and resumes operation while setting a flag that this is
generation x+1 of the DVM Management.
It then follows the procedure of scenario 1 to recover the lost Backup
nodes.
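The promotion steps of scenario 2 could look roughly like this. The
class and function names are made up for illustration; the threshold of
3 list nodes and the generation bump are the ones described above.

```python
from dataclasses import dataclass

REQUIRED_LIST_NODES = 3  # from the text: one spare, two for backup


@dataclass(frozen=True)
class ManagementState:
    role: str        # "backup", "semi-master" or "master"
    generation: int  # the generation flag of the DVM Management


def on_peers_lost(state: ManagementState,
                  other_mgmt_reachable: bool) -> ManagementState:
    """The backup has lost both the Master and the other Backup."""
    if other_mgmt_reachable:
        # Another Management node exists: scenario 1 applies instead.
        return state
    return ManagementState("semi-master", state.generation)


def try_promote(state: ManagementState,
                available_list_nodes: int) -> ManagementState:
    """A semi-Master becomes Master once enough list nodes are available."""
    if state.role == "semi-master" and available_list_nodes >= REQUIRED_LIST_NODES:
        return ManagementState("master", state.generation + 1)
    return state
```

The generation only increases at the moment of promotion, so a
semi-Master that never finds enough nodes keeps the old flag.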
Scenario 3, both backups are lost:
The Master detects that both backups are lost and tries the usual
reconnect procedure. If it still has no access, it proclaims itself
backup and follows scenario 2.
Scenario 4, the Master is lost.
View from backups:
Both backup nodes try to contact the master directly and via list
nodes. If the master is not accessible (given a time-out of 1 minute or
so), a new master is selected from the list and operation resumes. The
lost master is by now declared a lost Backup DVM Management node, so
the procedure in scenario 1 can be followed for it.
Scenario 5, 2 or more Masters reconnect after a major network problem.
All DVM Management nodes are supposed to merge with the DVM Management
nodes that have the lowest generation flag. If 2 equal generation flags
collide, the one with the least uptime merges into the one with the
longest uptime.
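The merge rule of scenario 5 fits in a few lines. This is only a sketch
with hypothetical names (Master, surviving_master); it picks the Master
that the others should merge into.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Master:
    name: str
    generation: int  # generation flag of this DVM Management
    uptime_s: int    # uptime in seconds


def surviving_master(masters: list[Master]) -> Master:
    """Lowest generation wins; on a tie, the longest uptime wins."""
    # Negating the uptime makes min() prefer the longest-running node.
    return min(masters, key=lambda m: (m.generation, -m.uptime_s))
```

Every other Master in the list would then give up its role and merge
into the returned one.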
---
So this is the graceful part; now the part on how you can do this
without interrupting services.
In my view every node within the cluster gives away a certain part of
its resources. These resources are distributed and controlled by the DVM
Management nodes. The DVM Management is a service on the native system,
and only the master actively shares out the resources; the backups are
just fail-over as described above.
The DVM can abstract the resources into one (or more) virtual machines
which are installed with DragonFly (or another BSD) with an adapted
kernel which is aware of its "virtual" state.
If you need full fail-over you configure 2 virtual machines on the
cluster. Although you lose performance, you still have the advantage
that when more performance is needed you just pop more hardware onto the
network (for example a PXE boot image with a DragonFly install
pre-configured to be part of a cluster).
I like to compare this "future" technology with a (DragonFly) faceted
eye: although there are many facets there are only 2 eyes, and their
fields of view overlap almost completely :-)
A nice thought is scalability: you could combine a few DragonFly
installs on the virtual hardware into a cluster again, and so on and so
on.
I have the schematics quite clear in my mind. I hope you guys can follow
me (if I haven't bored you away anyway :-) ) because I know that my
explanation of this idea is quite bad and I am not sure which logical
errors I've made. So I hope you can point them out to me.
mph