VFS ROADMAP (and vfs01.patch stage 1 available for testing)

Martin P. Hellwig mhellwig at xs4all.nl
Tue Aug 17 06:33:55 PDT 2004


Janet Sullivan wrote:

> How would such a system gracefully deal with hardware failures?  What
> happens when one of the nodes in the cluster dies?

In my opinion, hardware failure is always problematic.
If you want high availability, you always have to deal with fail-over
mechanisms such as duplicated hardware (RAID, dual power supplies and
an identical stand-by system for fail-over).
The problem with this is that when nothing fails, you have a lot of CPU
power doing nothing, which is a huge waste of resources.
So this is quite an interesting problem: on one side you want to use
all available performance, and on the other hand you want fail-over.

The problem is also organizational: you cannot have more than one
leader, and if you have more, you have to agree on one leader among
them. Which is a problem, because that leader is a single point of
failure just waiting to happen. So there is no easy way to prevent
performance loss there; the best way I can imagine would be:

One DVM master, which only controls resource spreading and dynamically
appoints the DVM backups.
Two DVM backups, which are in live sync with the master and with each
other; they perform normal distributed tasks too.
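
To make the roles a bit more concrete, here is a rough C sketch of how
the DVM Management topology could be represented. All type and field
names are placeholders I made up for illustration, not existing code:

/* Sketch only: hypothetical types for the DVM Management topology --
 * one master plus two live-synced backups, with a priority list of
 * candidate nodes to draw replacements from. */
#include <time.h>

enum dvm_role {
        DVM_ROLE_LIST,          /* ordinary node on the priority list */
        DVM_ROLE_BACKUP,        /* live-synced backup, also does normal work */
        DVM_ROLE_MASTER         /* spreads resources, appoints backups */
};

struct dvm_node {
        int             node_id;
        enum dvm_role   role;
        time_t          last_seen;      /* last successful contact */
        time_t          uptime_start;   /* used for list priority and merging */
};

struct dvm_management {
        struct dvm_node *master;
        struct dvm_node *backup[2];     /* both in live sync with the master */
        struct dvm_node *list;          /* DVM Managing hardware priority list */
        int              list_len;
        int              generation;    /* bumped when the group is re-formed */
};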

When failure occurs:

Scenario 1: a DVM backup is lost

View from Master & Backup:
The master detects that a backup DVM is no longer accessible and asks
the other backup DVM whether the lost node is reachable through it. If
not, the node is considered lost, another node is taken from the list
and appointed as DVM backup, and normal operation resumes.

If the lost backup is reachable via the remaining backup node, the
master waits one time-out (5 minutes or so) and checks whether it has
a direct connection again. If it has, it resumes normal operation; if
not, it declares the backup node lost and another one is taken in its
place.

On a regular basis, say every 5 minutes, both DVM backups try to
connect to the lost DVM backup. If a connection is made (this can also
be another list node proxying the message from the inaccessible node),
the lost DVM backup is told to drop its status as DVM Management node
and is put back on the list, provided it is fully accessible and
functional again.
If it is not fully functional, the machine is put on hold until it is
fully functional again (and a message is sent to the administrator
stating the problem).

<side note> The DVM Managing hardware priority list is a pool of
_servers_ which are meant to be on permanently and therefore have the
capability to act as a cluster manager. Priority is given to hardware
with a good combination of available resources (network and CPU being
more important than disk space) and long uptime with little downtime.
</side note>

If after 24 hours the node is still unavailable, it will no longer be
contacted and a mail is sent to the administrator to find out what the
problem is. If the node reappears within that time and contact is
initiated, the node is told to drop its status as DVM Management node
and is put back on the list (if fully functional), of course quite low
on it, because it has a bad record of long inaccessibility. If it is
not fully functional, the same procedure is followed as described
previously.
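
In C the master side of this scenario could look roughly like the
sketch below. Only the time-outs (5-minute retry, 24-hour give-up)
come from the description above; the helper functions are made-up
placeholders:

/* Sketch: master-side reaction to a lost backup (scenario 1).
 * can_reach_direct()/can_reach_via() and the other helpers are
 * hypothetical; only the time-outs come from the text above. */
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define DVM_RETRY_INTERVAL   (5 * 60)        /* retry every 5 minutes  */
#define DVM_GIVE_UP_AFTER    (24 * 60 * 60)  /* stop trying after 24 h */

bool can_reach_direct(int node);             /* direct connection test      */
bool can_reach_via(int node, int proxy);     /* ask the other backup / list */
void appoint_backup_from_list(void);         /* take next node off the list */
void demote_to_list(int node);               /* drop its Management status  */
void mail_admin(const char *msg);

void master_handle_lost_backup(int lost, int other_backup)
{
        time_t lost_at = time(NULL);

        if (!can_reach_via(lost, other_backup)) {
                /* Nobody can see it: declare it lost and replace it. */
                appoint_backup_from_list();
        } else {
                /* Only reachable indirectly: wait one time-out, re-check. */
                sleep(DVM_RETRY_INTERVAL);
                if (can_reach_direct(lost))
                        return;                 /* resume normal operation */
                appoint_backup_from_list();
        }

        /* Keep retrying the lost node every 5 minutes, for 24 hours. */
        while (time(NULL) - lost_at < DVM_GIVE_UP_AFTER) {
                if (can_reach_direct(lost) ||
                    can_reach_via(lost, other_backup)) {
                        demote_to_list(lost);   /* back on the list if functional */
                        return;
                }
                sleep(DVM_RETRY_INTERVAL);
        }
        mail_admin("DVM backup unreachable for 24 hours");
}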

View from the lost Backup:
If the node is still up, it finds that it cannot reach the Master and
the other Backup. It asks the list nodes to contact a DVM Management
node on its behalf; if those list nodes can reach another Master DVM
Management node, it takes that as the message to drop its Management
status and stays on hold until it is fully accessible and functional.
When that requirement is met, it solicits the list again.
If it cannot reach any other list nodes, it holds its state until there
is a contactable list node to confirm its status. It may be advisable
that after a longer period of solitude (say 3 days) the lost Backup DVM
node writes its changes locally and shuts down.
If the Backup has contact with other list nodes which cannot confirm
its status, and those list nodes cannot connect to the original Master
and the other Backup, the remaining Backup follows Scenario 2.
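
The same scenario from the isolated backup's point of view, again only
as a sketch; the 3-day shutdown limit is from the paragraph above, the
helpers are invented:

/* Sketch: scenario 1, "view from the lost Backup".
 * All helper functions are hypothetical placeholders. */
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define DVM_SOLITUDE_LIMIT  (3 * 24 * 60 * 60)  /* ~3 days of isolation */
#define DVM_POLL_INTERVAL   (5 * 60)

bool list_nodes_reachable(void);           /* can we see any list node?        */
bool list_confirms_other_master(void);     /* do they see a Master besides us? */
void drop_management_status(void);         /* go back to being a list node     */
void follow_scenario_2(void);              /* promote ourselves, see below     */
void flush_local_changes_and_shutdown(void);

void lost_backup_main(void)
{
        time_t isolated_since = time(NULL);

        for (;;) {
                if (list_nodes_reachable()) {
                        if (list_confirms_other_master())
                                drop_management_status();  /* hold until functional */
                        else
                                follow_scenario_2();       /* nobody else is Master */
                        return;
                }
                /* No list node reachable: hold our state, but not forever. */
                if (time(NULL) - isolated_since > DVM_SOLITUDE_LIMIT) {
                        flush_local_changes_and_shutdown();
                        return;
                }
                sleep(DVM_POLL_INTERVAL);
        }
}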

Scenario 2: the DVM Master and one DVM Backup are lost.
The remaining backup detects that the other DVM Management nodes are
unavailable. It queries the list nodes (again) to contact another DVM
Management node for it. If the list nodes have access to another DVM
Management node besides itself, the DVM Backup follows scenario 1;
otherwise it proclaims itself semi-Master and waits until there are
enough nodes on the DVM Managing hardware priority list (3 nodes: one
spare, two for backup) to enlist two backups. Then it promotes itself
from semi-Master to Master and resumes operation, setting a flag that
this is the x+1 generation of DVM Management.
It then follows the procedure of scenario 1 to recover the lost Backup
nodes.
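
A small sketch of that promotion step; only the "3 nodes on the list"
threshold and the generation increment come from the scenario above,
the rest is assumed:

/* Sketch: scenario 2 -- the surviving backup promotes itself to Master.
 * priority_list_length() and enlist_backups() are hypothetical helpers. */
#include <unistd.h>

enum dvm_role {                 /* the roles from the earlier sketch,     */
        DVM_ROLE_LIST,          /* plus an intermediate semi-Master state */
        DVM_ROLE_BACKUP,
        DVM_ROLE_SEMI_MASTER,
        DVM_ROLE_MASTER
};

int  priority_list_length(void);        /* nodes available on the list */
void enlist_backups(int count);         /* appoint new backup nodes    */

struct dvm_self {
        enum dvm_role role;
        int           generation;
};

void promote_to_master(struct dvm_self *self)
{
        self->role = DVM_ROLE_SEMI_MASTER;

        /* Wait until the priority list holds at least 3 candidates:
         * two for backup, one spare. */
        while (priority_list_length() < 3)
                sleep(60);

        enlist_backups(2);
        self->role = DVM_ROLE_MASTER;
        self->generation += 1;          /* flag this as the x+1 generation */
}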

Scenario 3: both backups are lost.
The Master detects that both backups are lost and tries the usual
reconnect procedure. If it still has no access, it proclaims itself a
backup and follows scenario 2.

Scenario 4: the Master is lost.
View from the backups:
Both backup nodes try to contact the master directly and via the list
nodes. If the master is not accessible (give it a time-out of 1 minute
or so), a new master is selected from the list and operation is
resumed. The lost master, which by now is treated as a lost Backup DVM
Management node, can then be handled with the procedure of scenario 1.
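
From the backups' side this could look roughly like the following;
only the ~1 minute time-out comes from the text, the helpers are
assumptions:

/* Sketch: scenario 4 -- both backups lose the master. */
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define DVM_MASTER_TIMEOUT  60          /* seconds before giving up on the master */

bool master_reachable_direct(void);
bool master_reachable_via_list(void);
void select_new_master_from_list(void);     /* agree on a replacement    */
void treat_old_master_as_lost_backup(void); /* then run scenario 1 on it */

void backups_handle_lost_master(void)
{
        time_t since = time(NULL);

        while (time(NULL) - since < DVM_MASTER_TIMEOUT) {
                if (master_reachable_direct() || master_reachable_via_list())
                        return;         /* false alarm, resume operation */
                sleep(5);
        }
        select_new_master_from_list();
        treat_old_master_as_lost_backup();
}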

Scenario 5: two or more Masters reconnect after a major network
problem.
All DVM Management nodes are supposed to merge with the DVM Management
group that has the lowest generation flag. If two equal generation
flags collide, the one with the least uptime merges with the one with
the longest uptime.
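
That merge rule can be written down as one small comparison; this is
just an illustration of the rule, not existing code:

/* Sketch: scenario 5 -- deciding which DVM Management group survives a
 * split-brain merge: lower generation wins; on a tie, longer uptime wins. */
#include <stdbool.h>
#include <time.h>

struct dvm_group {
        int    generation;
        time_t uptime_start;            /* earlier start == longer uptime */
};

/* Returns true if 'a' survives and 'b' has to merge into it. */
bool dvm_group_wins(const struct dvm_group *a, const struct dvm_group *b)
{
        if (a->generation != b->generation)
                return a->generation < b->generation;
        /* Equal generation flags: the longest uptime wins. */
        return a->uptime_start < b->uptime_start;
}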

---
So much for the graceful part; now for how you can do this without
interrupting services.

In my view, every node within the cluster gives away a certain part of
its resources. These resources are distributed and controlled by the
DVM Management nodes; the DVM Management is a service on the native
system, and only the master actively shares out the resources, the
backups being just fail-over as described above.
The DVM can abstract the resources into one (or more) virtual machines,
which are installed with DragonFly (or another BSD) with an adapted
kernel that is aware of its "virtual" state.
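
To make the resource-donation idea a bit more concrete, here is one
possible (purely illustrative, made-up) shape for what a node
advertises to the DVM master and how donations back a virtual machine:

/* Sketch: what a node might donate to the DVM and how the master could
 * aggregate donations into a virtual machine.  Purely illustrative. */
#include <stdint.h>

struct dvm_donation {
        int      node_id;
        uint32_t cpu_share_pct;         /* share of CPU time given away  */
        uint64_t mem_bytes;             /* memory handed to the DVM      */
        uint64_t disk_bytes;            /* disk space handed to the DVM  */
        uint32_t net_kbit;              /* network bandwidth handed over */
};

struct dvm_virtual_machine {
        int                  vm_id;     /* the adapted guest kernel knows */
        struct dvm_donation *parts;     /* it runs on this aggregate of   */
        int                  nparts;    /* donated resources              */
};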

If you need full fail-over, you configure 2 virtual machines on the
cluster. Although you lose some performance, you still have the
advantage that when more performance is needed you just pop more
hardware onto the network (like a PXE boot image with a DragonFly
install pre-configured to be part of the cluster).

I like to compare this "future" technology with a (dragonfly) facet
eye: although there are many facets, there are only 2 eyes, and their
fields of view overlap almost completely :-)

A nice thought is scalability: you could combine a few DragonFly
installs on the virtual hardware together as a cluster again, and so
on and so on.

I have the schematics quite clear in my mind, and I hope you guys can
follow me (if I haven't bored you away already :-) ), because I know my
explanation of this idea is quite bad and I am not sure which logical
mistakes I have made. So I hope you can point them out to me.

mph




