Networked rebuild and self-healing in HAMMER2

Thu Mar 26 02:13:41 PDT 2015

Thanks, Matt, for the update. Excited to test HAMMER2.

However, from what you mention in paragraph 2, disk management (like
zRAID) does not seem to be part of HAMMER2 either.

Hopefully the cluster component is not limited to the local network, as
HAST is.

/z

On 3/26/15, Matthew Dillon <dillon at backplane.com> wrote:
> The idea is to be able to automate it at least so long as spare nodes are
> available.  So if one had a cluster of 3 masters (quorum is thus 2 nodes),
> and 2 additional nodes operating as slaves, then if one of the masters
> fails the cluster would continue to be able to operate with 2 masters until
> the failed master is replaced.  But the cluster would also be able to
> promote one of the slaves (already mostly synchronized) to become a master,
> returning the system to the full 3 masters and making the timing of the
> replacement less critical.
>
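The 3-master, 2-slave scenario described above can be sketched as a
small model. This is only an illustration, not HAMMER2 code: the
Cluster class, its method names, and the node names are hypothetical,
and the quorum is assumed to be a simple majority of the configured
master count (which gives 2 for 3 masters, matching the example).

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model of masters plus hot slaves (hypothetical, not HAMMER2)."""
    masters: set
    slaves: list             # hot spares, assumed already mostly synchronized
    target_masters: int = 3  # configured master count from the example

    @property
    def quorum(self) -> int:
        # Assumption: quorum is a simple majority of the configured masters.
        return self.target_masters // 2 + 1

    def can_operate(self) -> bool:
        # The cluster keeps running while live masters still form a quorum.
        return len(self.masters) >= self.quorum

    def fail(self, node: str) -> None:
        # Drop a failed node from whichever role it held.
        self.masters.discard(node)
        if node in self.slaves:
            self.slaves.remove(node)

    def promote_spares(self) -> list:
        # Promote slaves until the full master count is restored.
        promoted = []
        while len(self.masters) < self.target_masters and self.slaves:
            node = self.slaves.pop(0)
            self.masters.add(node)
            promoted.append(node)
        return promoted


cluster = Cluster(masters={"m1", "m2", "m3"}, slaves=["s1", "s2"])
cluster.fail("m2")                  # one master is lost
still_up = cluster.can_operate()    # 2 live masters still meet the quorum
promoted = cluster.promote_spares() # a slave takes the failed master's place
```

After the promotion the model is back to 3 masters with one slave left
over, which is why the timing of the physical replacement becomes less
critical.
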
> This alone does not really replace RAIDs.  For a very large storage
> subsystem, each node would be made up of many disks so another layer is
> needed to manage those disks.  The documentation has a 'copies' mechanism
> that is meant to address this, where redundancy is built within each node
> to handle disk failures and to manage a pool of hot replacements.  If a
> disk fails and is taken out, the idea is for there to be sufficient copies
> to be able to rebuild the node without having to access other nodes.  But
> if for some reason there is not a sufficient number of copies then it could
> in fact get the data from other nodes as well.
>
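The rebuild order described above (local copies first, other nodes only
if the local copies are not sufficient) can be sketched roughly as
follows. Everything here is an assumption made for illustration:
rebuild_block and its parameters are hypothetical names, and CRC32
stands in for whatever integrity check the filesystem actually uses.

```python
import zlib


def rebuild_block(expected_crc, local_copies, remote_fetchers):
    """Return (data, source) for the first copy that passes the check.

    local_copies    -- list of bytes or None (None = copy on a failed disk)
    remote_fetchers -- list of zero-argument callables that return bytes
                       or None; they stand in for reads from other nodes
    """
    # First try the redundant copies held inside this node.
    for data in local_copies:
        if data is not None and zlib.crc32(data) == expected_crc:
            return data, "local"

    # Only if no local copy is usable, fall back to the other nodes.
    for fetch in remote_fetchers:
        data = fetch()
        if data is not None and zlib.crc32(data) == expected_crc:
            return data, "remote"

    raise LookupError("no valid copy available locally or remotely")


block = b"file data"
crc = zlib.crc32(block)

# One disk is gone, but a second local copy survives: no network access.
local_result = rebuild_block(crc, [None, block], [])

# Both local copies are unusable, so the block comes from another node.
remote_result = rebuild_block(crc, [None, b"corrupt"], [lambda: block])
```

The same lookup order would also cover the healing case, since a copy
that fails the check is skipped exactly like a copy on a missing disk.
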
> For smaller storage systems the cluster component is probably sufficient.
> But for larger storage systems both the cluster component and the copies
> component would be needed.
>
> One important consideration here is how spare disks or spare nodes are
> handled.  I think it is relatively important for spare disks and spare
> nodes to be 'hot' ... that is, fully live in the system and useable to
> improve read fan-out performance.  So the basic idea for all spares (both
> at the cluster level and the copies level) is for the spare drives to be
> fully integrated into the filesystem as extra slaves.
>
> Right now I am working on the clustering component.  Getting both pieces
> operational is going to take a long time. I'm not making any promises on
> the timing.  The clustering component is actually the easier piece to do.
>
> -Matt
>
>
> On Wed, Mar 25, 2015 at 3:12 AM, PeerCorps Trust Fund <
> ipc at peercorpstrust.org> wrote:
>
>> Hi,
>>
>> If I understand the HAMMER2 design documents, one of the benefits that it
>> brings is the ability to rebuild a failed disk using multiple networked
>> mirrors? It seems that it also uses this capability to provide data
>> healing in the event of corruption.
>>
>> If this is the case, are these processes transparent to the user based on
>> some pre-defined failover configuration, or must they be manually set off
>> in the event of a disk failure/corruption?
>>
>> Also, would RAID controllers still be necessary in the independent nodes
>> if there is sufficient and reliable remote replication? Or could a
>> HAMMER2 filesystem span the disks in a particular node and have the
>> redundancy of the remote replication provide features that otherwise
>> would come from a RAID controller?
>>
>> Thanks for any clarifying statements on the above!
>>
>> --
>> Mike
>>
>