Plans for 1.8+ (2.0?)
cc at 137.org
Wed Feb 14 15:25:40 PST 2007
On 2/13/07, Matthew Dillon <dillon at apollo.backplane.com> wrote:
Well, as a replacement for something like RAID-5 then, yes, it
would be doable. Frankly, though, hard drive capacities are such
that it is almost universally better to mirror the data now than
to add ECC or PARITY based redundancy. Hard drive capacities will
increase by another 10x in the next 5 years.
Yes, I was considering it as a replacement for RAID-5. The idea being
that, for a given filesystem block, you would divide it into
sub-blocks and compute ECC blocks. These would then be distributed
across the cluster nodes.
For example, consider a 32kB filesystem block. Divide it into 4kB
sub-blocks, and compute 3 4kB ECC blocks. Now, distribute those 11
blocks over 11 separate nodes. Any three nodes can fail, and the space
overhead is only 37.5% (3/8) in this case. To provide the same guarantee
with mirroring would require four full copies, a 300% overhead. While
mirroring may be acceptable in terms of disk space, the network I/O
would likely be a bottleneck.
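As a rough illustration of the splitting scheme above (a sketch only:
tolerating three arbitrary failures needs a real erasure code such as
Reed-Solomon, so this simplified variant computes a single XOR parity
sub-block and tolerates the loss of any one of the nine distributed
blocks):

```python
from functools import reduce

BLOCK_SIZE = 32 * 1024   # filesystem block, as in the example
SUB_BLOCK  = 4 * 1024    # sub-block size

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def encode(block):
    """Split a 32kB block into eight 4kB sub-blocks plus one XOR
    parity sub-block; the nine results would each go to a node."""
    subs = [block[i:i + SUB_BLOCK] for i in range(0, BLOCK_SIZE, SUB_BLOCK)]
    return subs + [xor_blocks(subs)]

def recover(blocks, lost):
    """Rebuild the sub-block at index `lost`: the XOR of the eight
    survivors (data and parity together) equals the missing one."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return xor_blocks(survivors)
```

With three ECC blocks instead of one parity block, the same layout
gives the 11-block, 37.5%-overhead arrangement described above.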
This flexibility does introduce a considerable overhead in the block
pointers, but I believe that it is well worth it. Certainly for
metadata and other small blocks, you would want to mirror the data.
ECC blocks wouldn't help here. A data integrity hash, sure, but
not an ECC block. Data stored on a hard drive is already ECC'd
internally, so you just don't see the sort of correctable corruption
over the wire any more. The only type of bit corruption people see
now occurs when the DMA hardware is broken (ATA most typically has
this problem), in which case simply re-reading the data is the solution.
Sorry, I was not very clear; I had intended the ECC blocks to be used
as above, not in place of an end-to-end checksum. Using my previous
example, 8 or more of the blocks could be temporarily mirrored to
local cluster nodes, and then redistributed to the remote
nodes. (After which the superfluous blocks may be deallocated.)
This staging of data+ECC blocks may not be the best solution, it was
only a passing thought.
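The staging idea might look roughly like this (nodes modeled as plain
dicts mapping block index to bytes, purely for illustration; the
function name and phases are assumptions, not an existing interface):

```python
def stage_and_redistribute(blocks, local_nodes, remote_nodes):
    """Phase 1: mirror every block onto fast local nodes so the write
    can be acknowledged quickly.  Phase 2: place one block per remote
    node.  Phase 3: deallocate the now-superfluous local copies.
    Assumes len(remote_nodes) >= len(blocks)."""
    # Phase 1: temporary local mirrors of the full block set
    for node in local_nodes:
        for i, b in enumerate(blocks):
            node[i] = b
    # Phase 2: distribute one block to each remote node
    for i, (node, b) in enumerate(zip(remote_nodes, blocks)):
        node[i] = b
    # Phase 3: drop the staged copies once redistribution is done
    for node in local_nodes:
        node.clear()
```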
One could require synchronization to more than one physical
media target before allowing an fsync() to return, but the performance
hit would be rather severe.
Yes, but it would be a nice option to have; local nodes with a fast
interconnect would mitigate the problem. Fully redundant hardware
would be another solution, but it would be desirable to not have to
depend on that.
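The "fsync() returns only after N targets have the data" option could
be sketched like this (each target's synchronous write path is a
hypothetical callable here; the function name is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def replicated_fsync(write_fns, data, required):
    """Issue the write to every target in parallel and return True as
    soon as `required` targets acknowledge that the data has reached
    stable media; return False if too few targets succeed."""
    acked = 0
    with ThreadPoolExecutor(max_workers=len(write_fns)) as pool:
        futures = [pool.submit(fn, data) for fn in write_fns]
        for fut in as_completed(futures):
            if fut.result():
                acked += 1
                if acked >= required:
                    return True   # enough stable copies; fsync may return
    return False
```

With fast local interconnects, `required` local acknowledgements could
satisfy the fsync while remote propagation continues asynchronously.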