Plans for 1.8+ (2.0?)

Chris Csanady csanady at
Tue Feb 20 08:23:46 PST 2007

On 2/14/07, Simon 'corecode' Schubert <corecode at> wrote:
> Chris Csanady wrote:
> > For example, consider a 32kB filesystem block.  Divide it into 4kB
> > sub-blocks, and compute three 4kB ECC blocks.  Now, distribute those
> > 11 blocks over 11 separate nodes.  Any three nodes can fail, and the
> > space overhead is only 38% in this case.  To provide the same
> > guarantee with mirroring would carry a 300% overhead.  While
> > mirroring may be acceptable in terms of disk space, network I/O will
> > likely be a problem.
> How do you save on network IO there?
In the case I presented, less than one third of the bytes need to be
transferred compared with equivalent mirroring.  For N-way mirroring,
the network cost approaches N times that of a single write.  Also, the
larger the cluster, the larger N must be.
The number of network I/O's is less significant, and will not
necessarily translate into more disk I/O's if the data is laid out
intelligently.  Furthermore, if you already have open connections
between all cluster nodes, there isn't really any extra overhead at
all.  In the end, the data is sent in MTU sized chunks, and striping
it across a handful of connections is barely any extra work.
> You have to query 8(!) boxes to retrieve one block.  Okay, you might
> choose 8 out of 11, but that's still a lot.  For writing, you of
> course have to write to all 11.
Yes, but those queries will overlap and complete faster.  Such
striping across nodes is very similar to RAID-5 or RAID-6, and offers
similar advantages and disadvantages.  The main difference is that now
the network is the limiting factor, not disk.  In any case, the
network doesn't care where the data is going, only how much it has to
transfer.
> If you go mirroring, you can read the complete block from one source
> (you can of course also interleave reads across mirrors).  For
> writing, you can use multicast/broadcast on LAN.  That makes mirrored
> writes as efficient as normal writes.  When you do ECC, you have to
> write all 138%.  If you run over WAN, you probably won't be able to
> save with multicast, and then your block distribution will make it
> really hard to get a constant stream due to massive jitter.
If you have a random read-heavy workload, mirroring definitely makes
more sense.  There are also cases where striping makes more sense.
Depending on the application, the latter may provide much better
performance at a far lower cost.  At this point, possible
applications are the big question, and I don't think that artificially
limiting them before we even know what they are is a good idea.
How would jitter be any more of a problem?

> I'm not yet convinced :)  Disk space is really cheap these days.
I am not convinced that it would be easily done, but you must argue
further to convince me that it isn't worth doing. :)  Either way, it
seems that Matt only plans to implement replication, though I hope
that he will reconsider.  If the design is flexible enough that it can
be cleanly implemented in a future version, that would be sufficient.
Disk space is cheap, yes.  Sending massive amount of data across the
network is not.

More information about the Kernel mailing list