RAID 1 or Hammer

Wed Jan 14 00:03:46 PST 2009

On 13 Jan 2009, at 04:15, Matthew Dillon wrote:

   I've seen uncaught data corruption on older machines, but not in the
   last few years.  Ah, the days of IDE cabling problems, remembered
   fondly (or not).  I've seen bad data get through TCP connections
   uncaught!  Yes, it actually does happen, even more so now that OS's
   are depending more and more on CRC checking done by the ethernet
   device.
...
Modern (meaning anything with an ATA or SCSI controller in it) drives  
will do so much error checking and recovery that the time between  
externally noticeable failures and total breakdown will be very short.

I have a number of 7-8 year old hand-me-down IBM Netfinity servers to
use for testing purposes, and the combination of the processing done
by the ServeRAID controllers and the Datastar ultra-320 drives makes
it next to impossible for an error to slip through to the operating
system. I will probably find out soon enough how the eventual
breakdown happens: a drive on a system I'm stress testing has had a
yellow warning light on for about half a year now. It does not help to
have hot-swappable drives when you have run out of spares...

I have still had errors detected by JFS or ReiserFS, but they were not
caused by disk problems. On desktop systems, one of my first
suspects will be power supplies and bad capacitors on the motherboard.  
Another suspect is software bugs, and on the servers that is the most  
plausible.

--
Bjørn Vermo
Core networking
Opera Software ASA