EPIC FAIL with Intel Matrix RAID


So you think Intel's Matrix RAID is the greatest thing since bread and butter? So did I until it epic failed on me.

Intel is a name I've known and trusted for years. I've never doubted any of their new technologies as past experience has given me no reason to second guess any of their products. I've literally sold and installed hundreds of systems using Intel ICH based RAID 0 and have never had even a single problem. I had a single customer who had problems with an Intel ICH based RAID 1 array that would occasionally degrade on reboot, which, as an isolated occurrence, I had simply chalked up to a bad SATA cable or the (un)luck of the draw.

Matrix RAID, for those who aren't aware, is a beautiful technology that Intel invented back with ICH6R in the early turn of the century. It allows you to have two RAID volumes on the same physical disk array, for example a RAID 0 volume for performance and a RAID 1 volume for data integrity, without independent physical disks for each volume.

My original setup was dual 74GB Raptors in RAID 0, with my data spread across two individual 500GB drives which were not in RAID. Hard drive technology and capacity have evolved so much in the past several years that a single 640GB drive outperforms a 74GB RAID 0 array in terms of transfer speeds and nearly matches it in terms of seek times. With prices as low as $75 per 640GB drive, it was making less and less sense to keep my 74GB drives regardless of how much they're still worth. The performance advantage has disappeared.

Having purchased the 640GB drives, I didn't quite see the point of running them in RAID 0 alone as I can't imagine anyone needing a 1.3TB system drive. Not even the biggest games would ever require that much space to install. I didn't want to run them in RAID 1 because I wanted the performance boost of RAID 0. Seeing that I was already having a problem with the sheer quantity of SATA connections in use (after all, I do have 4 SATA DVD-RWs) combined with the fact that I have a motherboard that supports Matrix RAID, it made sense to me to implement the first 100GB of each drive as RAID 0 and the remaining 540GB as RAID 1. This gave me a 200GB system drive with the performance advantage and 540GB of secure space to keep my data.
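For anyone who wants to check the numbers on a split like this, here is a minimal sketch of the capacity arithmetic. The function name and parameters are made up for illustration and have nothing to do with Intel's tools. It only assumes that RAID 0 adds the slices together and RAID 1 keeps a single copy's worth of space.

```python
def matrix_raid_capacity(drive_gb, raid0_slice_gb, n_drives=2):
    """Return (raid0_usable_gb, raid1_usable_gb) for a two-volume split.

    Hypothetical helper: each drive gives its first `raid0_slice_gb` to a
    striped volume and the remainder to a mirrored volume.
    """
    if not 0 <= raid0_slice_gb <= drive_gb:
        raise ValueError("RAID 0 slice must fit on the drive")
    raid0_usable = raid0_slice_gb * n_drives   # striping adds the slices together
    raid1_usable = drive_gb - raid0_slice_gb   # mirroring stores one copy's worth
    return raid0_usable, raid1_usable


print(matrix_raid_capacity(640, 100))
```

With two 640GB drives and a 100GB slice per drive, this gives the 200GB striped volume and 540GB mirrored volume described above.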

Once the new setup was in place, I transferred all my data off of my two 500GB drives onto my RAID 1 array since I had enough space to accommodate it all, and I've been running this new setup for a few months now without any problem. Recently, I had a customer who needed 500GB drives on short notice, so I emptied the ones I had and sold them to him. This left me without any backups of my data. I wasn't worried since I was running the data volume in RAID 1; what's better than a mirror of the data on a second drive? Nothing gave me any indication of how dangerous a fire I was playing with.

Saturday morning, my computer received a small accidental bump. It wasn't much, but it was enough to cause Windows to crash. Upon rebooting, the data array was marked as degraded with one of the drives marked as having an error. "Okay, I'll just rebuild it," I figured. Nothing to it. 40% into the rebuild, the RAID manager balked that an error had occurred and it couldn't finish the rebuild. I tried it three times in a row, same error. The only way to fix it is apparently to delete the volume and recreate it. So I tried navigating the drive to make sure my data was still intact. Most of it seemed okay, but there were some folders I couldn't access because Windows returned a DEVICE IO ERROR. Now I was starting to get concerned. I figured that since there was a problem with one of the drives and it was still active, maybe that was why I was getting a bunch of device errors. With RAID 1 all my data's mirrored anyway, isn't it?

So I shut down my computer, unplugged the drives, plugged in one of my now unused 74GB Raptors and installed a fresh copy of Vista. I then shut down my computer, plugged in the working drive and left the affected drive unplugged. I booted into the new Vista installation and now both volumes were marked as Failed. A failure on the RAID 0 volume is expected since half the volume is unplugged, but isn't RAID 1 supposed to allow for a single drive failure? I checked for my data partitions and found nothing. No partitions. I shut down, plugged in the affected drive and booted into the new Vista installation again. The RAID 0 volume showed up as Normal but the RAID 1 volume still showed as Failed. So I checked the RAID manager, marked the affected drive as normal and tried rebuilding it again. Same error at 40% and still no partitions. I shut down the computer and booted into the Vista installation I had on the RAID 0 volume. Same problem, still no partitions. I shut down, switched the controller to IDE mode instead of RAID, installed Vista on a single 74GB drive, then tried a dozen undelete and recovery programs. Still no partitions, no data.

How can this happen? Isn't RAID 1 supposed to mirror the data from one drive to another? Where are the partitions? Why would the data disappear in the case of a missing drive? What in the world is going on? From what I can gather, it seems that the RAID 1 volume that the ICH controller creates is not a true RAID 1 volume. Furthermore, a document discussing their Rapid Recover Technology on the ICHR based controllers states "The recovery volume can be the only volume on a system." That would seem to indicate that mixing RAID types on a single pair of drives creates a pseudo-RAID 1 array with the crucial volume information created on the master drive only. Either way, this left me with years of my work, my accounting, pictures of my son and hundreds of gigabytes of data that I could not recover.

At the end of my desperation, I stumbled upon a little piece of software aptly called Raid Recovery. The sales pitch sounded almost too good to be true. With nothing left to lose, I installed it and tried to see if it could help. It was quickly off to a rocky start since it couldn't get past the initial disk verification stage when the application loads. A quick check in Process Monitor showed that it was stuck in a sector loop on the failed drive that kept returning IO ERRORs.

I shut down my computer once again, changed the controller to IDE mode and went into the new single drive Vista installation I made previously. This time around, the software loaded quickly and I was presented with the Raid Recovery volume. In a matter of a few clicks of the mouse, a Virtual RAID array had been constructed within the software with a view of the volumes on my two drives. A quick scan of the virtual RAID volumes revealed my partitions. A few clicks later and I was watching a progress bar and a rapidly increasing file count. I don't think I've ever installed a single piece of software in my life that has lifted as much weight off of my heart as this one has. Several hours later and unfortunately $500 CAD later ($325 for the software), I'm now watching all of my files recover to an external hard drive ($175 for the drive and case) while I write this post.

It's a lesson with a steep fine, but not even the largest of costs can compare to the price of permanently losing all of that valuable data. Much of it could have been reconstructed with months of sleepless nights, but nothing could ever replace the photos and memories that some of the data carries. Would I still trust Intel's implementation of RAID 0? Sure. Would I ever trust Intel's implementation of RAID 1 again? Never. There's a reason reliable RAID controllers such as those offered by 3ware cost several hundreds of dollars and this is living proof of why. Hopefully, my experience can help prevent someone else from living the same nightmare I'm cleaning up.



There's a reason reliable RAID controllers such as those offered by 3ware cost several hundreds of dollars

For the most part, the extra cost comes from the processor that does the XOR'ing (to have good speeds in RAID5/6 without high CPU load)... But yes, they do have more features (like staggered spinup), and the software can be better too. You could do software RAID0 or RAID1 using the OS' mechanisms and not run into those issues, without the extra cost of those fancy cards. Those cards are great, until they fail. Then it's just a different nightmare. A different card won't recover your old array, and sometimes the exact same card but with a different firmware won't either. And when it fails a few years down the line, it can be real hard to find one of the exact same card to recover your data. That's when it's not a PERC from some very well known manufacturer, who insists on doing a test that will actually wipe your array clean, just to ensure it's the controller that failed! (been there, done that, wasn't exactly fun). Still gotta have offsite backups :(
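For anyone wondering what the XOR'ing actually is, here's a minimal sketch of single-parity (RAID 5 style) math in Python. The function names and sample blocks are made up for illustration, and this is not how any particular card's firmware does it. It only shows that parity is the XOR of the data blocks, and that a lost block comes back by XOR'ing whatever survived with the parity.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    # zip(*blocks) yields one tuple of ints per byte position
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical three-disk data stripe plus one parity block
data = [b"abcd", b"1234", b"wxyz"]
parity = xor_parity(data)
print(rebuild_missing([data[0], data[2]], parity) == data[1])
```

Doing that for every stripe on every write is the work a dedicated processor takes off the CPU.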

I'm only using the ICH9R in RAID0, so no worries. And all my important docs are also backed up every once in a while, on an external drive, stored at my dad's place like 200 km away (in case of theft, fire or whatnot).

Edit: In fact, I kind of wish there were some inexpensive 12 port SATA cards, RAID or not (with staggered spinup), at a half-decent price ($300 perhaps). Don't need a fancy processor on it, just a lotta ports.

Edited by crahak

I didn't know about this. I thought it was called matrix because it allowed mixing of IDE and SATA. And creating RAID sets from partitions surely doesn't look bulletproof (for one, adding another layer of complexity = asking for trouble).

GL


... a beautiful technology that Intel invented back with ICH6R in the early turn of the century.

"early turn of the century" just about sums it up for me, in this cautionary tale. Looking back, it is clear that it was a compromise solution at a time when disk prices were way above what we have now.


Wow, what a story. I'm so glad you have all your data back, it would be terrible to lose all your pictures and home videos!

Thanks for sharing it with us :)

Unfortunately, I've recovered only about 20% of my data. The rest is corrupt. Seems like the RAID controller corrupted my data in 512KB stripes (not sure what the hell happened there). Any file that is 512KB or less (to the byte) is fine; anything larger (all my photos of my son, my software and my music, for example) is corrupt. The first 512KB is fine, the next 512KB is not, the next 512KB is fine, and so on...
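To put the pattern in concrete terms, here's a rough Python sketch of which byte ranges of a file would survive if every second 512KB chunk is bad. The function names are made up for illustration. It assumes the good/bad alternation starts at the beginning of each file, which is what the "512KB or less is fine" observation suggests but which I can't confirm from the disk layout.

```python
STRIPE = 512 * 1024  # 512KB, the chunk size observed in the corrupted files


def intact_ranges(file_size, stripe=STRIPE):
    """Return (start, end) byte ranges presumed intact: chunks 0, 2, 4, ..."""
    ranges = []
    offset = 0
    index = 0
    while offset < file_size:
        end = min(offset + stripe, file_size)
        if index % 2 == 0:  # assumption: even-numbered chunks are the good ones
            ranges.append((offset, end))
        offset = end
        index += 1
    return ranges


def intact_fraction(file_size, stripe=STRIPE):
    """Fraction of the file's bytes that fall inside presumed-intact chunks."""
    if file_size == 0:
        return 1.0
    good = sum(end - start for start, end in intact_ranges(file_size, stripe))
    return good / file_size


print(intact_ranges(3 * STRIPE))
print(intact_fraction(2 * STRIPE))
```

Under that assumption a file of exactly 512KB is untouched, while anything large ends up with roughly every other chunk unusable.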

