Oliver.HH
Posts posted by Oliver.HH
-
Quote: Many of you are still making outrageous statements about the depth of the problem.
I'd say some of us made attempts at judging the affected drive population. They put all their numbers and assumptions on the table for further discussion.

Quote: Try getting the exact number of 7200.11 disks Seagate shipped.
That's not a published number, right? Are you saying this simply to disparage any other attempt to estimate that number?

Quote: So think about how many QC test stations there are on the floor and consider that it is more likely that only one of them was configured to leave the trigger code on the disks.
Now you're making wild guesses without any factual basis. You don't know the percentage of test stations writing the "trigger code". That's a number Seagate hasn't dared to publish so far, and that might be for a reason.

Quote: A "lot" of people are having this issue? Millions? I don't see Dell, HP, SUN, IBM, EMC, and Apple making press releases about how Seagate burned them. You don't think Apple would drop Seagate in a heartbeat if they felt Seagate had a real-world, high-risk problem?
Pure speculation. You claim to be a technical expert, but you're making assumptions based on corporate psychology. In addition, you're ignoring two little facts: (1) OEMs may be legally responsible for damages incurred by their customers. (2) There are not so many disk drive manufacturers around that a large-scale product buyer would light-heartedly agree to reduce the number of competitors.

Quote: The only way to explain the quiet from the PC vendors is that the risk is profoundly low.
That's the only way you can imagine. We might or might not agree. Anyway, it's probably just too early to tell.

Quote: So all of these other vendors HAD to have known about the problem from the beginning. It would not be unreasonable for them to also receive the complete lists of affected serial numbers (but I am not saying they were given the list as fact; it is my opinion that they were given lists of the affected drives that Seagate shipped them).
You still believe this even though it took Seagate several attempts to publish a working online serial number check?

Quote: Here is a nice little post that shows you that the 7200.11 disks you all have are "rated" for only 2400 hours use per year.
You are misstating the facts. Seagate simply states the usage patterns employed for AFR and MTBF calculations (2400 power-on hours, 10,000 start/stop cycles). That does not mean at all that desktop drives have a higher probability of failure when used 4800 hours per year, or any other number for that matter. You cannot tell: Seagate didn't publish data for alternative usage patterns. So you're the one spreading FUD here. BTW, in some respects server disk drives operating 24/7 can have a weaker design than desktop drives, let alone notebook drives: they don't need to withstand a high number of start/stop cycles. So a higher price point doesn't necessarily mean a more robust design for every usage scenario.
-
Quote: Still, not entirely my point, it's a bit hard to explain myself.
OK, I hope I've got it now!
You're saying that there are certain logging patterns that might decrease the probability of hitting the critical counter values, right?
If so, I'd say it's entirely possible, though unlikely, that such patterns exist. But we certainly cannot know for sure.
If there is some variance in the number of log entries written per power cycle, the probability of drive failure should follow the curves already presented. The lower the average number of log entries, the higher the chance of failure, but the overall magnitude does not change much.
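To illustrate that point, here is a rough Monte Carlo sketch (my own toy model, not anything from Seagate): each power cycle adds a random number of log entries, and the drive bricks if the counter sits exactly on a multiple of 256 when it is powered off.

```python
import random

def simulate_one_year(avg_entries, days=365, trials=20_000):
    """Estimated chance of failing within `days` power cycles (one per day)."""
    fails = 0
    for _ in range(trials):
        counter = 0
        for _ in range(days):
            # random entry count per power cycle, averaging avg_entries
            counter += random.randint(1, 2 * avg_entries - 1)
            if counter % 256 == 0:           # bad value at power-off -> bricked
                fails += 1
                break
    return fails / trials

for avg in (3, 5):
    print(f"avg {avg} entries/cycle: ~{simulate_one_year(avg):.0%} failure within a year")
# Both come out around 80%, so variance in the per-cycle count barely moves the result.
```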
On another topic: I happen to own a Dell OEM drive (ST3500620AS). It currently runs Dell's OEM firmware DE12. Dell has issued an updated version DE13 to address the Barracuda bug, but the update's batch file calls the updater's executable with a parameter "-m BRINKS". In contrast, Seagate's SD1A firmware update for that drive is called "MOOSE...". What happens if the Dell folks inadvertently published the wrong version and I incorrectly apply a BRINKS firmware to a MOOSE drive? Will it just stop working or will it get even worse (silently and slowly altering data, for example)?
-
Quote: To get to 320 in such a system, the "initial" address x would have to satisfy x + 3*n = 320 as a function of n power cycles, i.e. only values satisfying x = 320 - 3*n, from the bottom:
317 as opposed to 319 - 318 - 317
314 as opposed to 316 - 315 - 314
311 as opposed to 313 - 312 - 311
...
thus reducing the probabilities to 1/3 of what was calculated for the "single event" addition.
That's certainly true for the chance of hitting when you are near the boundary. But if you consider that you are approaching the boundary at three times the speed and there is always a next boundary to hit (initially at 320, then at every multiple of 256), you are getting close to those boundaries three times as often.
In the long run, that amounts to 1/3 (chance of hitting when near) * 3 (frequency of being near the boundary) = 1, so the overall probability would be the same.
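For anyone who wants to check that reasoning, here is a small exact enumeration (my own sketch, using the simplified rule that the drive bricks when the counter sits on a multiple of 256 at power-off, and leaving the special 320 value aside): for every possible initial counter offset it finds the first power cycle at which a counter advancing by 1 per cycle, and one advancing by 3 per cycle, lands exactly on a boundary.

```python
def first_hit_cycles(step):
    """First power cycle at which the counter sits on a multiple of 256, per start offset."""
    hits = []
    for offset in range(256):                # every possible initial log fill
        counter = offset
        for cycle in range(1, 1000):
            counter += step                  # entries added during this power cycle
            if counter % 256 == 0:           # bad value at power-off
                hits.append(cycle)
                break
    return sorted(hits)

# Both step sizes yield exactly the same sorted list of first-failure cycles, 1..256,
# so averaged over the starting offset the overall failure probability is the same.
print(first_hit_cycles(1) == first_hit_cycles(3))   # prints True
print(max(first_hit_cycles(3)))                     # prints 256
```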
Thanks for the graph! Should help people decide whether to participate in the game ;-).
-
Quote: All that matters is that the event counter changes at all from power-on to power-off. It does not matter whether it increases by 1, by 50, or by any other value, as long as such values are equally probable.
Quote: But the events are hardly equally probable. It's much more likely that you're going to get a very small number each power cycle. The chances of dozens or hundreds of entries each power cycle are almost non-existent unless your drive is hosed to begin with.
You are right. My statement was an oversimplification.
Quote: And consider this: if the log incremented by EXACTLY one each power cycle (I don't know if that's even possible), what's the probability an (affected) drive will fail? It's 100%. It will fail with certainty because it WILL occur on the 320th power cycle.
If you assume that the log is initially empty (event counter at 0), that's certainly true. To be more precise, the probability of the drive failing would be 0% on the first 319 days, jumping to 100% on day 320.
Quote: Figuring out the probability of failure on any single power cycle isn't really useful. The question most 7200.11 owners have is: What are the chances my drive will fail AT ALL in the next year or two?
Absolutely. That's in line with what I intended to point out. While the probability of anything failing is 100% in an infinite number of years, these drives are very likely to fail in their first year of service. My calculation estimates a 76% chance of failure within a year. Real numbers might be even worse.
I've quickly calculated the chances of drive failure given a certain average number of log entries per power cycle. Again I've ignored the initial 320 boundary, as the log might not be empty when a drive ships. So for 5 entries on average, we have about 50 days of 0% failure probability (as the log fills up to its 256 boundary), then a 20% chance of failure on day 51. The chances of a drive still being alive after that are thus 80%. On day 102 there is again a 20% chance of failure, making a total chance of 64% of a drive being alive on day 102 (it has had two 20% chances to die by then). And so on. Given 5 log entries on average, the failure probability over one year would be 79%. For 3 log entries on average, the failure probability over one year would be 80.2%.
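For what it's worth, the same figures drop out of a short formula (my restatement of the calculation above, not an official Seagate number): with m entries logged per power cycle on average, the counter crosses a 256-entry boundary about floor(365 * m / 256) times a year, and each crossing carries roughly a 1-in-m chance of the drive being switched off exactly on the boundary.

```python
def one_year_failure(entries_per_cycle, days=365, boundary=256):
    crossings = days * entries_per_cycle // boundary       # boundary crossings per year
    survive_one_crossing = 1 - 1 / entries_per_cycle       # chance of skipping a boundary
    return 1 - survive_one_crossing ** crossings

for m in (5, 3):
    print(f"{m} entries per cycle: {one_year_failure(m):.1%}")
# prints 79.0% for 5 entries and 80.2% for 3 entries, matching the numbers above
```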
I'm wondering whether Seagate can really say with confidence that "Based on the low risk as determined by an analysis of actual field return data, Seagate believes that the affected drives can be used as is" (see KB article).
-
Quote: Yep, and we don't even have a clear idea on WHICH events are logged and HOW MANY such events take place in an "average powered-on hour".
True, but we don't have to know. The probability of a drive failing is the same as long as at least one event is logged per power cycle.
Quote: If, as it has been hinted/reported somewhere on the threads, a S.M.A.R.T. query raises an event that is actually logged, we will soon fall into the paradox that the more you check your hardware status, the more prone it is to fail...
No, the chance of a drive failing due to this condition is zero unless it is powered off.
All that matters is that the event counter changes at all from power-on to power-off. It does not matter whether it increases by 1, by 50, or by any other value, as long as such values are equally probable.
-
Another attempt to estimate the probability of a drive failing...
Given the "root cause" document posted here by sieve-x, this is what we know:
- A drive is affected by the bug if it contains the defective firmware and has been tested on certain test stations.
- An affected drive will fail if it is turned off after exactly 320 internal events have been logged initially, or after any multiple of 256 thereafter.
We don't have the details on how often exactly the event log is written to. Someone mentioned that it's written to when the drive initializes on power-up (though I don't remember the source). If that's true, we would have one event per power cycle plus an unknown and possibly varying number in between.
Given that, the probability of an affected drive being alive after one power cycle is 255/256. After two power cycles it's 255/256 * 255/256. After three power cycles it's (255/256)^3. And so on. While the isolated probability of the drive failing on a single power-up is just 0.4%, the numbers go up when you calculate the probability of a drive failing over time.
Let's assume a desktop drive is power cycled once a day. The probability of an affected drive failing is then:
0.4% for 1 day
11.1% over 30 days
29.7% over 90 days
76.0% over 365 days
Obviously, I'm ignoring the fact that initially a higher number of events (320) must be logged to trigger the failure. Anyway, this would not change the numbers substantially, and the initial number might even be lower than 256 depending on the number of events logged during the manufacturing process. I'm also ignoring the number of events written while the drive is powered on, as it should not affect the overall probability.
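For anyone who wants to reproduce the figures above, this is the whole calculation (a sketch of my own reasoning, assuming one power cycle per day and a 1-in-256 chance per cycle of the counter sitting on a bad value at power-off):

```python
for days in (1, 30, 90, 365):
    p_fail = 1 - (255 / 256) ** days        # one power cycle per day assumed
    print(f"{days:3d} days: {p_fail:.1%}")
# prints 0.4%, 11.1%, 29.7% and 76.0%, the figures listed above
```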
-
Seagate Barracuda 7200.11 Troubles
in Hard Drive and Removable Media
Posted · Edited by Oliver.HH
Initially, I could not tell whether my drive was affected, so my first question was whether Dell's DE12 firmware was based on Seagate's SD15 (or another affected version). Dell support did not have that information. So I read the manufacturing date from my drive's label (September) and compared it to the manufacturing dates of drives that had already failed (from the fail/fine thread in this forum). My impression was that my firmware had a high probability of being derived from the buggy ones, and this turned out to be true. In contrast, I've read statements on the web where people just compare the firmware version DE12 to the versions confirmed by Seagate as buggy and then incorrectly deduce that their firmware is OK (try googling "7200.11 +DE12").
By the way, Dell has another fix on their site for ST3750630AS and ST31000340AS drives.