Oliver.HH
Posts posted by Oliver.HH
-
Quote: Many of you are still making outrageous statements about the depth of the problem.
I'd say some of us made attempts at judging the affected drive population. They put all their numbers and assumptions on the table for further discussion.

Quote: Try getting the exact number of 7200.11 disks Seagate shipped.
That's not a published number, right? Are you saying this simply to disparage any other attempt to estimate that number?

Quote: So think about how many QC test stations there are on the floor and consider that it is more likely that only one of them was configured to leave the trigger code on the disks.
Now you're making wild guesses without any factual basis. You don't know the percentage of test stations writing the "trigger code". That's a number Seagate hasn't dared to publish so far, and that might be for a reason.

Quote: A "lot" of people are having this issue? Millions? I don't see Dell, HP, SUN, IBM, EMC, and Apple making press releases about how Seagate burned them. You don't think Apple would drop Seagate in a heartbeat if they felt Seagate had a real-world, high-risk problem?
Pure speculation. You claim to be a technical expert, but you're making assumptions based on corporate psychology. In addition, you're ignoring two little facts: (1) OEMs may be legally responsible for damages incurred by their customers. (2) There are not so many disk drive manufacturers around that a large-scale product buyer would light-heartedly agree to reduce the number of competitors.

Quote: The only way to explain the quiet from the PC vendors is that the risk is profoundly low.
That's the only way you can imagine. We might or might not agree. Anyway, it's probably just too early to tell.

Quote: So all of these other vendors HAD to have known about the problem from the beginning. It would not be unreasonable for them to also receive the complete lists of affected serial numbers (but I am not saying they were given the list as fact; it is my opinion that they were given lists of the affected drives that Seagate shipped them).
You still believe this even though it took Seagate several attempts to publish a working online serial number check?

Quote: Here is a nice little post that shows you that the 7200.11 disks you all have are "rated" for only 2400 hours use per year.
You are misstating the facts. Seagate simply states the usage patterns employed for AFR and MTBF calculations (2400 power-on hours, 10,000 start/stop cycles). That does not mean at all that desktop drives have a higher probability of failure when used 4800 hours per year, or any other number for that matter. You cannot tell: Seagate didn't publish data for alternative usage patterns. So you're the one spreading FUD here. BTW, in some respects server disk drives operating 24/7 can have a weaker design than desktop drives, let alone notebook drives: they don't need to withstand a high number of start/stop cycles. So a higher price point doesn't necessarily mean a more robust design for every usage scenario.
-
Quote: Still, not entirely my point, it's a bit hard to explain myself.
OK, I hope I've got it now!
You're saying that there are certain logging patterns that might decrease the probability of hitting the critical counter values, right?
If so, I'd say it's entirely possible, though unlikely, that such patterns exist. But we certainly cannot know for sure.
If there is some variance in the number of log entries written per power cycle, the probability of drive failure should follow the curves already presented. The lower the average number of log entries, the higher the chance of failure, but the overall magnitude does not change much.
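To illustrate that point, here is a rough Monte Carlo sketch (my own toy model, not anything from Seagate): each power cycle adds a random number of log entries, and the drive bricks if the counter sits exactly on a multiple of 256 when it is powered off.

```python
import random

def simulate_one_year(avg_entries, days=365, trials=20_000):
    """Estimated chance of failing within `days` power cycles (one per day)."""
    fails = 0
    for _ in range(trials):
        counter = 0
        for _ in range(days):
            # random entry count per power cycle, averaging avg_entries
            counter += random.randint(1, 2 * avg_entries - 1)
            if counter % 256 == 0:           # bad value at power-off -> bricked
                fails += 1
                break
    return fails / trials

for avg in (3, 5):
    print(f"avg {avg} entries/cycle: ~{simulate_one_year(avg):.0%} failure within a year")
# Both come out around 80%, so variance in the per-cycle count barely moves the result.
```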
On another topic: I happen to own a Dell OEM drive (ST3500620AS). It currently runs Dell's OEM firmware DE12. Dell has issued an updated version DE13 to address the Barracuda bug, but the update's batch file calls the updater's executable with a parameter "-m BRINKS". In contrast, Seagate's SD1A firmware update for that drive is called "MOOSE...". What happens if the Dell folks inadvertently published the wrong version and I incorrectly apply a BRINKS firmware to a MOOSE drive? Will it just stop working or will it get even worse (silently and slowly altering data, for example)?
-
Quote: To get to 320 in such a system, the "initial" address x would have to satisfy x + 3*n = 320 as a function of n power cycles, i.e. only values satisfying x = 320 - 3*n, from the bottom:
317 as opposed to 319 - 318 - 317
314 as opposed to 316 - 315 - 314
311 as opposed to 313 - 312 - 311
...
thus reducing the probabilities to 1/3 of what was calculated for the "single event" addition.
That's certainly true for the chance of hitting when you are near the boundary. But if you consider that you are approaching the boundary at three times the speed and there is always a next boundary to hit (initially at 320, then at every multiple of 256), you are getting close to those boundaries three times as often.
In the long run, that amounts to 1/3 (chance of hitting when near) * 3 (frequency of being near the boundary) = 1, so the overall probability would be the same.
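For anyone who wants to check that reasoning, here is a small exact enumeration (my own sketch, using the simplified rule that the drive bricks when the counter sits on a multiple of 256 at power-off, and leaving the special 320 value aside): for every possible initial counter offset it finds the first power cycle at which a counter advancing by 1 per cycle, and one advancing by 3 per cycle, lands exactly on a boundary.

```python
def first_hit_cycles(step):
    """First power cycle at which the counter sits on a multiple of 256, per start offset."""
    hits = []
    for offset in range(256):                # every possible initial log fill
        counter = offset
        for cycle in range(1, 1000):
            counter += step                  # entries added during this power cycle
            if counter % 256 == 0:           # bad value at power-off
                hits.append(cycle)
                break
    return sorted(hits)

# Both step sizes yield exactly the same sorted list of first-failure cycles, 1..256,
# so averaged over the starting offset the overall failure probability is the same.
print(first_hit_cycles(1) == first_hit_cycles(3))   # prints True
print(max(first_hit_cycles(3)))                     # prints 256
```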
Thanks for the graph! Should help people decide whether to participate in the game ;-).
-
Quote: All that matters is that the event counter changes at all from power-on to power-off. It does not matter whether it increases by 1, by 50, or by any other value, as long as such values are equally probable.
Quote: But the events are hardly equally probable. It's much more likely that you're going to get a very small number each power cycle. The chances of dozens or hundreds of entries each power cycle are almost non-existent unless your drive is hosed to begin with.
You are right. My statement was an oversimplification.
Quote: And consider this: if the log incremented by EXACTLY one each power cycle (I don't know if that's even possible), what's the probability an (affected) drive will fail? It's 100%. It will fail with certainty because it WILL occur on the 320th power cycle.
If you assume that the log is initially empty (event counter at 0), that's certainly true. To be more precise, the probability of the drive failing would be 0% on the first 319 days, jumping to 100% on day 320.
Quote: Figuring out the probability of failure on any single power cycle isn't really useful. The question most 7200.11 owners have is: What are the chances my drive will fail AT ALL in the next year or two?
Absolutely. That's in line with what I intended to point out. While the probability of anything failing is 100% in an infinite number of years, these drives are very likely to fail in their first year of service. My calculation estimates a 76% chance of failure within a year. Real numbers might be even worse.
I've quickly calculated the chances of drive failure given a certain average number of log entries per power cycle. Again I've ignored the initial 320 boundary, as the log might not be empty when a drive ships. So for 5 entries on average, we have about 50 days of 0% failure probability (as the log fills up to its 256 boundary), then a 20% chance of failure on day 51. The chances of a drive still being alive after that are thus 80%. On day 102 there is again a 20% chance of failure, making a total chance of 64% of a drive being alive on day 102 (it has had two 20% chances to die by then). And so on. Given 5 log entries on average, the failure probability over one year would be 79%. For 3 log entries on average, the failure probability over one year would be 80.2%.
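For what it's worth, the same figures drop out of a short formula (my restatement of the calculation above, not an official Seagate number): with m entries logged per power cycle on average, the counter crosses a 256-entry boundary about floor(365 * m / 256) times a year, and each crossing carries roughly a 1-in-m chance of the drive being switched off exactly on the boundary.

```python
def one_year_failure(entries_per_cycle, days=365, boundary=256):
    crossings = days * entries_per_cycle // boundary       # boundary crossings per year
    survive_one_crossing = 1 - 1 / entries_per_cycle       # chance of skipping a boundary
    return 1 - survive_one_crossing ** crossings

for m in (5, 3):
    print(f"{m} entries per cycle: {one_year_failure(m):.1%}")
# prints 79.0% for 5 entries and 80.2% for 3 entries, matching the numbers above
```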
I'm wondering whether Seagate can really say with confidence that "Based on the low risk as determined by an analysis of actual field return data, Seagate believes that the affected drives can be used as is" (see KB article).
-
Quote: Yep, and we don't even have a clear idea on WHICH events are logged and HOW MANY such events take place in an "average powered-on hour".
True, but we don't have to know. The probability of a drive failing is the same as long as at least one event is logged per power cycle.
Quote: If, as it has been hinted/reported somewhere on the threads, a S.M.A.R.T. query raises an event that is actually logged, we will soon fall into the paradox that the more you check your hardware status, the more prone it is to fail...
No, the chance of a drive failing due to this condition is zero unless it is powered off.
All that matters is that the event counter changes at all from power-on to power-off. It does not matter whether it increases by 1, by 50, or by any other value, as long as such values are equally probable.
-
Another attempt to estimate the probability of a drive failing...
Given the "root cause" document posted here by sieve-x, this is what we know:
- A drive is affected by the bug if it contains the defective firmware and has been tested on certain test stations.
- An affected drive will fail if it is turned off after exactly 320 internal events have been logged initially, or after any multiple of 256 thereafter.
We don't have the details on how often exactly the event log is written to. Someone mentioned that it's written to when the drive initializes on power-up (though I don't remember the source). If that's true, we would have one event per power cycle plus an unknown and possibly varying number in between.
Given that, the probability of an affected drive being alive after one power cycle is 255/256. After two power cycles it's 255/256 * 255/256. After three power cycles it's (255/256)^3. And so on. While the isolated probability of the drive failing on a single power-up is just 0.4%, the numbers go up when you calculate the probability of a drive failing over time.
Let's assume a desktop drive is power cycled once a day. The probability of an affected drive failing is then:
0.4% for 1 day
11.1% over 30 days
29.7% over 90 days
76.0% over 365 days
Obviously, I'm ignoring the fact that initially a higher number of events (320) must be logged to trigger the failure. Anyway, this would not change the numbers substantially, and the initial number might even be lower than 256 depending on the number of events logged during the manufacturing process. I'm also ignoring the number of events written while the drive is powered on, as it should not affect the overall probability.
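For anyone who wants to reproduce the figures above, this is the whole calculation (a sketch of my own reasoning, assuming one power cycle per day and a 1-in-256 chance per cycle of the counter sitting on a bad value at power-off):

```python
for days in (1, 30, 90, 365):
    p_fail = 1 - (255 / 256) ** days        # one power cycle per day assumed
    print(f"{days:3d} days: {p_fail:.1%}")
# prints 0.4%, 11.1%, 29.7% and 76.0%, the figures listed above
```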
-
Seagate Barracuda 7200.11 Troubles
in Hard Drive and Removable Media
Posted · Edited by Oliver.HH
Initially, I could not tell whether my drive was affected, so my first question was whether Dell's DE12 firmware was based on Seagate's SD15 (or another affected version). Dell support did not have that information. So I read the manufacturing date from my drive's label (September) and compared it to the manufacturing dates of drives that had already failed (from the fail/fine thread in this forum). My impression was that my firmware had a high probability of being derived from the buggy ones, and this turned out to be true. In contrast, I've read statements on the web where people just compare the firmware version DE12 to the versions confirmed by Seagate as buggy and then incorrectly deduce that their firmware is OK (try googling "7200.11 +DE12").
By the way, Dell has another fix on their site for ST3750630AS and ST31000340AS drives.