sieve-x

February 24, 2009

Multiple bad (i.e. reallocated) sectors cropping up on a week-old Seagate 1.5TB:

http://forums.seagate.com/stx/board/messag...message.id=5431

http://stx.lithium.com/stx/board/message?b...thread.id=10196

http://stx.lithium.com/stx/board/message?b...thread.id=10364

How to dump S.M.A.R.T for error checking:

http://www.msfn.org/board/index.php?showtopic=130128

February 18, 2009

************** THIS IS ONLY AN EXAMPLE *******************

PREVIOUS ISSUE SYMPTOMS: MFT errors, rebooted and became BSY.

CURRENT ISSUE SYMPTOMS: Loud clicking noises (not present before)

LISTED AS AFFECTED: Y

REPAIRED: Y

METHOD: PAID DR

APPLIED WRONG FIRMWARE BEFORE: N

UPDATED FW AFTER/BEFORE REPAIR: A

PREVIOUS FIRMWARE: SD15

CURRENT FIRMWARE: SD1A

EXTERNAL DRIVE (USB/ESATA/1394): N

=== START OF INFORMATION SECTION ===

Model Family: Seagate Barracuda 7200.11

Device Model: ST3500320AS

Firmware Version: SD1A

User Capacity: 500 107 862 016 bytes

Device is: In smartctl database [for details use: -P show]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Sat Jan 24 17:08:18 2009 RST

SMART support is: Available - device has SMART capability.

Enabled status cached by OS, trying SMART RETURN STATUS cmd.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 117) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       147144821
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       158
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   037   036   030    Pre-fail  Always       -       10810444186323
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       908
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       207
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   066   045    Old_age   Always       -       31 (Lifetime Min/Max 31/31)
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 11 0 0)
195 Hardware_ECC_Recovered  0x001a   044   032   000    Old_age   Always       -       147144821
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       871         -
# 2  Extended offline    Completed without error       00%       851         -
# 3  Short offline       Completed without error       00%       801         -
# 4  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

=====

This example suggests the drive may have suffered shipping or handling damage: the Seek_Error_Rate value (037) is very close to its threshold (030).
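The kind of check behind that comment (a normalized VALUE sitting just above its THRESH) can be sketched in Python. This is only a sketch: the function names are made up, the nominal best value of 100 and the 25% cut-off are my own assumptions rather than anything defined by smartmontools or Seagate, and the parser assumes the ten-column attribute layout shown in the dump above.

```python
# Sketch only: flag SMART attributes whose normalized VALUE has used up most
# of its headroom above THRESH. Assumptions (not from smartmontools): a
# nominal best value of 100 and a 25% headroom cut-off.

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 147144821
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
  7 Seek_Error_Rate 0x000f 037 036 030 Pre-fail Always - 10810444186323
 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Yield (name, value, thresh) for each attribute row in the text."""
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        yield fields[1], int(fields[3]), int(fields[5])


def near_threshold(text, cutoff=0.25, nominal=100):
    """Return names of attributes with less than `cutoff` headroom left.

    Headroom is (VALUE - THRESH) / (nominal - THRESH). Rows with a
    threshold of 0 (never trips) or at/above nominal are skipped.
    """
    flagged = []
    for name, value, thresh in parse_attributes(text):
        if not 0 < thresh < nominal:
            continue
        if (value - thresh) / (nominal - thresh) < cutoff:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    print(near_threshold(SAMPLE))
```

With the sample rows taken from the dump above, only Seek_Error_Rate is flagged (7 points of headroom out of 70). Spin_Retry_Count is not flagged even though it is only 3 points above its threshold, because its threshold is set high by design and it still has all of its headroom.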

February 18, 2009

This thread aims to gather more information and research the current issues for those who repaired

and/or updated their affected drives to newer firmware. It also includes an experimental failure

analysis for those still running old firmware and power-cycling (despite the risks of drive failure).

WHY

- Check reports of problems after the drive was updated to new firmware (with or without repair).

- Check reports from some people that repaired their drives using different methods and are now

having issues (ex. BSODs due to read/write errors, bad sectors, corrupted data, unable to format).

- Check first symptoms before drive became BSY/0 LBA (ex. failed read/write operations).

- Verify variables (like temperature and short DSTs) that can affect drive internal event log count.

- Gain a better understanding of, and do more in-depth research on, failures and current issues.

MINIMUM REQUIREMENTS

1. Have an affected model: Seagate Barracuda 7200.11, Barracuda ES.2, SV35.3, SV35.4, DiamondMax 22, ...

2. Have or had affected firmware: AD14, SD15, SD16, SD17, SD18, SD19, MX15, ...

3. The serial number check on the Seagate website reports the drive as affected (enter the serial number in uppercase).

TESTING

If your drive was bricked and then recovered and/or updated to the latest firmware but

you're now experiencing problems (ex. corrupted files, stop 0x0000008E/C000009C BSODs,

unable to read some files, etc.), please fill in the appropriate fields below and reply to this topic:

PREVIOUS ISSUE SYMPTOMS: (ex. none, just rebooted and drive became 0 MB)

CURRENT ISSUE SYMPTOMS: (ex. bad sectors, corrupted files)

LISTED AS AFFECTED: Y/N (put Y if Seagate serial number check tool says it's affected)

REPAIRED: Y/N (if it's a drive recovered from BSY/0 MB)

METHOD: NONE / PAID DR SERVICE / SEAGATE DR (ex. i365) / OTHER: (ex. nickname method)

APPLIED WRONG FIRMWARE BEFORE: Y/N

UPDATED FW AFTER/BEFORE REPAIR: A/B

PREVIOUS FIRMWARE: (ex. SD15)

CURRENT FIRMWARE: (ex. SD1A)

EXTERNAL DRIVE (USB/ESATA/1394): Y/N

Follow the optional steps below to provide a more detailed scope of the problem(s):

1. Get HD Tune 2.55 (for a simple and quick read-only surface-scan).

2. Get smartmontools (for dumping S.M.A.R.T attributes and error logs):

Stable 5.38 release (Win32 at end of page):

http://smartmontools.sourceforge.net/download.html

Latest 5.39 (20090303) release from CVS (may be unstable) compiled for Win32:

http://smartmontools-win32.dyndns.org/smartmontools/

smartmontools-5.39-0-20090303.win32-setup.exe

MD5: 1707c505724e71c24fe023b630e7d4fa

3. Create a batch file, copy and paste the content below, and save it as smartchk.bat:

Using smartmontools 5.38 release

   @echo off
   smartctl -s on -a -q noserial /dev/pd0 >> c:\smartchk.txt

Using smartmontools 5.39 CVS 2009/02/08 (recommended because it provides more information)

   @echo off
   smartctl -s on -a -q noserial /dev/pd0 >> c:\smartchk.txt
   smartctl -l sataphy /dev/pd0 >> c:\smartchk.txt
   smartctl -l scttemp /dev/pd0 >> c:\smartchk.txt
   smartctl -l xerror /dev/pd0 >> c:\smartchk.txt

pd0 for 1st physical drive, pd1 for 2nd and so on. Check under Computer Management > Disk Management.

Support for external drives and RAID is limited. Try adding -d option (option = 3ware, areca, hpt, sat, usbcypress).

4. Run the surface error QUICK scan in HDTune and provide a tiny screenshot (thumbnail).

5. Run the smartchk.bat and provide the information it collects in smartchk.txt.
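The MD5 value listed in step 2 can be checked before running the downloaded installer. A minimal Python sketch, assuming Python is available on the machine and the installer was saved under the file name given above (the helper names `md5_of` and `matches` are made up for illustration):

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return md5_of(path) == expected_hex.strip().lower()
```

For example, `matches("smartmontools-5.39-0-20090303.win32-setup.exe", "1707c505724e71c24fe023b630e7d4fa")` should return True for an intact download. MD5 only guards against a corrupted or mismatched file here; it is not a strong protection against deliberate tampering.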

Links for learning more about S.M.A.R.T and how to interpret attributes/results:

http://en.wikipedia.org/wiki/S.M.A.R.T.

http://www.almico.com/sfarticle.php?id=2

http://www.drivehealth.com/attributes.html

http://www.hdsentinel.com/smart/index.php

S.M.A.R.T DRIVE SELF-TESTS (DST) (fully optional, do them at your own risk)

1. Off-Line Data Collection:

smartctl -t offline /dev/pdX (wait until it completes)

smartctl -l selftest /dev/pdX (to read self-test log)

2. Short Self-Test:

smartctl -t short /dev/pdX (wait until it completes)

smartctl -l selftest /dev/pdX (to read self-test log)

3. Long Self-Test:

smartctl -t long /dev/pdX (wait until it completes)

smartctl -l selftest /dev/pdX (to read self-test log)

4. Conveyance Self-Test (commonly used for testing new drive for shipping damage)

smartctl -t conveyance /dev/pdX (wait until it completes)

smartctl -l selftest /dev/pdX (to read self-test log)

The X in pdX is the physical drive number (0-99). These are the same drive self-tests (DST) as

in SeaTools and can be used for defect reallocation. They can be run under Windows, but the drive

should be idle (preferably with no drive letter mapped) because disk activity can cause a test to abort/fail.

NOTE: DSTs don't write anything to existing data and are generally safe (as long as the firmware doesn't

have bugs in this area), but they stress the drive (i.e. there is no need to run a long DST on a daily basis),

and a backup is always recommended before running any kind of test (the same goes for SeaTools).

EXPERIMENTAL FAILURE ANALYSIS (only for drives still under old firmware affected by BSY/0MB)

Those with an affected drive (model + serial number check on the Seagate website) that is still working

under old firmware (AD14, SD15, SD16, SD17, SD18, SD19, etc.) and still being power-cycled (i.e. for

a firmware update, or because they don't want to apply the new firmware anyway) despite the risks can

help a small failure analysis research effort with the following procedure:

1. Get latest smartmontools 5.39 from CVS or download already compiled for Win32:

http://smartmontools-win32.dyndns.org/smartmontools/

smartmontools-5.39-0-20090303.win32-setup.exe

MD5: 1707c505724e71c24fe023b630e7d4fa

2. Create a batch file, copy and paste the content below, and save it as c:\smartdmp.bat

   @echo off
   smartctl -s on -a -q noserial /dev/pd0 >> c:\smartlog.txt
   smartctl -l sataphy /dev/pd0 >> c:\smartlog.txt
   smartctl -l scttemp /dev/pd0 >> c:\smartlog.txt
   smartctl -l xerror /dev/pd0 >> c:\smartlog.txt
   smartctl -l gplog,0xa1,0+19 /dev/pd0 >> c:\smartlog.txt

pd0 for 1st physical drive, pd1 for 2nd and so on. Check under Computer Management > Disk Management.

Support for external drives and RAID is limited. Try adding -d option (option=3ware, areca, hpt, sat, usbcypress).

3. Run smartdmp.bat every time before a shutdown, or set it up to run automatically (ex. use the

group policy tool: Start > Run > gpedit.msc, and under Computer Configuration > Windows ...

Settings > Scripts click on Shutdown and point it to the batch file). Others may help here.

4. If the drive fails and then gets repaired, please provide the smartlog.txt contents for

analysis. This may help to pinpoint a pattern (from raw attribute values, logs,

incorrect checksum, etc.) for failure prediction and/or a workaround (IMPORTANT: only

provide the log if your drive freezes; PM or a file hosting service is preferred because the

log can become large before a drive failure occurs).
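Because smartdmp.bat only appends to smartlog.txt, successive dumps run together in the file. One possible variation, sketched in Python under the assumption that Python and smartctl are both on the PATH, writes a timestamp header before each run so individual power cycles can be told apart later. The function names and the header format are illustrative and not part of smartmontools.

```python
import datetime
import subprocess

# Same smartctl invocations as smartdmp.bat above, without the device name.
COMMANDS = [
    ["smartctl", "-s", "on", "-a", "-q", "noserial"],
    ["smartctl", "-l", "sataphy"],
    ["smartctl", "-l", "scttemp"],
    ["smartctl", "-l", "xerror"],
    ["smartctl", "-l", "gplog,0xa1,0+19"],
]


def build_commands(device="/dev/pd0"):
    """Return new argument lists with the device appended to each command."""
    return [cmd + [device] for cmd in COMMANDS]


def dump(device="/dev/pd0", log_path=r"c:\smartlog.txt", run=subprocess.run):
    """Append one timestamped dump to log_path.

    `run` is injectable so the function can be exercised without smartctl.
    smartctl uses non-zero exit codes as status bits, so the return code
    is deliberately not treated as an error here.
    """
    with open(log_path, "a", encoding="utf-8", errors="replace") as log:
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        log.write(f"===== dump at {stamp} for {device} =====\n")
        for cmd in build_commands(device):
            result = run(cmd, capture_output=True, text=True)
            log.write(result.stdout)
```

Calling `dump()` from a shutdown script would then take the place of the batch file; the log still grows with every run, so the note about log size applies equally.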

IMPORTANT

- The drive serial number is NOT needed and the dump does not collect it (-q noserial option),

but you should check it against the Seagate web tool. Otherwise it may be a different problem.

- All smartctl tool command-line options are case sensitive and some are release dependent.

- I'm recommending HD Tune 2.55 (not 3.0) over more complex tools (i.e. HDDScan, Victoria, etc.)

because it's very simple to use, does not include any dangerous options (i.e. erase/write), and also

provides temperature monitoring and a screenshot feature. Use more complex tools at your own discretion.

- You should be aware drive may already have some issue before the firmware update or repair.

- Although procedure can be considered safe it's provided AS IS without any warranties/assistance.

February 14, 2009

Oh my...
It seems that there are still some issues with Seabrick...
It is not clear, but from the thread title I suppose it is the SD1A.
Comment of warpandas from 01-30-2009 07:24 AM
http://forums.seagate.com/stx/board/messag...ing&page=10
"I completed a successful firmware upgrade on my ST31000340AS one week ago after receiving problems of an I/O device error in Windows Vista when I tried to access my data on the harddrive. Now, one week later, I am completely unable to access my harddrive. BIOS will not detect it.
Anyone else?"
There is another guy (avivahl) reporting similar problems with ST3500320AS in this same thread.

Correct. There are reports of issues with firmware update and also with repair (not all drives are equal).

I'm working on a topic for troubleshooting firmware update/post-repair issues. If anyone is interested, PM me.

For example, procedures like clearing (read: erasing) the G-List are not hassle-free: if it contains

entries, some problems (bad sectors/data corruption) will appear, since these are defective sectors that

get remapped to spares (large drives can have thousands of spare sectors for reallocation), and/or if the

translator is corrupted after repair (remember, not all drives are equal) some 'repaired' drives may end up like this:

You guys may want to try out this software.
http://smartmontools.sourceforge.net/index.html
The GUI is
GSmartControl
This allows nondestructive testing of the drive and dumping of internal logs.
They mention it has "-d usbcypress" support, but the latest version I got
doesn't seem to understand this option, nor is it in the command line -h listing.
I have a Western Digital 250 GB IDE drive attached to a Sabrent IDE -> USB adaptor.
The software doesn't see this drive, although GSmartControl thinks there is something there (the icon).
There is no way at the command line to scan one's system and list all drive devices instead of guessing
them one by one.
There's also a daemon monitor called smartd.
Failure analysis article listings, including the one on "Myth or Metric":
http://smartmontools.sourceforge.net/links.html

Here is latest release from CVS compiled for Win32, but may be unstable:

http://rapidshare.com/files/197069196/smar...win32-setup.exe

http://www.megaupload.com/?d=HY8P3FW0

MD5: ffea4156be1a490daa3205bb07f893f2

Dumping attributes and logs is fine but doing DSTs (they are the same as in SeaTools) can be a

problem if the drive has wrong translation parameters (it may incorrectly reallocate non-defective

sectors) or old firmware (may trigger boot of death since it increases drive Event Log count).

I'm working on a 7200.11 failure analysis research. Requirement is old firmware and setting up

a script or policy to run it on every power-cycle, if it fails the results may pinpoint a pattern (from

raw attribute values, logs, incorrect checksum, etc) that can be used for prediction/workaround.

smartdmp.bat

@echo off
smartctl -s on -a -q noserial /dev/pd0 >> c:\smartlog.txt
smartctl -l sataphy /dev/pd0 >> c:\smartlog.txt
smartctl -l scttemp /dev/pd0 >> c:\smartlog.txt
smartctl -l xerror /dev/pd0 >> c:\smartlog.txt

pd0 for 1st physical drive, pd1 for 2nd and so on. Check under Computer Management > Disk Management.

Support for external drives and RAID is limited. Try adding -d option (option=3ware, areca, hpt, sat, usbcypress)

Most of these options will only work correctly with v5.39 experimental release from CVS (see Win32 links above).

I think Seagate may end-up adding the old Maxtor liability sticker into their new drives...

February 6, 2009

I was about to ask the same question. Aviko, since you are the expert here, do you have

any idea where this incorrect test machine pattern is and whether it can be changed?

Since the event log is said to be directly related to S.M.A.R.T operations, my 2

cents is that the count (i.e. 320 or 320 + any multiple of 256) can be affected/exploited

by running short self-tests or off-line data collection.

I made a very simple procedure (using smartmontools 5.39) for an almost complete S.M.A.R.T

dump (including some vendor logs: 0xa1, 0xa2, etc.) without the drive serial number at every shutdown,

for anyone keeping an affected drive under old firmware (AD14, SD15, SD16, MX15, etc.) and

interested in a small failure analysis research. Once the drive bricks again and is fixed, the

information may help to develop a simple workaround that predicts or avoids drive failure.

Dumping some information at the terminal before fixing a bricked drive may also help.

Level T (/T):
V4 (dump G-List)
Level 1 (/1):
N5 (dump SMART attributes)
N6 (dump SMART thresholds)
N7 (dump G-List)
N8 (dump critical event log)

Just a thought. Has anyone who recovered their disk from the BSY state and not upgraded the firmware experienced this problem again?
If the problem is related to data being left by the test machine, and the solution presented here zeros the SMART log area, does this solution mitigate the original problem? Is it wiping the affected areas?
I have 10 MX15 disks that are sitting in a bag waiting to be re-installed in a machine. It's a fingers crossed type of wait as I have no idea if any or all of them will come back up, but the reports I'm seeing on the SD/MX1A firmware appear to have other "issues" with sleep times and timeouts, while the MX15 has been ok for me. I was wondering if a pre-emptive wipe of the areas required to recover from the BSY state might prevent it from actually occurring ?

How about downgrading it to AD15? When you connect to the terminal, do you have only the PCB, or the disk with the PCB?

January 29, 2009

Questions:
What is [DataPattern] in the Level T 'm' command?
Can SD1A bricks be repaired with the new command table?

They should work as long as you are dealing with the same issue, but SD1A

fixes that, so the cause of (and solution to) any 'bricking' would then be something different. About

[DataPattern], I would guess the name says what it does (create/fill a data pattern).

Updated my previous post #1045 to shed some light around root cause and S.M.A.R.T.

January 28, 2009

Let's look again at the root cause description in a slightly clearer way...

An affected drive model and firmware will trigger an assert failure (ex. drive not detected

at BIOS) on the next power-up initialization, due to the event log pointer getting past

the end of the event log data structure (reserved area track data corruption), if the

drive contains a particular data pattern (from the factory test mess) and if the

Event Log counter is at entry 320 or at 320 + x*256 (320 plus any multiple of 256).
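The counter condition described above is simple arithmetic. As a worked sketch in Python (the function name is made up, and nothing here reads a value from a drive; it only illustrates which counter values the description calls critical):

```python
def is_critical_count(count):
    """True if count is 320 or 320 + x*256 for a non-negative integer x."""
    return count >= 320 and (count - 320) % 256 == 0


# Critical values below 1200: 320, 576, 832, 1088
critical = [n for n in range(1200) if is_critical_count(n)]
```

The `count >= 320` guard matters: 64 also satisfies the modulo test (64 = 320 - 256), but the description only covers 320 and the multiples of 256 after it.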

My question:
Maxtorman, is the log file written after each power-up (or POR) and before each shut down? It seems to me the #320 is being reached by many users in about 100 days... can that really be from only occasional events like bad sectors and missed writes? See this time histogram:
http://www.msfn.org/board/index.php?showto...st&p=826575
Maxtorman's response:
The log, if my information is correct, is written each time a SMART check is done. This will always happen on drive init, but can also happen at regularly scheduled events during normal usage, as the drive has to go through various maintenance functions to keep it calibrated and working properly.
_______________________

Event log counter could be written every once in a while for example if S.M.A.R.T automatic

off-line data collection (ex. every 4h) is enabled (it is by default and may include a list of

last few errors like the example below), temperature history, seek error rate and others.

smartctl -l error /dev/sda (data below is an example)

SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9 occurred at disk power-on lifetime: 6877 hours (286 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 ff ff ff af 00	  02:00:24.339  FLUSH CACHE EXT
  35 00 10 ff ff ff ef 00	  02:00:24.137  WRITE DMA EXT
  35 00 08 ff ff ff ef 00	  02:00:24.136  WRITE DMA EXT
  ca 00 10 77 f7 fc ec 00	  02:00:24.133  WRITE DMA
  25 00 08 ff ff ff ef 00	  02:00:24.132  READ DMA EXT

Error 8 occurred at disk power-on lifetime: 4023 hours (167 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 03 80 01 32 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 00 00 00 00 a0 02   2d+04:33:54.009  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 a0 02   2d+04:33:54.001  IDENTIFY DEVICE
  00 00 00 00 00 00 00 ff   2d+04:33:53.532  NOP [Abort queued commands]
  a1 00 00 00 00 00 a0 02   2d+04:33:47.457  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 a0 02   2d+04:33:47.445  IDENTIFY DEVICE

... list goes on until error 5

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This means that, theoretically, disabling S.M.A.R.T automatic off-line self-test and attribute auto

save (something like: smartctl -s on -o off -S off /dev/sdX), disabling S.M.A.R.T at the system BIOS (before

powering up the drive again), or even disabling the whole S.M.A.R.T feature set could be

a workaround (crippling S.M.A.R.T would not be a permanent solution because it helps

to detect/log drive errors) until the drive firmware is updated.

smartctl -l directory /dev/sda

Log Directory Supported (this one is from an affected model)

SMART Log Directory Logging Version 1 [multi-sector log support]
Log at address 0x00 has 001 sectors [Log Directory]
Log at address 0x01 has 001 sectors [Summary SMART error log]
Log at address 0x02 has 005 sectors [Comprehensive SMART error log]
Log at address 0x03 has 005 sectors [Extended Comprehensive SMART error log]
Log at address 0x06 has 001 sectors [SMART self-test log]
Log at address 0x07 has 001 sectors [Extended self-test log]
Log at address 0x09 has 001 sectors [Selective self-test log]
Log at address 0x10 has 001 sectors [Reserved log]
Log at address 0x11 has 001 sectors [Reserved log]
Log at address 0x21 has 001 sectors [Write stream error log]
Log at address 0x22 has 001 sectors [Read stream error log]
Log at address 0x80 has 016 sectors [Host vendor specific log]
Log at address 0x81 has 016 sectors [Host vendor specific log]
Log at address 0x82 has 016 sectors [Host vendor specific log]
Log at address 0x83 has 016 sectors [Host vendor specific log]
Log at address 0x84 has 016 sectors [Host vendor specific log]
Log at address 0x85 has 016 sectors [Host vendor specific log]
Log at address 0x86 has 016 sectors [Host vendor specific log]
Log at address 0x87 has 016 sectors [Host vendor specific log]
Log at address 0x88 has 016 sectors [Host vendor specific log]
Log at address 0x89 has 016 sectors [Host vendor specific log]
Log at address 0x8a has 016 sectors [Host vendor specific log]
Log at address 0x8b has 016 sectors [Host vendor specific log]
Log at address 0x8c has 016 sectors [Host vendor specific log]
Log at address 0x8d has 016 sectors [Host vendor specific log]
Log at address 0x8e has 016 sectors [Host vendor specific log]
Log at address 0x8f has 016 sectors [Host vendor specific log]
Log at address 0x90 has 016 sectors [Host vendor specific log]
Log at address 0x91 has 016 sectors [Host vendor specific log]
Log at address 0x92 has 016 sectors [Host vendor specific log]
Log at address 0x93 has 016 sectors [Host vendor specific log]
Log at address 0x94 has 016 sectors [Host vendor specific log]
Log at address 0x95 has 016 sectors [Host vendor specific log]
Log at address 0x96 has 016 sectors [Host vendor specific log]
Log at address 0x97 has 016 sectors [Host vendor specific log]
Log at address 0x98 has 016 sectors [Host vendor specific log]
Log at address 0x99 has 016 sectors [Host vendor specific log]
Log at address 0x9a has 016 sectors [Host vendor specific log]
Log at address 0x9b has 016 sectors [Host vendor specific log]
Log at address 0x9c has 016 sectors [Host vendor specific log]
Log at address 0x9d has 016 sectors [Host vendor specific log]
Log at address 0x9e has 016 sectors [Host vendor specific log]
Log at address 0x9f has 016 sectors [Host vendor specific log]
Log at address 0xa1 has 020 sectors [Device vendor specific log]
Log at address 0xa8 has 020 sectors [Device vendor specific log]
Log at address 0xa9 has 001 sectors [Device vendor specific log]
Log at address 0xe0 has 001 sectors [Reserved log]
Log at address 0xe1 has 001 sectors [Reserved log]

It may also be (theoretically) possible to check whether the 'specific data pattern' is present in the system

area IF it can be read from SMART log pages (using the standard ATA interface/specification),

so this could be used to create a simple (multi-platform) tool for verifying whether a particular

drive is actually affected by the issue, and maybe even as a workaround IF

the wrong pattern data or event counter can be changed (i.e. read/write).

January 27, 2009

Finally, here are the "secret" details of the failure root cause (no NDAs were hurt in the process).

Customer update :
Seagate has isolated a potential firmware issue in certain products, including some Barracuda 7200.11 hard drives and related drive families based on their product platform*, manufactured through December 2008. In some circumstances, the data on the hard drives may become inaccessible to the user when the host system is powered on. Retail products potentially affected include the Seagate FreeAgent® Desk and Maxtor OneTouch® 4 storage solutions.
As part of our commitment to customer satisfaction, we are offering a free firmware upgrade to those with affected products. To determine whether your product is affected, please visit the Seagate Support web site at http://seagate.custkb.com/seagate/cr...p?DocId=207931.
Support is also available through Seagate's call center: 1-800-SEAGATE (1-800-732-4283)
Customers can expedite assistance by sending an email to Seagate (discsupport*seagate.com). Please include the following disk drive information: model number, serial number and current firmware revision. We will respond, promptly, to your email request with appropriate instructions.
For a list of international telephone numbers to Seagate Support and alternative methods of contact, please access
http://www.seagate.com/www/en-us/about/contact_us/
*There is no safety issue with these products.
Description
An issue exists that may cause some Seagate hard drives to become inoperable immediately after a power-on operation. Once this condition has occurred, the drive cannot be restored to normal operation without intervention from Seagate. Data on the drive will be unaffected and can be accessed once normal drive operation has been restored. This is caused by a firmware issue coupled with a specific manufacturing test process.
Root Cause
This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.
The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention.
For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.
Corrective Action
Seagate has implemented a containment action to ensure that all manufacturing test processes write the same "benign" fill pattern. This change is a permanent part of the test process. All drives with a date of manufacture January 12, 2009 and later are not affected by this issue as they have been through the corrected test process.
Recommendation
Seagate strongly recommends customers proactively update all affected drives to the latest firmware. If you have experienced a problem, or have an affected drive exhibiting this behavior, please contact your appropriate Seagate representative. If you are unable to access your data due to this issue, Seagate will provide free data recovery services. Seagate will work with you to expedite a remedy to minimize any disruption to you or your business.
FREQUENTLY ASKED QUESTIONS (FAQ)
Q: What Seagate drives are affected by this "drive hang after power cycle" issue?
A: The following product types may be affected by this problem:
Barracuda 7200.11, Barracuda ES.2 (SATA), DiamondMax 22, FreeAgent Desk, Maxtor OneTouch 4, Pipeline HD, Pipeline HD Pro, SV35.3, and SV35.4. While only some percentage of the drives will be susceptible to this issue, Seagate recommends that all drives in these families be updated to the latest firmware!
Q: What should I do if I think I have a Seagate drive affected by this issue?
A: Since only some drives have this problem, there is a high likelihood your drive is working and will continue to work perfectly. However, Seagate recommends that all drives in the effected families be update d to the latest firmware as soon as possible. Seagate realizes this recommendation may present challenges for some customers, particularly those with large distributed installed bases. Seagate will work with customers to correct this problem, but requests customers take the following initial actions depending on what type of customer they are. For individual end-users, please contact Seagate Technical Support via web, phone or email.
http://seagate.custkb.com/seagate/cr...p?DocId=207931 or 1-800-SEAGATE (1 800 732-4283), or discsupport@seagate.com. If emailing, please include the following disk drive information: model number, serial number and current firmware revision.
Q. If my drives are always on, could I see this issue?
A. No, this can only occur after a power cycle. However, Seagate still recommends that you upgrade your firmware because of unforeseen power events such as power loss.
Q: How will Seagate help me if I lost data on this drive?
A. There is no data loss in this situation. The data still resides on the drive but is inaccessible to the end user. If you are unable to access your data due to this issue, Seagate will provide free data recovery services. Seagate will work with you to expedite a remedy to minimize any disruption to you or your business.
Q. Does this affect all drives manufactured through January 2008?
A. No, this only affects products that were manufactured through a specific test process in combination with a specific firmware issue.
Q. Why has it taken so long for Seagate to find this issue on Barracuda ES.2 and SV35?
A. In typical nearline and surveillance operating environments, drives are not power cycled and so are not as likely to experience this issue.
Q. Does this affect the Barracuda ES.2 SAS drive?
A. No, the SATA and SAS drives have different firmware.
Q. How will my RAID-set be affected?
A. If the error occurs, the drive will drop offline after a power cycle. The RAID will go into the defined host-specific recovery actions, which will result in the RAID operating in a degraded mode or initiating a rebuild if a hot spare is available. If you are unsure how your host will respond to a dropped drive and have not yet experienced this issue, avoid unnecessary power cycles and refer to your system manufacturer or support for the appropriate instructions.
Q. Is there a way to upgrade the firmware to my drives if they are in a large RAID-set, or do I need to take the solution offline?
A. The ability to upgrade firmware in a RAID array is system dependent. Refer to your system manufacturer for upgrade instructions.
Q. How can I tell which Barracuda ES/SV35 drives are affected?
A. 1) Check the "Drive model #" against the list of affected models below, or
2) check the PN of the drive against the PN list below, or
3) call Seagate Technology support services at 1-800-SEAGATE (1 800 732-4283), or email discsupport@seagate.com.
If it is an SV35 SATA drive and it is affected, new firmware will be available 1/23/09.
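
As a first pass before looking up the model and PN lists, the identity lines of a smartctl dump (like the one quoted earlier in this thread) can be checked against the product families named in the FAQ. This is only a rough sketch: the function names are made up for illustration, and matching by substring on the "Model Family" line is an assumption about how smartctl labels these drives. A family match does not mean the drive is affected; the FAQ still says to confirm against the model/PN lists or with Seagate support.

```python
# Product families named in the Seagate FAQ. Substring matching on the
# "Model Family" line is an assumption, not something Seagate documents.
# Note: the FAQ says only the SATA version of the Barracuda ES.2 is affected.
AFFECTED_FAMILIES = (
    "Barracuda 7200.11",
    "Barracuda ES.2",
    "DiamondMax 22",
    "FreeAgent Desk",
    "OneTouch 4",
    "Pipeline HD",      # also matches "Pipeline HD Pro"
    "SV35.3",
    "SV35.4",
)


def parse_identity(smartctl_text):
    """Collect the 'Key: value' lines of smartctl-style output into a dict."""
    fields = {}
    for line in smartctl_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def check_drive(smartctl_text):
    """Report model, firmware and whether the family is one the FAQ names."""
    fields = parse_identity(smartctl_text)
    family = fields.get("Model Family", "")
    return {
        "model": fields.get("Device Model", "unknown"),
        "firmware": fields.get("Firmware Version", "unknown"),
        "family_listed": any(name in family for name in AFFECTED_FAMILIES),
    }


# Identity lines copied from the example dump posted earlier in the thread.
SAMPLE = """\
Model Family: Seagate Barracuda 7200.11
Device Model: ST3500320AS
Firmware Version: SD1A
"""

if __name__ == "__main__":
    print(check_drive(SAMPLE))
```

The model and firmware strings it prints are the same details Seagate asks for when emailing support, along with the serial number.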

January 27, 2009

I agree that not every Seagate drive (even if it's a 7200.11) with a failure at BIOS detection can be linked to the 'boot of death' issue, and an attempt to repair the wrong problem may result in data loss or even worse damage. But the thread is specific: it has many users with frozen drives that match the basic requirements (model, firmware version and serial number), and some are willing to take the risk...

It seems there has been some tension in the air because of the terminal procedure posted here, but rtos has already been available on the net for many years (almost since the current firmware evolved from Conner), and it's no big deal since the risks were clearly stated and only some will take their chances (a few may end up frying a PCB or losing data if something goes wrong, e.g. a bad connection). Most will prefer sending their drives in for the free recovery/repair option now being offered by Seagate.

I don't think full knowledge of the 'boot-of-death' details would give virus writers a blueprint, as that could already be done for years with the flash code, and most of today's malware favors information and networks over the old destructive payload.

Whether it was a defective test machine (read: someone who has just lost his job) or a firmware bug does not matter to end users who were caught by surprise. As for the percentage of affected drives, that is something only Seagate or an external audit may know for sure...

Media coverage is much more damaging than any obscure info or firmware bugs, and Seagate took too much time to act. The result was overloaded staff, angry customers with downtime, a serial number tool that failed and a firmware correction that messed up the internal validation process, and I'm sure that some customers who paid a premium fee for third-party recovery services feel betrayed.

Seagate general support (chat, toll-free, RMA process, etc.) is far from perfect (e.g. canned responses) but it's good when compared to other manufacturers. The 5-year warranty for desktop (non-enterprise) items WAS a plus (for new desktop drives after Jan/03/2009 it has been changed to 3 years), and in some cases they will replace a failed drive with a better/bigger refurbished one to avoid losing a customer.

I hope they make the firmware open-source so it can be improved and the bug tracking process becomes more flexible/reliable. SSDs will catch on in the next few years: no moving parts and a lower cost to manufacture (but they're not failure free). First in mobiles (where frequent standby/parking cycles are a big problem regardless of drive brand) and later in mainstream desktop/enterprise.

Cheers David. SanTools SMARTMon-UX is a great tool, and some people here (many had their drives affected) got p***ed off when you minimized the problem and teased everyone by saying that you know the failure's root cause details but are not going to tell because it's a dark secret under NDA...

The obvious question comes to mind .. how do you know your disk suffers from boot-of-death, and not something like a circuit board failure or massive head crash?

January 25, 2009

In the old days (back in 2000) I used to hack firmware for Pioneer burners (DVR-Axx family), as you can see here:
http://gradius.rpc1.org

Those old days remind me of "conversions" through firmware patches (e.g. Liteon SOHW-812S to Liteon 832S). That makes me wonder if it's possible to convert/flash an ST3500320AS to an ST3500320NS (Enterprise) using firmware SN06C (or an ST31000340AS to an ST31000340NS).

I'm becoming more and more clueless: I have one of the famous ST3500320AS (which didn't fail as it was running 24/7), which I bought to replace a failed Samsung HD501LJ (failed after 478 hours of use, already 3 reallocated sectors and 71 pending sectors, and unable to read important sectors at all). I have now tried WD disks and bought an RE3 (to be used in RAID) and a 640GB Caviar Black (WD6401AALS)... I immediately returned the RE3 as it was sold without warranty... And I will return the Caviar Black because I cancelled the bad block scan after just 32GB (of 640 GB): between 73 and 125 uncorrected read errors reported to the OS, 208 raw read errors, 2 reallocated sectors, 2 current pending sectors, and the log shows "UNC" unrecoverable data errors.
Up to now only laptop 2.5" drives failed on me after years of use, which I can understand due to shock and wear.
I had very good experience with Seagate in the past (mainly SCSI drives like Cheetah 15k) but the current 7200.11 events made me reluctant to stay with Seagate (even though these stories were regarding firmware and not data on the platters). Both Samsung and WD failed with lots of unrecoverable data errors on brand new drives...
So... Which hard disk manufacturer and which drive models can I rely on?
73, Arnd

I still have a 7-year-old 36 GB Atlas II alive and running. SCSI has always had higher reliability than (S)ATA drives (although higher reliability does not mean failure free; I was myself a victim of a Micropolis 2217 crash a very long time ago). SATA is the cheap bandwagon of reliability and you could still rely on Seagate SAS/SCSI (all manufacturers have an issue at some point).

January 20, 2009

Last year (2008) was bad for them and this year's results are going to be a lot worse... If investors keep pulling out their money they may end up like Micropolis or get absorbed by a Korean competitor (and become Seasung...). Anyway, their Barracuda 7200 series is now pretty much commercially condemned (who's gonna buy a 7200.13?) and there's no firmware fix for that.

Meanwhile at Seagate:

Seagate's stock was almost $30/share back on 10/13/2003.
Today, it is as low as $4.11/share!!
The HDD problems started around July 2008, look at this incredible coincidence!
It started to drop on 6/2/2008 from $22.27:
http://moneycentral.msn.com/investor/chart...p;CP=0&PT=7

January 19, 2009

I haven't had time to read everything yet, but the firmware cannot retrieve the current time/date from your computer. It is probably related to spin-up times, power-on hours or something else (like the G-List) that degrades over time...

I think Seagate should make their firmware open-source AS IS, even if this requires applicants to waive the warranty, so developers and beta-testers can improve their product, fix bugs, test and even offer better support (currently their tech support is overloaded with this 7200.11 issue and even the serial number application is down).

They should at least describe exactly what caused the problem and what was done to fix it, because currently there is no guarantee that a patched drive with newer firmware will not fail again after 2-3 years for the same reasons.

A few years ago Maxtor (and Quantum) were plagued by translation and firmware corruption issues, and they didn't learn from it (Maxtor was acquired by Seagate)? Maybe some of their employees were trying to protect their jobs by closing their eyes to this issue...

News:

http://stocks.us.reuters.com/stocks/keyDev...=20090114192200


