
Disk drive failures up to 15 times what vendors say, study says

Drive vendors declined to be interviewed

Customers are replacing disk drives at rates far higher than those suggested by the estimated mean time between failure (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University.

The study, presented last month at the 5th USENIX Conference on File and Storage Technologies in San Jose, also shows no evidence that Fibre Channel (FC) drives are any more reliable than less expensive but slower performing Serial ATA (SATA) drives.

That surprising comparison of FC and SATA reliability could speed the trend away from FC to SATA drives for applications such as near-line storage and backup, where storage capacity and cost are more important than sheer performance, analysts said.

At the same conference, another study of more than 100,000 drives in data centers run by Google Inc. indicated that temperature seems to have little effect on drive reliability, even as vendors and customers struggle to keep temperature down in their tightly packed data centers. Together, the results show how little information customers have to predict the reliability of disk drives in actual operating conditions and to choose among various drive types.

Real world vs. data sheets

The Carnegie Mellon study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for those drives listed MTBF of between 1 million and 1.5 million hours, which the study said should mean annual failure rates "of at most 0.88 percent." However, the study showed typical annual replacement rates of between 2 and 4 percent, "and up to 13 percent observed on some systems."
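The arithmetic behind the 0.88 percent figure can be sketched in a few lines. This is an illustration, not the study's own code: it assumes a drive is powered on all year (8,760 hours) and uses the common approximation that annual failure rate is hours per year divided by MTBF. The function name and constants are made up for this example.

```python
# Assumption: drives run continuously, so one year is 24 * 365 powered-on hours.
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_failure_rate(mtbf_hours: float) -> float:
    """Approximate annual failure rate (as a fraction) implied by a data-sheet MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


# Data-sheet MTBF range quoted in the article.
afr_low_mtbf = annual_failure_rate(1_000_000)   # worst case on the data sheet
afr_high_mtbf = annual_failure_rate(1_500_000)  # best case on the data sheet

print(f"MTBF 1.0M hours -> {afr_low_mtbf:.2%} per year")
print(f"MTBF 1.5M hours -> {afr_high_mtbf:.2%} per year")

# Replacement rates reported in the article, as fractions per year.
for observed in (0.02, 0.04, 0.13):
    ratio = observed / afr_low_mtbf
    print(f"observed {observed:.0%} is about {ratio:.1f}x the data-sheet rate")
```

On these assumptions a 1 million hour MTBF works out to roughly 0.88 percent a year, so the typical 2 to 4 percent replacement rates are a little over two to four and a half times the data-sheet figure, and the 13 percent worst case is close to 15 times.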

Garth Gibson, associate professor of computer science at Carnegie Mellon and co-author of the study, was careful to point out that the study didn't necessarily track actual drive failures, but cases in which a customer decided a drive had failed and needed replacement. He also said he has no vendor-specific failure information, and that his goal is not "choosing the best and the worst vendors" but to help them to improve drive design and testing.

He echoed storage vendors and analysts in pointing out that as many as half of the drives returned to vendors actually work fine, and that drives which do fail can do so for any number of reasons, such as a harsh environment at the customer site or intensive, random read/write operations that cause premature wear to the mechanical components in the drive.

Several drive vendors declined to be interviewed. "The conditions that surround true drive failures are complicated and require a detailed failure analysis to determine what the failure mechanisms were," said a spokesperson for Seagate Technology in Scotts Valley, Calif., in an e-mail. "It is important to not only understand the kind of drive being used, but the system or environment in which it was placed and its workload."

"Regarding various reliability rate questions, it's difficult to provide generalities," said a spokesperson for Hitachi Global Storage Technologies in San Jose, in an e-mail. "We work with each of our customers on an individual basis within their specific environments, and the resulting data is confidential."

Ashish Nadkarni, a principal consultant at GlassHouse Technologies Inc., a storage services provider in Framingham, Mass., said he isn't surprised by the comparatively high replacement rates because of the difference between the "clean room" environment in which vendors test and the heat, dust, noise or vibrations in an actual data center.

He also said he has seen overall drive quality falling over time as the result of price competition in the industry. He urged customers to begin tracking disk drive records "and to make a big noise with the vendor" to force them to review their testing processes.

FC vs. SATA

While a general reputation for increased reliability (as well as higher performance) is one of the reasons FC drives cost as much as four times more per gigabyte than SATA, "We had no evidence that SATA drives are less reliable than the SCSI or Fibre Channel drives," said Gibson. "I am not suggesting the drive vendors misrepresented anything," he said, adding that other variables such as workloads or environmental conditions might account for the similar reliability finding.

Analyst Brian Garrett at the Enterprise Storage Group in Milford, Mass., said he's not surprised because "the things that can go wrong with a drive are mechanical -- moving parts, motors, spindles, read-write heads," and these components are usually the same whether they are used in a SCSI or SATA drive. The electronic circuits around the drive and the physical interface are different, but are much less prone to failure.

Vendors do perform higher levels of testing on FC than on SATA drives, he said, but according to the study that extra testing hasn't produced "a measurable difference" in reliability.

Such findings might spur some customers to, for example, buy more SATA drives to provide more backup or more parity drives in a RAID configuration to get the same level of data protection for a lower price. However, Garrett cautioned, SATA continues to be best suited for applications such as backup and archiving of fixed content (such as e-mail or medical imaging) that must be stored for long periods of time but accessed quickly when it is needed. FC will remain the "gold standard" for online applications such as transaction processing, he predicts.

Don't sweat the heat?

The Google study examined replacement rates of more than 100,000 serial and parallel ATA drives deployed in Google's own data centers. Similar to the CMU methodology, a drive was considered to have failed if it was replaced as part of a repair procedure (rather than as part of an upgrade to a larger drive).

Perhaps the most surprising finding was no strong correlation between higher operating temperatures and higher failure rates. "That doesn't mean there isn't one," said Luiz Barroso, an engineer at Google and co-author of the paper, but it does suggest "that temperature is only one of many factors affecting the disk lifetime."

Garrett said that rapid changes in temperature -- such as when a malfunctioning air conditioner is fixed after a hot weekend and rapidly cools the data center -- can also cause drive failures.

The Google study also found that no single parameter, or combination of parameters, produced by the SMART (Self-Monitoring Analysis and Reporting Technology) built into disk drives is actually a good predictor of drive failure.

The bottom line

For customers running anything smaller than the massive data centers operated by Google or a university, though, the results might make little difference in their day-to-day operations. For many customers, the price of replacement drives is built into their maintenance contracts, so the drives' expected service life becomes an issue only when the equipment goes off warranty and the customer must decide whether to "try to eke out another year or two" before the drive fails, said Garrett.

The studies won't change how Tom Dugan, director of technical services at Recovery Networks, a Philadelphia-based business continuity services provider, protects his data. "If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."


