Disk drives lack reliable failure model

You should absolutely not depend solely on RAID 5

In storage circles, the two papers on disk drive reliability presented recently at FAST '07 have generated much discussion. Other columnists and bloggers, such as Frank Hayes and Robin Harris, have already covered them well. Rather than repeat the details, I'd like to look at what the findings imply for service level commitments made against the storage infrastructure.

In tiered storage architectures, distinctions among service levels are commonly based on attributes like performance and availability. Given the findings of these studies, it's worthwhile to review service levels and the design of supporting storage tiers.

Of the various findings, two stand out in this regard. The first is the lack of a reliable model for predicting failure. The Google study, which examined attributes such as age, heat, access, and SMART diagnostic data in consumer drives, found that many drives failed without prior indication. The Carnegie Mellon (CMU) study does suggest that age is a factor in reliability, but that it becomes significant far sooner than expected, in as little as two years. So, while the probability of a drive failing increases as it ages, the only meaningful actions available from a service delivery perspective are to continue with regular tech refreshes (e.g., a 3-year cycle) and perhaps to institute a process to record and analyse disk failures as these studies did, but tailored to the particular environment.
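As a rough illustration of what recording and analysing failures by age might look like, the sketch below tallies failures per drive-year of service for each year of drive age. The function name, the input layout, and every number in it are made up for the example; none of them come from the Google or CMU papers.

```python
from collections import defaultdict


def annual_failure_rate_by_age(drive_years, failures):
    """Return {age: failures per drive-year} for each age bucket.

    drive_years: {age_in_years: total drive-years of service observed at that age}
    failures:    iterable of ages (whole years) at which drives failed
    Both structures are hypothetical; adapt them to your own asset records.
    """
    counts = defaultdict(int)
    for age in failures:
        counts[age] += 1
    return {
        age: counts[age] / exposure
        for age, exposure in sorted(drive_years.items())
        if exposure > 0  # skip age buckets with no drives in service
    }


# Invented figures, for illustration only.
exposure = {0: 400.0, 1: 380.0, 2: 350.0, 3: 200.0}
failed_at_age = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]

rates = annual_failure_rate_by_age(exposure, failed_at_age)
for age, rate in rates.items():
    print(f"year {age}: {rate:.1%} failures per drive-year")
```

Dividing by drive-years in service rather than by drive count keeps the rates comparable when the population shrinks as older drives are retired.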

Second, if you are making commitments of availability greater than three nines (99.9%), the CMU study confirms what hopefully you already know: you absolutely should not depend solely on RAID 5. The increased likelihood of failure among related drives found in the study, combined with the increasingly long rebuild times required for the current crop of high capacity drives, creates a risk of data loss that should not be ignored. In fact, I would suggest that either replication or host-based volume management mirroring to another storage system be implemented to support these availability levels. If this is not feasible, then within a single storage array you should consider improving availability through mirroring (e.g. RAID 10, or RAID 51, which mirrors RAID 5 sets) or dual parity (e.g. RAID 6).
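To put rough numbers on this, the sketch below works out the yearly downtime budget implied by an availability target and the chance that a second drive in a RAID 5 set fails before a rebuild completes. The 3% annual failure rate, the 8-drive set, and the rebuild times are illustrative assumptions, not figures from the studies. The model also assumes drives fail independently at a constant rate, so given the correlated failures noted above it should be read as an optimistic lower bound.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability):
    """Hours of downtime per year permitted by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def p_failure_during_rebuild(surviving_drives, afr, rebuild_hours):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent failures at a constant rate derived from `afr`
    (annual failure rate per drive). Correlated failures make this worse.
    """
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving_drives * rate_per_hour * rebuild_hours)


for target in (0.999, 0.9999, 0.99999):
    hours = downtime_budget_hours(target)
    print(f"{target:.3%} availability allows {hours:.2f} h/year of downtime")

# Illustrative values only: an 8-drive RAID 5 set (7 survivors), 3% AFR.
for rebuild in (6, 24, 72):
    p = p_failure_during_rebuild(7, 0.03, rebuild)
    print(f"{rebuild:>2} h rebuild: {p:.4%} chance of a second failure")
```

In this model the exposure grows roughly in proportion to both the rebuild time and the number of surviving drives, which is why larger drives and wider RAID 5 sets erode the margin.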

Disk drives are miraculous devices and, current headlines to the contrary, they are incredibly reliable given what they do. But when you have hundreds or thousands of them spinning continuously, some number of failures is unavoidable. Understanding the risks, reviewing service commitments, and being prepared for the inevitable are all essential.
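The arithmetic behind that last point is simple enough to sketch. Assuming a per-drive annual failure rate (the 3% used here is an illustrative placeholder, not a measured value) and independent failures, the expected number of failures scales with fleet size, and the chance of getting through a year with none quickly approaches zero.

```python
def expected_failures(drive_count, afr):
    """Expected number of drive failures per year across a fleet."""
    return drive_count * afr


def p_at_least_one_failure(drive_count, afr):
    """Probability of one or more failures in a year, assuming independence."""
    return 1.0 - (1.0 - afr) ** drive_count


ASSUMED_AFR = 0.03  # placeholder value; substitute your own observed rate

for fleet in (10, 100, 1000):
    print(
        f"{fleet:>4} drives: expect {expected_failures(fleet, ASSUMED_AFR):.1f} "
        f"failures/year, P(at least one) = "
        f"{p_at_least_one_failure(fleet, ASSUMED_AFR):.1%}"
    )
```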

Jim Damoulakis is chief technology officer of GlassHouse Technologies Inc., a leading provider of independent storage services. He can be reached at jimd@glasshouse.com.





