What is Time Limited Error Recovery?

Avoiding duplicate error recovery responsibility

[We have taken this article from Western Digital as it is a clear exposition of a problem, and a response to it, that can save needless ATA RAID array problems caused by duplicate responsibilities for error recovery]

Meeting the demands of the enterprise environment means finding ways to improve performance, improve compatibility, decrease down time and reduce total cost of ownership. To that end, Western Digital has introduced Time-Limited Error Recovery (TLER) in its WD Caviar RAID Edition (RE) hard drives to improve coordinated error handling between hard drives and RAID controllers.

The Problem
Desktop hard drives are designed on the assumption that no RAID card is present. Like all hard drives, they include error correction, such as the ability to handle write errors and reallocate around bad blocks. During error correction, a desktop hard drive does not issue error messages or respond to commands from the adapter. Because the design assumes there is no RAID controller to help with error recovery, the drive does everything possible to complete error correction on its own. The difficulty comes when error correction takes longer than 8 seconds: the RAID controller assumes the non-responding disk has failed and drops it from the RAID volume.

ATA drives being “dropped” from a RAID volume was an often-heard complaint, regardless of the hard drive's manufacturer. This error handling “mis-coordination” is encountered when drives are under a high I/O load, as in a video surveillance server, a busy e-mail server, or a busy web server. Under high I/O load, error recovery takes longer, so a recovery is significantly more likely to exceed the typical 8-second RAID timeout.

When a drive is under a continuous I/O load and performs its own error recovery, the recovery can easily exceed 8 seconds. During that time, a normal desktop hard drive does not respond. RAID cards will typically wait 8 seconds for a drive to respond, and if the drive does not respond, they are programmed to take action, which means dropping the drive from the volume. The “mis-coordination” of error handling arises because desktop drives are programmed to take responsibility for all error recovery, while RAID cards are also programmed to take responsibility for error recovery.
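
To make the timing concrete, the controller's decision can be sketched in a few lines of Python. This is an illustrative model only, not vendor firmware: the function name controller_verdict and the sample recovery times are assumptions, while the 8-second figure is the typical timeout quoted above.

```python
RAID_TIMEOUT_S = 8.0  # typical controller timeout quoted in the article


def controller_verdict(silent_for_s, raid_timeout_s=RAID_TIMEOUT_S):
    """Model how a RAID controller treats a drive that stays silent.

    silent_for_s is how long the drive spends in its own error recovery
    without answering commands (hypothetical input for this sketch).
    """
    if silent_for_s > raid_timeout_s:
        # The controller gives up waiting and assumes the disk has failed.
        return "dropped"
    return "ok"


if __name__ == "__main__":
    # Sample recovery times in seconds, chosen only for illustration.
    for recovery_time_s in (2.0, 7.5, 12.0):
        print(recovery_time_s, controller_verdict(recovery_time_s))
```

The drive in the last case may be perfectly healthy; it is dropped only because it stayed silent past the controller's deadline.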

The consequences of this mis-coordinated error handling are significant. After the drive has been dropped from the RAID volume, the volume runs in degraded mode until a replacement drive is supplied. Once a replacement drive is supplied, the volume must be rebuilt; for a RAID 5 volume, the rebuild uses parity data.

While the RAID volume is running in degraded mode (parity recovery mode), the disks work harder, as they must process both the normal I/O load and the parity I/O. This further increases the likelihood that an error recovery will exceed 8 seconds. The rebuild that follows replacement of the dropped drive is also slow: for large volumes (one to 10 terabytes), it can take hours to days. Until the rebuild completes, the volume is like a car driven without a spare tire: if another drive fails, all data on the volume is lost. The probability of this happening is increased because all drives work harder to handle both the normal I/O load and the parity rebuild I/O load.
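
As a rough feel for “hours to days”, the sketch below estimates rebuild time as one sequential pass over the volume. The function name rebuild_hours and the 50 MB/s sustained rebuild rate are assumptions chosen for illustration; real rebuild rates depend on the controller, the drives and the competing I/O load.

```python
def rebuild_hours(volume_tb, rebuild_rate_mb_s):
    """Estimate rebuild time in hours for one sequential pass.

    volume_tb: amount of data to process, in decimal terabytes.
    rebuild_rate_mb_s: assumed sustained rebuild rate in MB/s
    (a hypothetical figure, not taken from the article).
    """
    total_bytes = volume_tb * 10**12
    seconds = total_bytes / (rebuild_rate_mb_s * 10**6)
    return seconds / 3600


if __name__ == "__main__":
    ASSUMED_RATE_MB_S = 50  # illustrative assumption only
    for volume_tb in (1, 10):
        hours = rebuild_hours(volume_tb, ASSUMED_RATE_MB_S)
        print(f"{volume_tb} TB at {ASSUMED_RATE_MB_S} MB/s: about {hours:.1f} hours")
```

Under this assumed rate the two ends of the range come out at several hours and a couple of days, and the whole of that time is spent without protection against a second drive failure.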

The Answer
Hard drives intended for servers are designed on the assumption that a RAID controller is present and that some coordination of error management must occur. Western Digital has delivered that coordinated error management in the form of “Time Limited Error Recovery” (TLER).

TLER-capable hard drives perform normal error recovery, but if the recovery is still incomplete after seven seconds, the drive issues an error message to the RAID controller and defers the error recovery task until a later time. Because the drive answers before the controller's timeout expires, it is not dropped from the RAID array, thereby avoiding the entire RAID recovery, replacement, rebuild, and return experience.
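
The difference TLER makes can be sketched as a small model. It assumes only the two figures given in the article (a seven-second recovery cap on the drive and a typical 8-second controller timeout); the function names drive_reply and array_outcome and the status strings are hypothetical, and no real drive or controller interface is being described.

```python
TLER_LIMIT_S = 7.0    # recovery cap described in the article
RAID_TIMEOUT_S = 8.0  # typical controller timeout described in the article


def drive_reply(recovery_needed_s, tler_enabled):
    """Return (status, seconds_until_reply) for a command that hits an error."""
    if tler_enabled and recovery_needed_s > TLER_LIMIT_S:
        # TLER: stop at the cap, report the error, defer recovery until later.
        return ("error_reported", TLER_LIMIT_S)
    # Desktop behaviour: stay silent until recovery completes.
    return ("recovered", recovery_needed_s)


def array_outcome(recovery_needed_s, tler_enabled):
    """Describe what happens to the array for one troublesome command."""
    status, reply_after_s = drive_reply(recovery_needed_s, tler_enabled)
    if reply_after_s > RAID_TIMEOUT_S:
        return "drive dropped, array degraded"
    if status == "error_reported":
        return "drive kept, controller serves data from parity"
    return "drive kept, data served normally"


if __name__ == "__main__":
    # A 12-second recovery is an illustrative value, not a measured one.
    for tler_enabled in (False, True):
        print("TLER" if tler_enabled else "desktop", array_outcome(12.0, tler_enabled))
```

The same 12-second recovery leads to a dropped drive in the desktop case and to a kept drive in the TLER case, because the TLER drive always replies inside the controller's window.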

The error handling is further coordinated between the TLER-capable hard drive and the RAID card. The TLER-capable drive will respond without waiting for the error to be resolved. RAID cards are very capable of handling this with a combination of parity protection and journaling. The RAID card flags the error in the error log and will proceed to deliver data using parity protection until the drive retries its own error recovery and corrects the error. This is quite similar to error management proven in SCSI-RAID for many years. It is important to realize TLER hard drives should not be used in non-RAID environments.
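
The parity protection mentioned above can be illustrated with a minimal sketch, assuming a single-parity XOR scheme of the kind used by RAID 5. The article does not describe any controller's internals, so the helper name xor_blocks and the tiny four-byte blocks are assumptions made purely for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives plus one parity block.
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity_block = xor_blocks(data_blocks)

# Suppose the second drive reports an error instead of returning its block.
# The controller can still answer the read from the surviving blocks.
surviving_blocks = [data_blocks[0], data_blocks[2], parity_block]
reconstructed = xor_blocks(surviving_blocks)

print(reconstructed == data_blocks[1])
```

Reconstructing one block means reading every other block in the stripe, which is why the drive must report its error promptly rather than stay silent: the controller can only fall back on parity once it knows the read has failed.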

Conclusion
TLER improves up-time and reduces hard drive error recovery fallout by limiting the time the drive spends in error recovery, and allowing RAID adapters to properly perform their intended function. This provides increased performance, improved availability and lower total cost of ownership in RAID arrays.

Written by Hubbert Smith, director of enterprise marketing – Western Digital.

