What is Time Limited Error Recovery?
Avoiding duplicate error recovery responsibility
By Hubbert Smith, Western Digital | Published: 16:00, 24 November 2004
[We have taken this article from Western Digital as it is a clear exposition of a problem, and a response to it, that can save needless ATA RAID array problems caused by duplicate responsibilities for error recovery]
Meeting the demands of the enterprise environment means finding ways to improve performance, improve compatibility, decrease downtime and reduce total cost of ownership. To that end, Western Digital has introduced Time-Limited Error Recovery (TLER) in its WD Caviar RAID Edition (RE) hard drives to improve coordinated error handling between hard drives and RAID controllers.
Desktop hard drives are designed under the assumption that there is no RAID card. Like all hard drives, they include error correction, such as the ability to handle write errors and reallocate around bad blocks. During error correction, desktop hard drives neither issue error messages nor respond to commands from adapters. Because the design assumes there is no RAID controller to help with error recovery, a desktop drive does everything possible to complete error correction on its own. The difficulty comes when error correction takes longer than 8 seconds: the RAID controller assumes the non-responding disk has failed and drops it from the RAID volume.
ATA drives being “dropped” from a RAID volume was an often-heard complaint, regardless of manufacturer of the hard drive. This error handling “mis-coordination” is encountered when drives are under a high I/O load such as a video surveillance server, a busy e-mail server, or a busy web server. Under high I/O load, the length of time needed to recover increases and the probability of an error recovery exceeding the typical 8-second RAID timeout is significantly increased.
When a drive is under a continuous I/O load and performs its own error recovery, it can easily exceed 8 seconds. During that time, the normal desktop hard drive does not respond. RAID cards will typically wait 8 seconds for a drive to respond, and if the drive does not respond, RAID cards are programmed to take action. The “mis-coordination” of error handling between hard drives and RAID cards occurs when desktop drives are programmed to take responsibility for all error recovery, while RAID cards are also programmed to take responsibility for error recovery.
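The timeout behaviour described above can be sketched as a small model. This is only an illustration: the function name `controller_verdict` and the constant `RAID_TIMEOUT_S` are hypothetical, and the 8-second figure is the typical timeout quoted in the text, not a value taken from any particular RAID card.

```python
# Hypothetical model of a RAID card's response timeout; not real firmware.
RAID_TIMEOUT_S = 8.0  # typical timeout quoted in the article


def controller_verdict(drive_silent_s: float, timeout_s: float = RAID_TIMEOUT_S) -> str:
    """Return what the controller does after the drive has been silent for drive_silent_s."""
    if drive_silent_s > timeout_s:
        # The controller cannot tell a busy drive from a dead one.
        return "drop drive from volume"
    return "keep drive in volume"


# A desktop drive stays silent for the whole of its internal recovery.
for recovery_s in (2.0, 7.5, 12.0):
    print(f"{recovery_s:>5.1f} s of recovery -> {controller_verdict(recovery_s)}")
```

In this model the drive that needs 12 seconds is dropped even though it may be perfectly healthy and simply busy recovering.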
The consequences of this mis-coordinated error handling are significant. After the drive has been dropped from the RAID volume, the RAID volume runs in degraded mode until a replacement drive is supplied. Once a replacement drive is supplied, the volume (assuming it is configured as RAID 5) must be rebuilt from parity data.
While the RAID volume is running in degraded mode (parity recovery mode), the disks work harder, as they must process both the normal I/O load and the parity I/O. This further increases the likelihood that an error recovery will exceed 8 seconds. Also, once the dropped drive is replaced, the RAID volume must be rebuilt. For large volumes (one to 10 terabytes), this rebuild process can take hours to days. Until the rebuild completes, the volume is like a car being driven without a spare tire: if another drive fails, all data on the volume is lost. The probability of this happening is increased when all drives work harder to handle both the normal I/O load and the parity rebuild I/O load.
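To make "rebuilt from parity data" concrete, here is a minimal sketch of the single-parity reconstruction that RAID 5 relies on. The block contents and the helper names `xor_blocks` and `rebuild_missing` are made up for illustration, and real controllers also rotate parity across drives and work on full stripes, which this toy version ignores.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes]) -> bytes:
    """Recover the one missing block (data or parity) from all the survivors."""
    return xor_blocks(surviving)


# Three hypothetical data blocks, one per drive, plus their parity block.
data = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = xor_blocks(data)

# Drive 1 is dropped: its block is recomputed from the other drives plus parity.
recovered = rebuild_missing([data[0], data[2], parity])
assert recovered == data[1]
```

Every recovered block requires reading the matching block from every surviving drive, which is why degraded operation and rebuilds put extra load on all the remaining disks.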
Hard drives designed for servers are designed with the assumption there is a RAID controller present and some coordination of error management must occur. Western Digital has delivered that coordinated error management in the form of “Time Limited Error Recovery” (TLER).
A TLER-capable hard drive performs the normal error recovery, but if recovery has not completed after seven seconds, the drive issues an error message to the RAID controller and defers the error recovery task until a later time. With coordinated error handling, the hard drive is not dropped from the RAID array, thereby avoiding the entire RAID recovery, replacement, rebuild, and return experience.
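The drive-side difference can be sketched in the same spirit. This is a hypothetical model, not Western Digital firmware logic: the function `drive_response` and the constant `TLER_LIMIT_S` are invented for illustration, and only the seven-second limit comes from the text.

```python
# Hypothetical model of drive-side behaviour; names and structure are illustrative.
TLER_LIMIT_S = 7.0  # recovery time limit quoted in the article


def drive_response(recovery_needed_s: float, tler: bool) -> tuple[float, str]:
    """Return (seconds until the drive answers, what it reports)."""
    if tler and recovery_needed_s > TLER_LIMIT_S:
        # Stop at the limit, report the error, and defer recovery until later.
        return TLER_LIMIT_S, "error reported, recovery deferred"
    # Otherwise the drive stays silent until its own recovery completes.
    return recovery_needed_s, "recovered internally"


for tler in (False, True):
    answered_after, status = drive_response(12.0, tler)
    print(f"TLER={tler}: answers after {answered_after:.0f} s ({status})")
```

Because the drive in this model answers at seven seconds, its response arrives inside the typical 8-second controller timeout, so the controller has no reason to treat it as failed.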
The error handling is further coordinated between the TLER-capable hard drive and the RAID card. The TLER-capable drive will respond without waiting for the error to be resolved. RAID cards are very capable of handling this with a combination of parity protection and journaling. The RAID card flags the error in the error log and will proceed to deliver data using parity protection until the drive retries its own error recovery and corrects the error. This is quite similar to error management proven in SCSI-RAID for many years. It is important to realize TLER hard drives should not be used in non-RAID environments.
TLER improves uptime and reduces hard drive error recovery fallout by limiting the time the drive spends in error recovery and allowing RAID adapters to properly perform their intended function. This provides increased performance, improved availability and lower total cost of ownership in RAID arrays.
Written by Hubbert Smith, director of enterprise marketing – Western Digital.