NHS SAN failure
Why did 80 NHS trusts lose access to SAN data?
The principal CSC Alliance members include Hedra, a public sector change management specialist; iSOFT, whose application suite forms the core of the alliance’s software solution; and SCC, which will provide both infrastructure and desktop management services. There have been recent problems with iSOFT.
The NHS contract is worth almost a billion pounds over ten years. At that price, a data access failure simply should not happen.
What events took place?
There was a power failure. It happened while an engineering team was working on problems with the uninterruptible power supplies (UPS) at the centre on Sunday, July 30th. Ironically, their work was interrupted by a fluctuating power supply, and a spike knocked out server computers in the data centre. The SAN was shut down as a preventative measure. Then, amazingly, the backup systems could not be accessed.
This happened at 10am on the Sunday and cut off the 80 NHS Trusts from their data.
A CSC representative said about the affected NHS Trusts: "They've gone into disaster recovery mode." Well, yes, obviously. But they shouldn't have to. When you sign a billion-pound outsourcing contract you expect, at the very least, access to your data. You also expect, in fact you demand, that if a data centre is knocked out for some reason then a backup facility comes online instantly.
Neither happened. CSC Alliance fouled up and let the NHS and other customers down. To compound the fault it then kept the Trusts offline for two to four days.
Private clients of CSC were up and running again quite quickly. The NHS Trust customers suffered up to four days of interrupted data access. Something about the NHS Trusts' system setup evidently made it particularly susceptible to what happened.
CSC Alliance called in HDS engineers to help diagnose and fix the problems with the SAN. An HDS statement reads:
"On Sunday 30 July 2006 a power failure occurred at the CSC computer centre running Hitachi storage systems which support NHS patient records. As a result of the power failure, the storage systems were temporarily affected in the West Midlands and North West of the UK. Hitachi Data Systems immediately responded with technical engineers from the UK, EMEA and the Hitachi factory. The systems have now been restored to the users. No patient data was lost. However, during the period of time when the systems were affected, users had to use a manual backup system.
"We would like to stress the situation at CSC is highly unusual. Our storage systems are designed to protect critical customer data in the event of any planned or unplanned downtime and the Hitachi storage systems at CSC were restored with all data intact. ... The exact cause of the storage devices becoming temporarily unavailable is part of an in-depth investigation."
In other words, neither HDS nor CSC Alliance knows what happened. Obviously, in such a situation, they can't prevent it from happening again. The disaster recovery testing was clearly flawed and CSC's business continuity planning inadequate.
The HDS statement says that its disks didn't lose data. But that was only one third of the job of this SAN. The system was designed to fail over if a disaster happened. It didn't. Furthermore, data access, something else the HDS arrays were obviously designed for, was interrupted significantly because the SAN could not be brought back online. One out of three is not a complete failure but, in a billion-pound contract, it is simply not acceptable.
Recovery was slowed by CSC having to test each hard drive before bringing it back online.
The situation at the moment is that the NHS is paying CSC a billion pounds to provide continual and reliable access to patient admission data on a SAN, and it isn't getting what it has paid for.