The best way to maximise server uptime
The elusive goal of keeping things going
By John Edwards | Computerworld US | Published: 09:20, 11 November 2010
Most managers agree that carefully planning all server-related work, from acquisition to management to replacement, is a key step in guaranteeing system reliability.
Raoul Gabiam, IT operations and engineering manager at George Washington University, says lifecycle management is an integral part of server uptime planning in his shop. "Knowing when and how to replace hardware and upgrade software is important, as it affects performance, sustainability and overall uptime," he says.
For example, if you have to perform a software upgrade, understanding the hardware requirements and the state of your current hardware is critical. You may want to buy the hardware as part of the software upgrade to ensure that requirements are met and to avoid further outages, or perform one before the other to minimise the number of changes, Gabiam explains.
Gabiam is also a strong believer in standardisation and coordination as a way of ensuring reliable server operation. "Before anybody installs anything or makes a change, there has to be a change management process," he says.
Change management means knowing "how everything is configured and stood up, and [evaluating] the changes before they're implemented," Gabiam says. "That way, you'll always know how things are supposed to be and how things will interact."
He says the discipline of change management makes it possible to predict how servers will react when configured in certain ways or if placed into a new environment.
Paul Franko, chief technology officer at Online Resources, a company that provides transaction services to financial institutions, says attitude plays a big role, too. He says he makes an extra effort to ensure that routine yet critical server-related tasks are taken seriously and addressed promptly.
"We've put in a system of checks and balances to make sure that our policies are being followed," he says. According to Franko, having managers routinely examine staff members' administrative work along with double checking in other ways helps minimise the impact of human error. "People make mistakes, and if you don't have multiple points of verification then things are going to slip through the cracks," he explains.
Practice preventive maintenance
Routine preventive maintenance is perhaps the easiest and least painful way of bolstering server reliability. "Your uptime is only as high as the weakest component in the delivery chain," Beddoe says. Performing a variety of essential tasks, such as updating system software, providing conditioned power and ensuring adequate cooling, can go a long way toward creating a data centre full of happy servers without breaking the budget or distracting staff members from other vital tasks.
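The weakest-link remark can be made concrete with a little arithmetic. The sketch below assumes the components of the delivery chain fail independently and that all of them must be up for the service to be up, in which case the overall availability is the product of the individual figures and can never exceed the lowest one. The component names and availability numbers are invented for illustration and do not come from anyone quoted here.

```python
from math import prod

def chain_availability(components):
    """Availability of a serial chain, assuming independent failures."""
    return prod(components.values())

def weakest_link(components):
    """Name of the component with the lowest availability."""
    return min(components, key=components.get)

# Hypothetical availability figures for each link in the chain.
chain = {
    "power": 0.9999,
    "cooling": 0.9995,
    "network": 0.999,
    "server": 0.998,
}

overall = chain_availability(chain)
# Expected downtime per year implied by the overall figure.
hours_down_per_year = (1 - overall) * 24 * 365

print(f"weakest link: {weakest_link(chain)}")
print(f"overall availability: {overall:.4%}")
print(f"expected downtime: {hours_down_per_year:.1f} hours per year")
```

Under these assumptions, improving any component other than the weakest one yields only a marginal gain, which is one way of reading the case for spreading maintenance effort across the whole chain.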
To ensure that all necessary work is performed when required, server maintenance tasks should be identified and organised into a schedule, says Franko. "There are certain things that need to go [into place] straight away, like security updates, and there are other things that make sense to batch up and apply at regular intervals." This second category includes software updates with non-critical functionality improvements, for example.
Franko adds that maintenance work should be handled in such a way that the practice itself doesn't steal server uptime. "We don't take the system down for doing certain types of maintenance activities; we strive for that, anyway," he says.
When it's essential to pull down a server for maintenance, Franko's team schedules the work for an overnight or weekend time frame when user demand is low. The only legitimate reason for pulling a functional server down during regular business hours would be the installation of a critical software update, such as the application of a zero-day security patch.
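The scheduling policy described above can be sketched as a simple rule. This is a minimal illustration, not Franko's actual system: the business hours (08:00 to 18:00, Monday to Friday), the function names and the whole-hour granularity are all assumptions, and batching at regular intervals is simplified to "the next low-demand slot".

```python
from datetime import datetime, timedelta

# Assumed business hours; adjust to the organisation's own demand pattern.
BUSINESS_START = 8
BUSINESS_END = 18

def in_low_demand_window(when):
    """True at weekends or outside the assumed weekday business hours."""
    if when.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (BUSINESS_START <= when.hour < BUSINESS_END)

def next_low_demand_slot(now):
    """Earliest whole hour at or after `now` that falls in a low-demand window."""
    slot = now.replace(minute=0, second=0, microsecond=0)
    if slot < now:
        slot += timedelta(hours=1)
    while not in_low_demand_window(slot):
        slot += timedelta(hours=1)
    return slot

def schedule(is_critical, needs_downtime, now):
    """Critical fixes and no-downtime work go ahead now; the rest waits."""
    if is_critical or not needs_downtime:
        return now
    return next_low_demand_slot(now)

if __name__ == "__main__":
    tuesday_morning = datetime(2010, 11, 9, 10, 30)
    print("zero-day patch:", schedule(True, True, tuesday_morning))
    print("routine update:", schedule(False, True, tuesday_morning))
```

A real scheduler would also need to account for holidays, time zones and dependencies between servers, none of which are modelled here.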