Top tips for troubleshooting Fibre Channel networks

Ten habits of highly effective SAN admins

Troubleshooting Fibre Channel networks can be as much an art as it is a science, but there are some basic best practices you can follow to reduce the guessing and speed resolution. Here are ten tips to help you get to the bottom of pesky problems:

1. Generally, problems are reported by the application user. As a first step, the SAN admin will usually gather dumps, logs and traces. At the same time, the admin will sometimes remove less critical users or applications, stop backups, or clear other potential bottlenecks. While this may fix the immediate problem, it often prevents the underlying cause from being discovered. If you've only removed the symptom and you stop there, you're likely to see trouble later on.

2. Use real-time monitoring. Ask your vendors what they mean by "real time": a five-minute polling interval is not real time. If a fire starts in your kitchen, would you like to be alerted to it immediately or in five minutes?

Use the real-time alerting subsystem to get in front of issues before the application users feel the pain. We recently saw an example where we examined the I/O history leading up to an application outage and found plenty of obvious pointers four hours before the outage. If best-practice alerting had been set up, it's likely the outage could have been avoided.
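
As a rough illustration of the alerting idea in tip 2, here is a minimal Python sketch. It assumes latency samples arrive about once per second as (timestamp, milliseconds) pairs; the function name, the 20 ms threshold and the three-sample window are hypothetical choices for the example, not any vendor's API or a recommended setting.

```python
from collections import deque

def sustained_alerts(samples, threshold_ms=20.0, window=3):
    """Yield the timestamp at which latency has stayed above threshold_ms
    for `window` consecutive samples.

    `samples` is an iterable of (timestamp_seconds, latency_ms) pairs.
    The data format and default values are assumptions for illustration.
    """
    recent = deque(maxlen=window)   # rolling record of threshold breaches
    alerting = False                # avoid repeating the same alert
    for ts, latency in samples:
        recent.append(latency > threshold_ms)
        breach = len(recent) == window and all(recent)
        if breach and not alerting:
            yield ts
        alerting = breach

# Hypothetical one-second samples: a brief rise that lasts four seconds.
samples = [(0, 5.0), (1, 25.0), (2, 30.0), (3, 28.0), (4, 31.0), (5, 6.0)]
alerts = list(sustained_alerts(samples))
print(alerts)
```

With one-second samples the alert fires a few seconds into the event. With a five-minute polling interval, the same event could come and go between two polls.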

3. One of the first steps is to determine if the user-reported problem correlates with what's happening on the SAN. But if you only investigate what the user is reporting, you may miss larger issues that may affect other, slightly less latency sensitive apps. It's useful to broaden the scope beyond just the immediate issue.

4. Having said that, you should customise existing, canned reports to quickly focus on the suspected application or infrastructure to isolate the condition. We recently talked with a customer who quickly eliminated about 4,380 out of 4,400 SAN links, enabling them to focus on the remaining 20 links for in-depth trace analysis.
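
The narrowing-down step in tip 4 can be sketched as a simple filter. This is only an illustration: the record fields (`port`, `apps`, `crc_errors`) and the sample values are hypothetical and stand in for whatever your reporting tool exports.

```python
def suspect_links(links, app=None, min_errors=1):
    """Return only the links that carry the suspected application and
    show at least `min_errors` CRC errors.

    Each link is a dict with hypothetical keys: 'port', 'apps', 'crc_errors'.
    """
    return [
        link for link in links
        if (app is None or app in link["apps"])
        and link["crc_errors"] >= min_errors
    ]

# Hypothetical inventory of three links.
links = [
    {"port": "sw1/0", "apps": {"billing"}, "crc_errors": 0},
    {"port": "sw1/1", "apps": {"billing"}, "crc_errors": 12},
    {"port": "sw2/4", "apps": {"backup"}, "crc_errors": 3},
]
suspects = suspect_links(links, app="billing")
print([link["port"] for link in suspects])
```

Only the links that survive the filter need in-depth trace analysis.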

5. Review environment inventories by device type and properties automatically discovered. Such things as manufacturer and link rate can be helpful in understanding special circumstances, such as the behaviour of a tape device or configuration settings that the admin might not be aware of, like links set to run at 1G instead of 4G. Enable users to provide their own context about devices such as applications they support, location, version, relationship to other equipment, etc.
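
One check that tip 5 suggests is finding links that run below the speed the hardware supports. The sketch below assumes an inventory export with a maximum and a negotiated link rate per port; the `Port` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Port:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    vendor: str
    max_gbps: int          # fastest rate the port supports
    negotiated_gbps: int   # rate the link is actually running at

def underspeed_ports(ports):
    """Return ports whose negotiated rate is below their supported maximum."""
    return [p for p in ports if p.negotiated_gbps < p.max_gbps]

inventory = [
    Port("sw1/0", "VendorA", max_gbps=4, negotiated_gbps=4),
    Port("sw1/1", "VendorA", max_gbps=4, negotiated_gbps=1),
    Port("tape0", "VendorB", max_gbps=2, negotiated_gbps=2),
]
slow = underspeed_ports(inventory)
print([p.name for p in slow])
```

A port in this list is not necessarily faulty: it may be deliberately fixed at a lower rate for an older device, which is exactly the kind of context users can add to the inventory.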

6. As they isolate, correlate and analyse, our customers often report that, the majority of the time they troubleshoot, they find that the SAN is not to blame. Tools that report on the effect of SAN latency alone on the application are very helpful in determining this. Tools that lump SAN and server latency together can't help with this.

7. Time correlation is critical to determine cause and effect. When you are looking at long time windows, you often can't tell which event preceded another, and that's when you get finger-pointing from one vendor to another. Try to find the finest granularity in your historical reporting. Even a one-minute interval is often not too fine.
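
To see why granularity matters in tip 7, consider two events that a report places in time buckets. If both fall in the same bucket, the report cannot say which came first. The following sketch uses hypothetical event times in seconds and hypothetical function names.

```python
def bucket(ts_seconds, interval_seconds):
    """Index of the reporting interval that a timestamp falls into."""
    return ts_seconds // interval_seconds

def can_order(t_a, t_b, interval_seconds):
    """True if the two events land in different reporting intervals,
    so the report can show which one came first."""
    return bucket(t_a, interval_seconds) != bucket(t_b, interval_seconds)

# Hypothetical events: a link reset at 65 s, a latency spike at 130 s.
link_reset, latency_spike = 65, 130

five_minute = can_order(link_reset, latency_spike, 300)
one_minute = can_order(link_reset, latency_spike, 60)
print(five_minute, one_minute)
```

At a five-minute interval the two events look simultaneous; at a one-minute interval the link reset clearly precedes the spike. Events closer together than a minute would need an even finer interval.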

8. Look at your historical I/O patterns, busy times of day, multipath configurations, queue depth settings, top talkers, etc. to build a profile of behaviour. Then compare it to your healthy baseline, and rule out things that haven't changed. You might find six things that appear to be going wrong, but if only one of those things seems to have occurred when the problem was reported, you can focus on that issue immediately. Later on, you can go back to look at the other issues.
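
The baseline comparison in tip 8 can be sketched as a diff between two sets of metrics. The metric names, the sample values and the 20 per cent tolerance below are hypothetical; choose whatever fits your environment.

```python
def changed_metrics(baseline, current, tolerance=0.2):
    """Return metrics whose current value differs from the healthy baseline
    by more than `tolerance` (as a fraction of the baseline value).

    Metrics missing from `current` are skipped. A baseline of zero counts
    as changed whenever the current value is non-zero.
    """
    changed = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue
        if base == 0:
            differs = now != 0
        else:
            differs = abs(now - base) / abs(base) > tolerance
        if differs:
            changed[name] = (base, now)
    return changed

# Hypothetical healthy baseline and current readings.
baseline = {"queue_depth": 32, "read_mb_s": 400.0, "crc_errors": 0}
current = {"queue_depth": 32, "read_mb_s": 410.0, "crc_errors": 7}
diff = changed_metrics(baseline, current)
print(diff)
```

Anything that does not appear in the result has not moved meaningfully from the baseline and can be ruled out first.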

9. When changes are made to fix the incident, you should get immediate feedback on whether they are having the desired effect. Sometimes a fix can make a problem worse, so it's good to know that as well. Without immediate feedback, you often have to delay or stagger fixes until you can determine the effect of each one. Or, if you make all the changes at the same time, you can be left wondering which change fixed the problem. Ongoing real-time monitoring can provide confidence that the problem was in fact solved.

10. Last, ask for help sooner rather than later. We've heard of problems dragging on for months, vendors kicked out of accounts and literally millions of dollars wasted on adding expensive hardware. Although there are things you can do yourself to speed troubleshooting and even prevent future problems, look at the cost of waiting. Balance that against the cost of bringing in a performance pro, an expert consultant who spends all day, every day specialising in finding performance problems.





