Just in case there's another Amazon outage, here are four tips

You never know when the next natural disaster might hit...


It happened again: Amazon Web Services (AWS) suffered its latest outage in late June. Now that the dust has settled, customers are reassessing what lessons can be learned and how to prepare for the inevitable next one.

Unlike AWS's major outage last summer, which was caused by human error and resulted in an overloaded network, the most recent incident began when an electrical storm caused a power outage in an AWS Virginia data center. While the actual power outage only lasted about 20 minutes, the domino effect of a backup generator not kicking in, combined with software bugs AWS had not seen before, took down about 7 percent of customers in the affected area, some for as much as three hours on the evening of Friday, 29 June.

As storms ripped through the mid-Atlantic coast that Friday night and into Saturday morning, parts of sites such as Netflix, Pinterest and Instagram went down. But it didn't have to be that way. Software startup Newvem tracks AWS customer usage, and its officials say misconfigurations by customers exacerbated the problem that night. Newvem and Netflix have four suggestions for how the latest outage could have been mitigated and how to prepare for future incidents.

1. Use snapshots

Backing up data is critical to ensuring high availability, and AWS gives customers the option of backing up their Elastic Block Store (EBS) volumes with a "snapshot." EBS, a block storage service, was among the services impacted during the latest outage. An EBS snapshot makes a copy of the EBS volume and stores it in Amazon's accompanying Simple Storage Service (S3) offering. Users have to initially back up their entire EBS volume to S3, but after that, whenever the content of the volume changes, only the new data has to be captured in another snapshot for the volume to be recreated. Of Newvem's more than 500 customers, 45 percent of users who have large AWS clouds, meaning those with more than 101 instances, did not have effective EBS snapshots.
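The full-then-incremental idea can be illustrated with a small Python sketch. This is a toy model, not the EBS API: the block-indexed dictionaries and the `take_snapshot` and `restore` helpers are hypothetical names invented for illustration, and deleted blocks are not handled.

```python
from typing import Dict, List

Volume = Dict[int, bytes]    # block index -> block contents (toy model of a volume)
Snapshot = Dict[int, bytes]  # only the blocks captured by one snapshot


def restore(snapshots: List[Snapshot]) -> Volume:
    """Rebuild a volume by layering snapshots from oldest to newest."""
    volume: Volume = {}
    for snap in snapshots:
        volume.update(snap)
    return volume


def take_snapshot(volume: Volume, previous: List[Snapshot]) -> Snapshot:
    """Capture only the blocks that differ from what earlier snapshots describe."""
    known = restore(previous)
    return {i: data for i, data in volume.items() if known.get(i) != data}


volume = {0: b"boot", 1: b"logs", 2: b"data"}
history = [take_snapshot(volume, [])]           # first snapshot copies every block

volume[2] = b"data-v2"                          # one block changes afterwards
history.append(take_snapshot(volume, history))  # second snapshot stores that block only

restored = restore(history)                     # layering both recreates the volume
```

In this model the first snapshot holds all three blocks and the second holds one, yet layering the two is enough to recreate the current volume.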

2. Ensure correct ELB configurations

One of the advantages of using Elastic Load Balancers (ELBs) is they can automatically reroute traffic based on availability and need. But Newvem found that up to 20 percent of heavy users aren't properly configuring their ELBs either. One of the most common misconfigurations is to reroute ELB traffic within the same availability zone (AZ). AWS has multiple availability zones within its regions, which are meant to be isolated from one another. By not configuring the ELB to route traffic to a separate AZ, users aren't protected if their AZ is impacted, Newvem says.
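As a rough illustration of the misconfiguration Newvem describes, the following Python sketch flags load balancers whose registered instances all sit in a single availability zone. The inventory dictionary, the load balancer names and the `single_az_load_balancers` helper are assumptions made for the example; in practice the inventory would come from whatever tooling describes your AWS account.

```python
from typing import Dict, List


def single_az_load_balancers(backends: Dict[str, List[str]]) -> List[str]:
    """Return the names of load balancers whose backends span fewer than two AZs."""
    return sorted(name for name, zones in backends.items() if len(set(zones)) < 2)


# Hypothetical inventory: load balancer name -> AZ of each registered instance.
inventory = {
    "web-lb": ["us-east-1a", "us-east-1a", "us-east-1a"],  # everything in one AZ
    "api-lb": ["us-east-1a", "us-east-1b"],                # spread across two AZs
}

at_risk = single_az_load_balancers(inventory)
```

Here only `web-lb` is flagged: if its one zone goes down, the load balancer has nowhere else to send traffic.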

3. Test, test, test

One of the bigger names that went down during the latest AWS outage was Netflix, which during the past few years has migrated much of the company's video streaming services to the AWS cloud. During the latest outage, the site had selective service disruptions between 8 and 11 pm PDT on the Friday night of the outage.

In response, Netflix wrote a blog post outlining changes it will make to prepare for an AWS disruption, and one area it hopes to ramp up is testing. Netflix already has "Chaos Monkey," which simulates an outage of random instances within the Netflix AWS cloud. But that's apparently not good enough. The company is now developing a "Chaos Gorilla," which will simulate an entire availability zone going down to ensure the system can automatically handle the situation.
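The idea behind a zone-level drill can be sketched in a few lines of Python. This is not Netflix's tooling, whose internals the article does not describe; the fleet layout, zone names and the `zone_outage_drill` helper are hypothetical, and a real drill would terminate or isolate live instances rather than count entries in a dictionary.

```python
import random
from typing import Dict, List, Optional, Tuple


def surviving_capacity(fleet: Dict[str, List[str]], failed_zone: str) -> int:
    """Count the instances left once every instance in failed_zone is gone."""
    return sum(len(ids) for zone, ids in fleet.items() if zone != failed_zone)


def zone_outage_drill(
    fleet: Dict[str, List[str]],
    required: int,
    rng: Optional[random.Random] = None,
) -> Tuple[str, bool]:
    """Fail one randomly chosen zone and report whether enough capacity survives."""
    rng = rng or random.Random()
    failed = rng.choice(sorted(fleet))
    return failed, surviving_capacity(fleet, failed) >= required


# Hypothetical fleet: zone name -> instance identifiers running in that zone.
fleet = {
    "zone-a": ["i-1", "i-2"],
    "zone-b": ["i-3", "i-4"],
    "zone-c": ["i-5", "i-6"],
}

failed_zone, survived = zone_outage_drill(fleet, required=4)
```

With two instances in each of three zones and four required to serve traffic, the drill passes whichever zone is chosen; raising `required` to five makes it fail, which is the kind of gap such a test is meant to expose before a real outage does.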

4. Not just multi-AZ, but multi-region

After last year's outage, AWS officials reminded users that using multiple availability zones is the best way to ensure AWS cloud resiliency. Now, Netflix and Newvem officials agree that instead of a multiple-AZ architecture, spreading workloads across multiple regions, or even across multiple cloud providers, is the best way to ensure high availability. "Using multiple regions is really the new best practices for customers that really require high availability," Newvem CEO Zev Laderman says. Netflix says that as it expands its global footprint to allow streaming of its video content around the world, it will be moving to a multi-region support system as well.

