AWS outages, bugs and bottlenecks explained by Amazon
Never-before-seen software bug caused flood of requests creating a massive backlog in the system
By Brandon Butler | Network World US | Published: 16:25, 03 July 2012
Amazon Web Services says power outages, software bugs and rebooting bottlenecks in the US led to a "significant impact to many customers," last week, according to a detailed post-mortem report the company released today about the service disruption.
As storms raged through the mid-Atlantic on Friday night, AWS experienced power outages that initially impacted the company's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) offerings, but extended into "control plane" services, such as its Elastic Load Balancer, which are designed to shift traffic away from impacted areas of the company's service.
AWS experienced multiple power outages on Friday night, most of which were handled by a backup generator kicking in to supply power. Shortly before 8 p.m. PDT, a backup generator failed to fully kick in after a power outage. The company's "uninterruptable power supply," another backup, was depleted within seven minutes. For 10 minutes beginning at 8:04 p.m. PDT, parts of the impacted data centre did not have power, which brought down the EC2 and EBS services in the impacted area.
As a result, for more than an hour between 8:04 and 9:10 p.m. PDT on Friday, customers were unable to create new EC2 instances or EBS volumes. The "vast majority" of the instances came back online between 11:15 p.m. PDT and just after midnight, AWS says, but that was delayed somewhat because of a bottleneck in the server booting process due to the large number of reboot requests. AWS says removing that bottleneck is an area it will work on to improve recovery from power failures.
Software bugs led to a backlog on the system
AWS breaks its regions up into multiple availability zones (AZs), which are designed to be isolated from failure. Even though the issues on Friday were centered in a single AZ, AWS ran into more trouble when load balancers attempted to switch traffic to unaffected AZs. "As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn't seen before," the company wrote. The bug caused a flood of requests which, combined with EC2 instances coming back online, created a backlog in the system.
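Why a short flood of requests can leave a lasting backlog is easier to see with numbers. The sketch below is a toy queue model, not a description of AWS's internals: the function name, the request rates and the capacity figure are all invented for illustration. It assumes a system that can process a fixed number of requests per time step and simply queues the rest.

```python
def simulate_backlog(arrivals, capacity):
    """Return the number of queued requests after each time step.

    arrivals: requests arriving in each step (hypothetical figures).
    capacity: requests the system can process per step (hypothetical).
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        # Work that cannot be processed in this step carries over to the next.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Three normal steps, a three-step flood, then six steps back at normal load.
normal = [80] * 3
flood = [300] * 3
recovery = [80] * 6

history = simulate_backlog(normal + flood + recovery, capacity=100)
print(history)
# [0, 0, 0, 200, 400, 600, 580, 560, 540, 520, 500, 480]
```

In this toy model the flood lasts three steps, but the queue drains by only 20 requests per step afterwards, so the backlog outlives the event that caused it by a wide margin.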
Meanwhile, the company's cloud-based relational database service suffered from the EBS volumes being out and from another software bug. Customers who had their RDS instances in the impacted AZ had to wait for EBS to be restored, which for most customers was by 11 p.m. PDT. For customers who have their RDS spread across multiple AZs, AWS says a software bug prevented automatic failover to the unaffected AZs for some of them. AWS says it has known about the bug since April and has a mitigation for it, which is in beta and will be rolled out in the coming weeks.
While only a single-digit percentage of customers were impacted by the outage, the scope of AWS's customer base meant the situation impacted a large number of users. Customers such as Netflix, Instagram and Pinterest were among those impacted, including during prime-time Pacific Coast movie watching time for Netflix, which was partially down for portions of the period between 8 and 11 p.m. PDT on Friday.
"Amazon failed, their failover systems failed"
Netflix Cloud Architect Adrian Cockcroft, who has in the past praised AWS for powering the company's operations, filed somewhat of a play-by-play of the outage via his Twitter feed on Friday night and into Saturday. The company, he says, has architected to AWS's specifications and uses multiple AZs. That didn't seem to work on Friday though. On Saturday, Cockcroft tweeted, "We only lost hardware in one zone, we replicate data over three. Problem was traffic routing was broken across all zones."
Shahin Pirooz, CTO and CSO of cloud provider CenterBeam, says AWS certainly shares some blame in this outage. "It seems like they had a house of cards that went down on them," he says. Pirooz says he's surprised so many systems went down at once for AWS. "Amazon failed, their failover systems failed, AWS does own some responsibility in this," he says.
One way to prevent this type of circumstance in the future, he says, is to leverage load balancing, domain name system (DNS) and disaster recovery offerings from third parties that are not AWS. A variety of companies offer such services, including Neustar, Akamai and DynDNS. The "nirvana" situation, he says, would be giving customers the ability to federate services across multiple public cloud providers. That, he predicts, is still five to 10 years away, though, because industry providers do not yet have common, agreed-upon, supportable migration standards.
OpenStack is attempting to create that with its project, but open source competitors like Citrix's Apache CloudStack are coalescing around AWS as the de facto standard.
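Pirooz's suggestion of keeping traffic routing outside AWS can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any of the vendors named above implement their services: the endpoint names are hypothetical, and the health check is a stand-in dictionary rather than a real network probe.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None  # nothing is healthy; the caller decides how to degrade


# Hypothetical endpoints in priority order: two AWS zones, then a deployment
# hosted with a different provider.
ENDPOINTS = [
    "aws-zone-a.example.com",
    "aws-zone-b.example.com",
    "other-cloud.example.com",
]

# Stand-in for real health probes: both AWS endpoints are marked unreachable,
# as if routing were broken across zones.
status = {
    "aws-zone-a.example.com": False,
    "aws-zone-b.example.com": False,
    "other-cloud.example.com": True,
}

chosen = pick_endpoint(ENDPOINTS, lambda endpoint: status.get(endpoint, False))
print(chosen)  # other-cloud.example.com
```

The point of the sketch is where the decision runs: if the code that checks health and selects an endpoint is operated by a third party, a failure in one provider's own routing layer does not also take out the mechanism meant to route around it.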