A Major Outage At AWS Has Caused Chaos At Amazon’s Own Operations, Highlighting Cloud Computing Risks

A lengthy outage at Amazon Web Services (AWS), the cloud computing arm of Amazon, caused chaos today for many companies and millions of users. The mega glitch affected access to a wide range of services, including shows on Netflix and Disney+, web services from airlines such as Delta and Southwest, and payments businesses such as Venmo.

Many of Amazon’s own offerings, including the Ring smart doorbell service, its Alexa virtual assistant and its Amazon Music Service were impacted. The impact was also felt at Amazon’s delivery operations, with delivery drivers reportedly unable to access information via apps.

The outage began this morning around 10.45am Eastern Time, according to Downdetector, which tracks website outages, and stretched into the early evening. In a statement published around 12.30pm, the company said that it was seeing multiple issues at data centers in its U.S.-East-1 region, which it said were caused by “the impairment of several network devices.”

At just after 5pm, AWS said it had “executed a mitigation”, which was producing “a significant recovery in the region”, but that it did not have a timeline for full recovery. Downdetector was still showing plenty of reports of problems with the service coming in.

Cloud concentration

The episode underlines just how dependent businesses have become on the tech giants that deliver cloud computing services—and how dependent those companies have become on their own technology. The pandemic has accelerated the move to the cloud, which enables businesses to spin up new computing capacity fast and to tap into a wide range of services, from AI algorithms to quantum computers, offered by AWS, Microsoft’s Azure and Google Cloud, which dominate the U.S. market for public cloud services.

That has juiced revenues for the companies, with businesses expected to spend over $330 billion on cloud services this year, according to a forecast made earlier this year by Gartner. To win business, AWS and its rivals are racing one another to create more offerings, which in turn is making the management of the infrastructure to support them more complex.

“As feature functionality explodes, they are having to manage it all and you can’t do it manually,” says Doug Madory of Kentik, a company that provides data and analytics on IT networks to businesses. “You have to automate it and it’s very hard to anticipate every possible failure.”

One challenge the cloud giants face is to stay on top of interdependencies that could trigger systems to fail simultaneously. In October, Facebook and its other major services, including Messenger and WhatsApp, went down for over six hours after engineers working on its global backbone, which involves thousands of routers and tens of thousands of miles of fiber-optic cables, accidentally triggered an outage across its data centers.

At the time, Facebook noted that part of the reason tackling the outage took so long was that some of the software tools it needed to fix the problem were unavailable because of the outage, which also shut down automated access to some of its data centers. Engineers were forced to drive to some locations to get them back online.

Reckoning with regions

In its statement this morning, AWS (which did not return repeated requests for comment from Forbes) noted that the incident had hit some of its “monitoring and incident tooling”, which it said had impaired its ability to provide updates. Cloud experts say that cloud companies face a conundrum here. Running such tools on separate networks operated by other companies could avoid this headache, but this would also increase the risk that hackers could penetrate those networks and use the tools to compromise core cloud operations.

Amazon’s outage also raises another issue. Cloud providers run data centers in multiple regions around the world. Companies can pay to run workloads in different regions, so if one goes down another can act as a backup. But AWS’s U.S.-East-1 region is especially popular given the concentration of businesses on the U.S. East Coast, so any glitches affecting it have substantial impact.

CIOs may need to think about paying up for failover plans, if they aren’t doing so already. They may also want to spread risk across multiple clouds and consider other contingency plans. “IT and application teams have multiple tools at their disposal,” said Kris Beevers, the CEO of NS1, which helps companies manage and deliver software applications. “It’s critical for them to do the work upfront to prepare playbooks and levers to manage against these kinds of events.”
