Amazon says AWS cloud service back to normal after outage disrupts businesses worldwide

Disruption knocked workers from London to Tokyo offline and halted others from conducting normal tasks

Users on Monday afternoon complained of lingering difficulties using services such as the digital wallet Venmo and video calling site Zoom. (Anushree Fadnavis)

Amazon.com’s cloud service returned to normal operations on Monday afternoon, the company said, after an internet outage that caused global turmoil among thousands of sites, including some of the web’s most popular apps such as Snapchat and Reddit.

However, Amazon said some AWS services had a backlog of messages that would take a few hours to process.

AWS hosts applications and computer processes for companies around the world, and the disruption knocked workers from London to Tokyo offline and halted others from conducting normal tasks such as paying hairdressers and changing their airline tickets. Users on Monday afternoon had complained of lingering difficulties using services such as the digital wallet Venmo and video calling site Zoom.

It was the largest internet disruption since last year’s CrowdStrike malfunction hobbled technology systems in hospitals, banks and airports, highlighting the vulnerability of the world’s interconnected technologies.

It was at least the third time in five years that AWS’s northern Virginia cluster, known as US-EAST-1, contributed to a major internet meltdown.

Amazon did not address a request for more clarity about why the data centre keeps being affected. The problems stemmed from a fault in what is known as the domain name system, or DNS, which prevented applications from finding the correct address for the API of AWS’s DynamoDB, a cloud database relied on to store user information and other critical data.
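
To make the DNS failure concrete, here is a minimal Python sketch of the lookup step a client performs before it can connect to a service. The function name `resolve_addresses` and its injectable `lookup` parameter are illustrative assumptions, not part of any AWS tooling, and the sketch says nothing about how AWS's own systems are built.

```python
import socket


def resolve_addresses(hostname, port=443, lookup=socket.getaddrinfo):
    """Return the IP addresses DNS gives for hostname, or [] if the lookup fails.

    `lookup` defaults to the standard library resolver; it is a parameter
    only so the behaviour can be demonstrated without a network (assumption
    made for this sketch).
    """
    try:
        results = lookup(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed: the client has no address to connect to,
        # even if the servers behind the name are running normally.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in results})
```

On a working network, `resolve_addresses("dynamodb.us-east-1.amazonaws.com")` would be expected to return one or more IP addresses. When resolution fails, the function returns an empty list, which is the position applications were left in: the database may be up, but there is no address to send requests to.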

ROOT CAUSE IS NETWORK HEALTH MONITOR

Earlier, AWS said the root cause of the outage was an underlying subsystem that monitors the health of its network load balancers, which are used to distribute traffic across several servers.

The issue, AWS said, originated within the “EC2 internal network”. EC2, Amazon’s “elastic compute cloud” service, provides on-demand cloud capacity within AWS.

Shortly after 3pm PT Amazon said: “All AWS services returned to normal operations. Some services such as AWS Config, Redshift and Connect continue to have a backlog of messages they will finish processing over the next few hours.”

Ken Birman, a computer science professor at Cornell University, said software developers need to build better fault tolerance. He said AWS provides tools developers can use to protect themselves in the event of a problem at any one of its sprawling network of data centres, and developers can also create backups with other cloud providers.

“When people cut costs and cut corners to try to get an application up and forget they skipped that last step and didn’t really protect against an outage, the companies are the ones who ought to be scrutinised later,” Birman told Reuters.
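
The kind of protection Birman describes can be sketched in a few lines of Python: try the primary location first and fall back to a backup if it cannot be reached. This is a simplified illustration under stated assumptions, not an AWS tool or a recommended production design; the function `call_with_failover`, the `request` callable and the backup region name in the usage note are all hypothetical.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order; return (endpoint, result) for the first success.

    `endpoints` is an ordered list of locations (primary first) and
    `request` is any callable that takes one endpoint and either returns
    a result or raises ConnectionError. Both are assumptions of this sketch.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except ConnectionError as exc:
            # Record the failure and move on to the next location.
            errors.append(f"{endpoint}: {exc}")
    # Every location failed, so surface all the errors together.
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

An application might call this with a list such as `["us-east-1", "us-west-2"]`, so that a fault in the first region sends the request to the second. Real deployments also have to keep data in sync between locations, which is the costly step Birman suggests some companies skip.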

ISSUE ORIGINATED FROM AWS SITE KNOWN FOR PREVIOUS OUTAGES

AWS provides computing power, data storage and other digital services to companies, governments and individuals and is the world’s largest cloud provider, followed by Microsoft’s Azure and Alphabet’s Google Cloud.

Disruptions to its servers can cause outages across websites and platforms, ranging from food delivery apps to gaming platforms and airline systems, that rely on its cloud infrastructure.

AWS said on its status page Monday’s outage originated at its US-East-1 location, its oldest and largest for web services. The site suffered outages in 2020 and 2021. According to documentation on the AWS website, the US-East-1 site is often the default region for many AWS services.

‘FRAGILE INFRASTRUCTURES’

The problem highlights how interconnected digital services have become and their reliance on a small number of global cloud providers, with one glitch wreaking havoc on business and day-to-day life, experts and academics said.

“The outage again highlights the dependency we have on relatively fragile infrastructures,” said Jake Moore, global cybersecurity advisor at European cybersecurity firm ESET.

In Britain, Lloyds Bank, Bank of Scotland and telecom service providers Vodafone and BT were hit, according to Downdetector’s UK website, as was the website of HMRC, the UK’s tax, payments and customs authority.

“The main reason for the issue is all the big companies have relied on only one service,” said Nishanth Sastry, director of research at the University of Surrey’s department of computer science.

Ookla, which owns Downdetector, said more than 4-million users reported issues due to the incident.

“For major businesses, hours of cloud downtime translate to millions in lost productivity and revenue,” said Ryan Griffin, US cyber practice leader at insurance broker McGill and Partners.

Wall Street was largely unfazed, sending Amazon shares 1.6% higher to $216.48 (R3,739.32).

Ookla said at least 1,000 companies were affected by the outage. These included:

  • apps Reddit, Roblox, Snapchat and Duolingo;
  • artificial intelligence startup Perplexity;
  • cryptocurrency exchange Coinbase;
  • trading app Robinhood;
  • Amazon’s own services, including its shopping website, Prime Video and Alexa;
  • gaming platforms Fortnite, owned by Epic Games, Clash Royale and Clash of Clans; and
  • Uber rival Lyft in the US.

In a post on X, Signal president Meredith Whittaker confirmed the messaging app was hit by the outage, though billionaire Elon Musk, who owns X, said his platform continued to work.

Amazon remains the global cloud leader in terms of revenue.

Reuters

