
Liv McMahon, technology reporter, and
Lily Jamali, North American technology correspondent
Amazon Web Services (AWS) said late Monday that it has resolved a massive outage that took some of the world’s largest websites offline for an entire day.
More than 1,000 apps and websites, including social media platforms such as Snapchat and banks such as Lloyds and Halifax, were affected by the problem, which Amazon said was central to the cloud computing giant’s operations in the country.
Platform outage monitor DownDetector said Monday that user reports of problems during the outage surged to more than 11 million globally.
Even after Amazon fixed the underlying problems, experts said the outage exposed the dangers of too many companies becoming dependent on a single dominant supplier.
Professor Alan Woodward, from the University of Surrey, said: “What this episode highlights is how interdependent our infrastructure is.”
“Many online services rely on third parties for their physical infrastructure, which shows that problems can arise even with the largest third-party providers.
“Small errors, often made by humans, can have far-reaching and serious impacts.”
The problems appear to have started at 07:00 BST on Monday, when users began reporting difficulty accessing a number of platforms.
These included a variety of sites and services, from massive online games like Fortnite to language learning app Duolingo.
Earlier on Monday, DownDetector told the BBC that it had received more than 4 million reports from users across 500 sites in just a few hours. This is more than double the number seen throughout a regular weekday.
It later said these peaked at more than 11 million as more services, including Reddit and Lloyds Bank, attempted to recover.
At around 23:00 BST, Amazon said all AWS services had “returned to normal operations”.
But before that point, companies had to adjust parts of their own systems to address the underlying problems.
The initial outage may have been followed by a new series of “cascading errors,” according to Mike Chapple, a professor of information technology at the University of Notre Dame.
“It’s like when there’s a major power outage. Crews start working to get it running again,” Mr. Chapple said. He explained that “the power may blink a few times,” and said it was possible that Amazon was initially only “addressing the symptom” and not the cause.
What went wrong?
Amazon has not yet fully explained what caused Monday’s outage or issued an official statement about it.
An update to the service status webpage said the issue “appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.”
DNS, short for Domain Name System, is often likened to the phone book of the internet.
It effectively converts the website names people use (e.g. bbc.co.uk) into the numeric addresses that computers can read and understand.
This process fundamentally underpins the way we use the internet, and if it is disrupted, your web browser may not be able to find the content you are looking for.
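For readers who want to see the lookup step in practice, the short Python sketch below asks the operating system's resolver to turn a name into addresses. It is only an illustration of the general idea: the helper name `resolve` and the port number are assumptions made for this example, and it does not reproduce how AWS or DynamoDB handle DNS internally.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the IP addresses the system resolver reports for a hostname."""
    try:
        # Port 443 (HTTPS) is an arbitrary choice for this illustration.
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The lookup failed: without an address, a browser or app
        # has nowhere to send its request.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in infos})


# "localhost" resolves without network access on most machines.
print(resolve("localhost"))
# The ".invalid" domain is reserved and is never expected to resolve.
print(resolve("name-that-does-not-exist.invalid"))
```

Calling `resolve("bbc.co.uk")` on a connected machine would return that site's public addresses. When the lookup fails, as in the second call, the function returns an empty list, which mirrors what software experiences when DNS resolution for an endpoint breaks.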
Cloudflare CEO Matthew Prince told the BBC that the AWS outage highlighted the impact cloud services have on how the internet works.
“Everyone has had a bad day. Amazon had a bad day today, too,” he said.
“The amazing thing about the cloud is that it scales, but when an outage like this happens, it can disrupt many of the services we depend on.”
And Cori Crider, director of the Future of Technology Institute, told the BBC: “It felt like a bridge was collapsing.”
“An essential part of the economy has been shattered,” she said.
And with an estimated 70% or so of cloud computing dependent on Amazon, Microsoft and Google, she said the status quo was “unsustainable.”
“Once you have concentrated supply in the hands of a few monopoly suppliers, when problems like this arise, it takes a huge part of the economy with it,” she said.
“We should try to buy more local services rather than relying on a few US proprietary platforms.
“This is a risk to our security, sovereignty and economy and we must look at structural decoupling to make our markets more resilient to these kinds of shocks,” she said.
One computer science expert says some of the blame lies with the companies that use AWS.
“Companies that use Amazon haven’t taken enough care to build protection systems into their applications,” says Ken Birman, a computer science professor at Cornell University in New York.
Outages like Monday’s occur frequently, though not always on this scale.
Birman told the BBC that app developers must be careful to back up mission-critical applications in the cloud.
“We know how to make these systems more robust and how to do it safely,” says Birman.
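One common way to build the kind of protection Birman describes is to give an application a second place to turn when its main provider fails. The Python sketch below is a hypothetical illustration of that pattern, not a description of any company's actual setup: `call_with_fallback`, `primary_region` and `backup_region` are names invented for this example.

```python
from typing import Callable, Sequence


def call_with_fallback(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order; return the first result that succeeds."""
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real service would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")


def primary_region() -> str:
    # Hypothetical stand-in for a call to a service's main cloud region.
    raise ConnectionError("primary region unreachable")


def backup_region() -> str:
    # Hypothetical stand-in for the same call against a second region or provider.
    return "served from backup"


print(call_with_fallback([primary_region, backup_region]))
```

The trade-off is cost and complexity: keeping a second region or provider ready means paying for it and keeping its data in step with the first.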
Issues of liability may end up in court.
More than a year after the massive CrowdStrike outage, Delta Air Lines is still wrangling with the company to recover more than $500 million in losses.
Even after CrowdStrike fixed the issue, the airline said it had to manually reset 40,000 servers, which caused severe flight delays for several days.
Additional reporting by Esylt Carr.