Amazon explained what triggered the outage in a post-mortem published on its website Thursday.
Amazon.com’s cloud computing unit said that the outage that shook up a sizable part of the internet Tuesday was caused by human error.
The Amazon Web Services division said in a post-mortem published on its website Thursday that its team was working to fix a problem that slowed down the billing system for S3, a widely used AWS service.
Through S3, companies and individuals can store their data on Amazon’s server farms. S3 also houses the data that underpins a wide array of other AWS services, including some computing-processing functions. It works as a basic building block of Amazon’s cloud, which in turn is a major pillar of the modern internet.
To fix the slowdown, engineers in AWS’ Northern Virginia operation, one of the largest clusters of data centers run by the company, needed to take down a small number of servers.
“Unfortunately,” as AWS put it in its lengthy mea culpa, a technician made a mistake when entering a command, taking out more servers than needed — some of which were critical to the functioning of S3 in the entire region. Thousands of users relying on AWS data and computing processes were affected.
AWS says that its system is designed to allow the removal of big chunks of its components “with little or no customer impact.” But restarting the affected servers took longer than expected, AWS says, partly because the S3 service has become gigantic since it launched more than a decade ago.
From failure to complete recovery, the outage lasted slightly more than four hours, although other AWS services that had accumulated a backlog of work during the disruption took longer to recover.
AWS said the outage was prompting it to make some changes: for example, reducing the amount of server capacity that can be removed at one time.
“This will prevent an incorrect input from triggering a similar event in the future,” AWS said.