Amazon Web Services Outage Was Caused by One Typo

Earlier this week we talked a bit about an Amazon Web Services outage that took down a big chunk of the Internet, including some of the most popular sites on the web. The outage lasted about four hours, and when it was first reported we had no idea what caused it. The assumption was that some sort of software issue had occurred during an update.

Amazon has come back and said that it was its Simple Storage Service, or S3, that went down, ultimately leaving about 150,000 websites across the internet unable to call on any objects stored on that cloud service. That means that rather than entire websites going down, the most common issue during the outage was sites showing lots of dead links as content was unable to load.

Amazon has also stepped up and explained the cause of the outage. On Tuesday morning, the day of the outage, Amazon was working on a fix for a problem that was slowing down the S3 billing system. Amazon notes that at 9:37 am Pacific, one of the team members ran a command meant to take a small number of S3 servers offline.

That command had a typo in it that caused a larger number of servers to be taken down than intended, and two of those servers were responsible for important systems for the entire East Coast region. Getting those servers going again takes much longer than a normal computer reboot, according to Amazon.

The rub was that it had been “many years” since Amazon had done a full restart of the main subsystems that went offline. The delay in getting things back up and running was due to the safety checks that had to run to verify that files weren’t corrupted in the unexpected shutdown.

The software tool that allowed the mistake to happen has been rewritten to prevent this sort of thing from happening again. From shutdown to full service restoration took four hours and 17 minutes according to Amazon.
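Amazon hasn’t published the code of the rewritten tool, but the kind of safeguard it describes can be sketched in a few lines: a capacity-removal command that refuses to take the fleet below a minimum level. Everything here (the `MIN_ACTIVE` floor, the `remove_capacity` function, the fleet model) is a hypothetical illustration, not Amazon’s actual implementation.

```python
# Hypothetical sketch of a capacity-removal safeguard. The names and the
# simple in-memory fleet model are invented for illustration only.

MIN_ACTIVE = 10  # assumed smallest fleet size the subsystem can run on


def remove_capacity(active_servers, count):
    """Take `count` servers offline, refusing to drop below MIN_ACTIVE.

    A mistyped `count` that is too large fails loudly instead of silently
    taking down critical capacity.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if len(active_servers) - count < MIN_ACTIVE:
        raise RuntimeError(
            f"refusing to remove {count} servers: fewer than "
            f"{MIN_ACTIVE} would remain active"
        )
    # Return the surviving servers; the removed ones would be drained here.
    return active_servers[: len(active_servers) - count]


fleet = [f"server-{i}" for i in range(100)]
fleet = remove_capacity(fleet, 5)  # fine: 95 servers remain
# remove_capacity(fleet, 90)       # would raise instead of gutting the fleet
```

With a check like this, a typo that turns a small number into a large one produces an error message rather than an outage, which is the general shape of the fix Amazon describes.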