Christmas Eve AWS outage stings Netflix but not Amazon Prime
Latest outage raises more questions about Amazon cloud
"Heroku, if you were a character on South Park, I would call you Kenny. Every time AWS has an episode you're killed. bit.ly/U7pu9w"
- Brian McCallion (@BrianMcCallion), December 25, 2012
As you may be aware, Amazon AWS US-EAST-1 experienced two outages in June that resulted in widespread service interruptions and significant downtime for marquee AWS customers such as Netflix, Pinterest, Instagram, and Heroku.
While some in the Cloud and analyst communities may rationalize the outages in an attempt to “protect Cloud,” my approach is to take a hard line with Amazon and to place the outages squarely in Amazon’s court. In my opinion the outages were avoidable, and Amazon’s datacenters suffered from a design or engineering flaw that resulted in not just one but two outages in June. Beyond the technical reasons for directing the issue to Amazon, my understanding is that effective public relations in the face of serious events means accepting responsibility and working to remedy the issue so it doesn't happen again. For the Cloud to evolve into an enterprise technology, such issues need to be addressed by the providers, and failures should not be rationalized or excused.
I strive for and recommend the “design for failure” approach to Cloud and systems architecture. Yet if failures can be avoided by exercising design for failure at the datacenter level, then I believe Amazon failed to effectively execute the “design for failure” principles espoused by Werner Vogels. In the case of the June 14th and June 29th outages at Amazon US-EAST-1, I believe the first outage could have been prevented if Amazon’s datacenter had been able to run on generator power.
In the case of the June 29th outage, the datacenter(s) lost power in much the same way, and yet again there was no generator power to keep the systems available. The fact remains that other datacenters in the Ashburn, Virginia region lost grid power and continued to operate without issue on generator power until grid power was restored.
In my opinion Amazon needs to follow its own design principles and avoid failing the same way twice, especially when standard datacenter design should have eliminated the first of the two failures.