Hi ElephantDrive Nation,
We are disappointed to report that we are continuing to experience major problems as a result of technical issues with our instance at Amazon Web Services (AWS) EC2. The outage has sent shockwaves through cyberspace, bringing other notable web sites and services to degraded levels of performance or outright failure, despite multiple layers of redundancy in the affected Zone.
We are working hard to get more information from the AWS team and will share it with our community as soon as we have it. Unfortunately, they are being rather tight-lipped about the nature of the problem and the steps and time to resolution, other than to note that they expect to restore service “in a matter of hours” and that they do not anticipate any data loss.
Their status updates, which we are using to build our recovery model along with guidance from their support staff, are publicly available here: AWS Health Dashboard.
While we anticipate full recovery of the AWS systems, we are taking steps now to release an alternative infrastructure. Regardless of the recovery, we will continue this effort so that any future outage like this can be avoided. While only a rare set of circumstances could have produced this failure, we need to do better and implement full systems redundancy across cloud providers.
Additional coverage on the scope of the incident:
Wall Street Journal: Amazon Cloud Service Suffers Errors, Hurting Websites
Business Week: Amazon Web Services Disruption Knocks Customer Sites Offline
ZDNet: Amazon’s N. Virginia EC2 cluster down, ‘networking event’ triggered problems