Amazon Web Services Inc. (AWS) experienced an outage in its US East region (Ashburn) on Friday that reportedly affected hundreds of enterprise services.
An Amazon spokesperson stated that a power outage at one of AWS’s redundant Internet connection points in Virginia caused connectivity issues for a small group of AWS customers using Direct Connect services in AWS US East. “AWS has resolved the issue and is working closely with its partner to prevent it from recurring,” the spokesperson said.
The “partner” was not named. ThousandEyes, an Internet monitoring company, reported that hundreds of critical enterprise services were affected by the outage, including Twilio, Slack, and Atlassian. According to Downdetector, even Alexa, Amazon’s personal assistant that works with Amazon Echo speakers, was affected.
The AWS Service Health Dashboard provided the following timeline:
7:29 AM PST: We are investigating increased packet loss that could affect AWS Direct Connect customers in US-EAST-1.
8:03 AM PST: We continue to work towards resolving this issue. Direct Connect connections from Equinix DC1-DC6 & DC10-DC12 (Ashburn, VA) and CoreSite VA1-VA2 (Reston, VA) are affected by this issue.
9:43 AM PST: Connections in CoreSite VA1-VA2 (Reston, VA) and some connections in Equinix DC1-DC6 & DC10-DC12 (Ashburn, VA) are inactive. Inactive connections aren’t receiving advertised routes from Direct Connect routers. We are currently working to restore service to these Direct Connect connections. The AWS VPN service is operating normally and may be an option for some workloads.
11:21 AM PST: On March 2, 2018, some customers in the CoreSite VA1 and VA2 (Reston, VA) Direct Connect locations and the Equinix DC1-DC6 & DC10-DC12 (Ashburn, VA) Direct Connect locations experienced a loss of connectivity to the US-EAST-1 region. The issue has been resolved and the service is operating normally.
Ironically, Friday’s outage came almost exactly one year after a widely reported outage that was blamed on an incorrectly entered command.
ThousandEyes stated that Friday’s outage was a stark reminder of the cloud’s vulnerability. Enterprises are rapidly adopting cloud-first strategies and moving workloads to IaaS providers such as AWS, yet many organizations don’t fully understand the unpredictable dependencies that accompany this shift.
According to the company, the outage affected more than 240 critical services that depend on AWS infrastructure.
ThousandEyes stated that this episode serves as a powerful reminder of the complexity and interconnectedness of the cloud. “Outages or natural disasters in one area of the cloud can quickly spread to other parts. Cloud vendors offer many ways to connect directly into their infrastructure. They do not protect you from external dependencies like the Internet. Although availability zones provide some redundancy, outages such as these can quickly overwhelm entire clusters of data centers.”
The network monitoring specialist recommended that organizations include geographical redundancy in their fault tolerance strategy. It also suggested using network monitoring services to help mitigate the impact of outages.
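As a rough illustration of the geographic redundancy idea, the sketch below picks the first reachable region from a preference list. It is only an assumption-laden example, not anything ThousandEyes or AWS published: the function name, the region order, and the simulated probe are all hypothetical, and a real deployment would replace the probe with an actual health check against its own endpoints.

```python
from typing import Callable, Iterable, Optional


def first_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if all fail."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises (timeout, DNS failure, etc.) counts as unhealthy.
            continue
    return None


# Hypothetical preference order: primary region first, then fallbacks.
PREFERRED_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def simulated_probe(region: str) -> bool:
    # Stand-in for a real health check; pretends us-east-1 is unreachable.
    return region != "us-east-1"


if __name__ == "__main__":
    print(first_healthy_region(PREFERRED_REGIONS, simulated_probe))
```

With the simulated probe, traffic would be directed to the second region in the list. Keeping the probe as a parameter lets the same selection logic be driven by whatever monitoring signal an organization already trusts.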
