Azure outage raises questions about public cloud for mission-critical apps
Microsoft Azure has been hit by yet another global outage, this time impacting even its own online Office services and websites. The outage raises serious questions about whether these kinds of public cloud platforms are ready for mission-critical workloads.
Early this morning or late yesterday, depending on where you’re based, Azure experienced connectivity issues across services including Storage, Websites and Visual Studio Online, with some users also having trouble accessing SharePoint Online and other Office apps. Xbox Live seems to have been hit by similar outages, too.
The outage has seemingly impacted every region except Australia, Azure’s latest availability zone addition, and has even briefly affected some of Microsoft’s own websites.
The company said it’s still in the process of investigating and resolving the issue. Most of the services are back online except for Application Insights for app monitoring, which is down globally. In North and Western Europe, customers are still experiencing issues with their virtual machines.
This is the second large-scale Azure outage in less than three months. In August, the company’s cloud services experienced significant outages in the US, Japan and Brazil simultaneously, which in some cases went on for nearly a week.
What are enterprises to make of this? Recently published research from Infosys suggests about four in five enterprises plan to move their mission-critical workloads into the cloud, which, in light of recent events, is a worrying sign.
It has been about 12 hours since the outages began, and if you’re hosting an application in a virtual machine in Europe, you’re likely still experiencing severe latency or an outright outage, even though the Azure status page currently leads with “All good! Everything is running great”. Everything is not running great.
Most SLAs for Azure allow for just over 50 minutes of downtime annually, a limit that in many cases has already been exceeded several times over, depending on the geography in which a user’s workload is hosted. Microsoft hasn’t yet commented on how it will compensate affected users, but going by its standard SLAs and previous outages, service credits are likely on the table.
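To put the downtime figure in context, here is a rough sketch of the arithmetic behind an uptime guarantee. It assumes the "just over 50 minutes" figure corresponds to a 99.99% annual uptime target, which is an inference from the number rather than something taken from Microsoft's SLA documents. The function name and the other targets shown are purely illustrative.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the annual downtime, in minutes, that an uptime target permits."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

# Illustrative targets only; 99.99% is the assumed match for "just over 50 minutes".
for target in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.1f} minutes of downtime per year")
```

Under that assumption, a 99.99% target leaves roughly 52.6 minutes a year, so a single outage lasting about 12 hours (720 minutes) would consume that annual allowance more than a dozen times over.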
Microsoft certainly isn’t alone in experiencing global outages of its cloud services – AWS users have had their fair share. Still, with so many active regions, one would expect the redundancy built into Azure to be more robust. And there will come a time when the service credit approach, which is tantamount to saying “we’re sorry for the crap service – here’s some more crap service,” simply won’t placate customers.
So much commercial activity now relies heavily on online channels, a growing number of internal applications – for sales, marketing, HR, and productivity – are deployed via the cloud, and many of these applications still depend on complex service chains. Against that backdrop, these outages raise serious questions about the limits of public cloud, for all its seemingly infinite scalability and elasticity, when it comes to mission-critical applications.