How quickly did you answer that question? Is it the tenant? Is it the colocation or cloud provider? Or does it all need to be spelled out in an SLA? If you picked one of those responses, you’re not wrong. But the answer needs some clarification.
Over the past few years, I’ve seen easy-to-fix efficiency and even resiliency issues in data center environmental management go unnoticed. Will some of these issues cause an outage? No, probably not. But will a missing blanking panel or an improperly sized air handler cause other issues? Most likely they will.
At a recent Data Center Dynamics conference in San Francisco, I was involved in very lively conversations around data center management and responsibility. Just because airflow management (AFM) is spelled out in an SLA doesn’t mean it’s actually designed properly. Sure, the data center facility itself may be operating well, but who is in charge of your racks? Have you really done a good job creating an efficient airflow architecture for your data center?
There Are Different Kinds of Data Center Partners
I’ve learned quickly that there are a lot of data centers out there. Many offer unique services like data migration or backup, while others position themselves as leading hyperscale cloud providers. The point is that none of them are built the same, and each can have its own management structure. I’ve seen data center partners that are super hands-on during the entire migration and engineering process. Others simply send a security guard along with you to your cage and let you work: they’ll provide the space, power, and cooling, but the rack design and buildout are all up to you.
It’s a Joint Responsibility
The big “however” is that the customer or tenant needs to remain constantly vigilant, or be sure to work with a partner that has efficiency and design in its DNA. It’s important to remember that not everything is always within your control. In 2017, the Microsoft Azure cloud in Japan experienced a massive outage. What happened?
The cooling system and the power distribution system were designed with typical redundancy built in. The cooling system was N+1, meaning there was one extra cooling unit available in case one failed. The power distribution system was running at N+2, but one UPS in the parallel N+2 lineup failed, and power was cut off to the entire cooling system in the data center.
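To make the redundancy math concrete, here is a minimal sketch in Python. The unit counts and the single shared power dependency are illustrative assumptions, not Azure’s actual topology; the point is that N+k redundancy only protects against failures that stay inside the subsystem it was designed for.

```python
# Illustrative sketch only -- unit counts and the cross-system dependency
# are assumptions for demonstration, not the actual Azure facility design.

def survives_failures(installed: int, required: int, failed: int) -> bool:
    """An N+k system survives as long as surviving units still meet demand."""
    return installed - failed >= required

# Cooling designed N+1: 4 units installed where 3 carry the load.
print(survives_failures(installed=4, required=3, failed=1))  # True -- one unit can fail

# Power designed N+2: 5 UPS units installed where 3 carry the load.
print(survives_failures(installed=5, required=3, failed=1))  # True -- on paper

# But the math assumes failures are independent. If the failing UPS sits
# upstream of the one feed powering every cooling unit, its failure takes
# out all cooling at once -- equivalent to failing every cooling unit:
cooling_units = 4
print(survives_failures(installed=4, required=3, failed=cooling_units))  # False
```

That last line is the Japan outage in miniature: the cooling plant’s N+1 margin was irrelevant once a single upstream failure cut power to every cooling unit at the same time.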
From there, a long list of services was impacted, including storage and virtual machines, along with many more cloud services such as Web Apps, Backup, HDInsight, Key Vault, and Site Recovery. Issues included unavailability of virtual machines and VM reboots.
“Engineers have identified the underlying cause as loss of cooling which caused some resources to undergo an automated shutdown to avoid overheating and ensure data integrity and resilience,” read the statement posted to the Microsoft Azure Service status page shortly after the outage.
It’s important to note that the data center is managed by a third-party vendor, not Microsoft.