
Alleviating the effects of cloud outages in advance is valuable for companies

August 14, 2017

Cloud outages are a dreadful prospect to contemplate. Many businesses have entrusted their data storage, computing operations and communications to public or hybrid cloud service providers. Of course, private clouds (by which we mean on-premises deployments) can also suffer outages. However, when the cause lies within the enterprise, reducing the risks or alleviating the effects in advance is part of a more complex internal strategy.

Therefore, this article focuses on preparing in advance for public cloud outages. These outages cannot be controlled from within the company. Enterprises that are clients of cloud providers are, in this respect, merely cloud users. When the cloud goes down they are taken by surprise, if not completely infuriated.

The main goal is to minimize losses. Even when responsibility and liability are established and followed through, some consequences can prove irreparable. So what can companies do to prepare themselves for possible cloud outages? How can they diminish the unpleasant consequences?

Learning from cloud outages

It is always better to learn from the mistakes or mishaps of others than to go through the same issues yourself. An AWS cloud outage, for example, became the basis for a specialized analysis. The question is what companies can do to prevent or minimize the losses. Proactive measures consist of:

  • Determining where your data is located (multiple locations are preferable to a single one, since an outage could completely block access to enterprise data stored in a single location, whereas a deployment across multiple locations allows fail-over to a second device and continuity of operations);
  • Considering smaller cloud service providers, which are eager to showcase differentiating technologies and services, and exploring each provider’s service package;
  • Designing your cloud-based computer system with redundancy and fault tolerance in mind; building resiliency into the software applications and into development operations could save companies a lot of trouble.
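
The fail-over idea from the first point can be illustrated with a minimal Python sketch. It assumes each storage location is exposed as a simple fetch function; the names read_with_failover, primary_region and secondary_region are hypothetical stand-ins for illustration, not any provider’s API.

```python
from typing import Callable, Sequence


class AllLocationsFailed(Exception):
    """Raised when no storage location could serve the request."""


def read_with_failover(key: str, locations: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each location in order and return the first successful read."""
    errors = []
    for fetch in locations:
        try:
            return fetch(key)
        except Exception as exc:  # an outage is assumed to surface as an exception
            errors.append(exc)
    raise AllLocationsFailed(f"{len(errors)} location(s) failed for {key!r}")


# Hypothetical stand-ins for two storage locations.
def primary_region(key: str) -> bytes:
    raise ConnectionError("primary region is down")


def secondary_region(key: str) -> bytes:
    return b"data for " + key.encode()


# The primary fails, so the read is served by the secondary location.
result = read_with_failover("invoice-42", [primary_region, secondary_region])
```

With a single location in the list, the same outage would raise AllLocationsFailed, which is exactly the situation the first point warns about.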

Easy steps in preparing in advance

The same stance mentioned above (taking measures in advance and strategizing to better survive possible outages) can be broken down into a few essential steps:

  • Study your provider’s backup policy in detail before signing up for its services;
  • Maintain your self-reliance by instituting your own on-premises backup;
  • Consider (and adopt) hybrid cloud solutions;
  • Perform regular test restores for data recovery in the event of an outage.
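
As a rough illustration of the last step, the sketch below copies a file to a backup location, restores it to a scratch directory and compares checksums. It is only a sketch under simple assumptions: local directories stand in for the on-premises backup target, and the function names and the sample file are made up for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_test_restore(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up a file, restore it elsewhere, and check the restored copy matches."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    restore_dir.mkdir(parents=True, exist_ok=True)
    backup_copy = Path(shutil.copy2(source, backup_dir / source.name))
    restored = Path(shutil.copy2(backup_copy, restore_dir / source.name))
    return sha256_of(restored) == sha256_of(source)


# Demonstration with a throwaway directory and a made-up sample file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "customers.csv"
    original.write_text("id,name\n1,Ada\n")
    restore_ok = backup_and_test_restore(original, root / "backup", root / "restore")
```

Running a check like this on a schedule, rather than once, is what turns a backup into a tested restore.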

Strategize, or else… (Netflix’s example)

The same AWS outage from 2015 made some users rethink their strategies for coping with possible disruptions. One such approach qualifies as a “chaos engineering” tactic.

In fact, this strategy goes beyond implementing theoretical measures. The companies that employ such methods deliberately induce failures in their systems to simulate naturally occurring incidents. The purpose is to determine which problems the organization needs to solve. Netflix may well inspire others where this type of approach is concerned. Its declared goal was to make its systems “resilient to any of the underlying dependencies”.

Of course, this approach involves extra costs, most of them related to duplicating the storage tier. Simulations need to take place on an extra technology layer, even if it is only a duplicate software layer, so as not to risk disturbing real-time operations.
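
A toy version of such failure injection might look like the Python sketch below. It is not Netflix’s tooling: fetch_catalog is a hypothetical dependency, the 50% failure rate is arbitrary, and the wrapper is meant for a test or duplicate layer rather than live traffic.

```python
import random
from typing import Callable, List


def with_injected_failures(call: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap a dependency so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure (simulated outage)")
        return call(*args, **kwargs)
    return wrapper


def fetch_catalog() -> List[str]:
    """Hypothetical dependency that normally lives in the cloud."""
    return ["title-a", "title-b"]


def catalog_with_fallback(fetch: Callable[[], List[str]], cached: List[str]) -> List[str]:
    """The behavior under test: fall back to cached data when the dependency fails."""
    try:
        return fetch()
    except ConnectionError:
        return cached


rng = random.Random(0)  # seeded so the experiment is repeatable
flaky_fetch = with_injected_failures(fetch_catalog, failure_rate=0.5, rng=rng)
cached = ["title-a"]
results = [catalog_with_fallback(flaky_fetch, cached) for _ in range(100)]
served_from_cache = sum(1 for r in results if r is cached)
```

If any of the 100 calls had raised instead of returning data, the experiment would have exposed a dependency the system is not yet resilient to.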

Some pointed out that the provider should have intervened to redirect the affected cloud services to a different data center as soon as possible. However, cloud service customers cannot afford to wait until such a policy is enforced. Their own operations and customers depend on a quick recovery. Duplicating data and testing on-premises recovery measures might be costly, but they can prove extremely useful.

Takeaways on ensuring company data is always available

Preparing for the worst is something organizations may find unpleasant. Yet, even when they handpick the most reliable cloud providers, this way of thinking is the most prudent. Organizations can minimize the effects of cloud outages through their own efforts. A few important takeaways would be:

  • Never rely solely on Service Level Agreements; instead, complete your own disaster recovery plan;
  • Prepare a “service escrow” component for the above-mentioned plan;
  • Make your entire staff active participants in the Service Disaster Recovery Plan;
  • Ensure efficient data access, for example through end-user backups and data storage in multiple locations; there are workarounds that can give users at least partial access when the system is hit by a cloud outage, but access to the core data sets is essential for them to work;
  • Prepare multiple communication channels so that updates and status reports are promptly delivered to your customers in case of an unpleasant event, in order to safeguard at least the customer relations part of your reputation; cloud outages that do not depend on your company perhaps cannot be helped, but the way the aftermath is handled can smooth out the consequences and show that you are doing all that is possible to speed up the remediation;
  • Understand fully how cloud computing works beforehand, which is one of the most important steps in advance preparation; it is critical to get informed in time, to analyze case scenarios in order to anticipate what might happen during an outage, and to have your bases covered where your own customers’ questions are concerned.
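
The point about multiple communication channels can also be sketched in code. The Python example below assumes each channel is a plain function; status_page and email are hypothetical placeholders. It sends the same update through every channel and records which ones succeeded, so one broken channel does not block the others.

```python
from typing import Callable, Dict, List, Tuple


def broadcast_status(message: str, channels: Dict[str, Callable[[str], None]]) -> Dict[str, bool]:
    """Send a status update on every channel and report which deliveries succeeded."""
    delivered = {}
    for name, send in channels.items():
        try:
            send(message)
            delivered[name] = True
        except Exception:  # a channel hit by the same outage must not stop the rest
            delivered[name] = False
    return delivered


outbox: List[Tuple[str, str]] = []


def status_page(message: str) -> None:
    """Hypothetical channel hosted independently of the affected cloud."""
    outbox.append(("status_page", message))


def email(message: str) -> None:
    """Hypothetical channel that happens to depend on the affected cloud."""
    raise ConnectionError("mail relay is hosted in the affected cloud")


report = broadcast_status(
    "We are investigating a service disruption.",
    {"status_page": status_page, "email": email},
)
```

The returned report shows at a glance which channels still reach customers, which is worth knowing before an outage rather than during one.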