Planning for an Outage

Brian Wood Blog

Here is another relevant article published on Oct 18 from 451 Research analysts Jim Grogan and Kelly Morgan — and it points to why AIS considers compliance and SOC 1, 2, 3 audits so crucial for ensuring our enterprise-class standards are met and maintained.

The article brings to mind an adage I learned back in my Navy days about aircraft mishaps and the constant requirement to remain vigilant: “There are those who have and those who will.”

It’s the same with IT infrastructure outages. What matters is how the organization plans to fail over, recover, investigate, mitigate, disseminate the findings, and carry on: wiser, even more vigilant, even more prepared.

The last line sums it all up: “Failure to plan for outages is planning to fail.”

Emphasis in red added by me.

Brian Wood, VP Marketing


Why You Shouldn’t Be Surprised When Technology Fails

“The headline reads, ‘[Fill in the blank] customers disrupted by outage’; are you surprised? T1R hopes not. For all the upside to business growth, a knowledge economy and the general comfort of social living that technology enables, it is a basic fact that technology will break.

All hardware has a fixed lifespan – even if that specific lifespan is unknown. Since the earliest days of computing, software has demonstrated unexpected results at the absolute worst times, such as during peak business cycles or when e-commerce volume reaches new heights.

As Rachel Chalmers told attendees at the recent 451 Group Hosting and Cloud Transformation Summit, we live in the time of an Internet of Things. Devices, from smartphones to RFID chip-enabled inventory, all depend on technology hosted within enterprises and, more and more, within multi-tenant datacenter (MTDC) environments to be effective and achieve their intended purpose.

In the face of the certainty of technology outages, how should risk management be considered?

Technology and Risk Assessment

Risk managers like to speak of ‘residual risk,’ the risk that is left after reasonable efforts to mitigate known threats. Organizations deal with residual risk either by insuring against the potential impact, by accepting the potential impact or by further mitigating that risk, if necessary, so that the potential impact falls within acceptable levels.

Perhaps a more interesting element to consider is ‘inherent risk.’ Inherent risk is present within all business processes, within all technology; in fact, it is present within all of life. A key to understanding the concept of inherent risk is to understand that it exists as a risk even if you have not looked for it, named it or attempted to mitigate its potential impact. There exists tremendous inherent risk within the context of MTDC computing. The goal of every risk management program is to identify the most likely risks and, once they are identified, to thoughtfully mitigate them based on a priority scheme that helps to ensure that the leadership team of an organization can accept the risk that remains. Simply put, this is what happens when a business impact analysis and a risk assessment are performed.
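The relationship between inherent risk, mitigation and the residual risk that leadership must accept can be sketched in a few lines of Python. This is only an illustration: the likelihood-times-impact score, the sample risks, the mitigation fractions and the acceptance threshold are all hypothetical and come neither from the article nor from any standard.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float  # assumed chance of occurring in a year, 0..1
    impact: float      # assumed cost if it occurs, arbitrary units
    mitigation: float  # assumed fraction of exposure removed by controls, 0..1

    @property
    def inherent(self) -> float:
        # Exposure that exists whether or not anyone has looked for it.
        return self.likelihood * self.impact

    @property
    def residual(self) -> float:
        # Exposure left after the mitigation already in place.
        return self.inherent * (1.0 - self.mitigation)


def prioritize(risks, threshold):
    """Return risks whose residual exposure exceeds the threshold, worst first."""
    over = [r for r in risks if r.residual > threshold]
    return sorted(over, key=lambda r: r.residual, reverse=True)


# Hypothetical register entries, for illustration only.
risks = [
    Risk("single power feed", 0.20, 500.0, 0.90),
    Risk("single network carrier", 0.30, 300.0, 0.50),
    Risk("single site", 0.05, 2000.0, 0.00),
]

for r in prioritize(risks, threshold=20.0):
    print(f"{r.name}: inherent={r.inherent:.0f}, residual={r.residual:.0f}")
```

Anything the function returns is a risk that still needs further mitigation, insurance or an explicit decision to accept it; anything it filters out already falls within the assumed acceptable level.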

Ostrich or Microscope?

Organizations have choices in dealing with risk: at the extremes, you can follow the ostrich approach and bury your head in the sand until danger moves away, or you can put your organization under a microscope and attempt to understand and name every potential risk to your continued operation. Finding the middle ground is the challenge of every business leader.

Additionally, business management knows that in many cases there are obligations to assess risks. These obligations may be driven by regulations within a certain vertical industry, such as financial services, banking or healthcare; they may also stem from the public ownership of a business, where federal regulations demand due diligence to protect the investment of stockholders. Finally, managing risks may be essential in order to meet commercial obligations. The question becomes how to accomplish the goal of managing risk.

Standards as Roadmaps

For the past several years, the British Standards Institution’s BS 25999 stood as a guidepost for how organizations should consider risk. In May 2012, this standard was to a degree overshadowed by the approval and publication of the International Standard Societal Security – Business Continuity Management Systems – Requirements (ISO 22301). This standard has particular value in that its framework was formulated, reviewed and approved by global experts on continuity practices. It calls for both business impact analysis (BIA) and risk assessments to be performed and kept up to date (Section 8.2).

Why are such assessments important? Besides the fact that they serve as a foundation for determining what risks ought to be mitigated and to what degree, for many industries they are also a legal requirement. HIPAA/HITECH specify risk assessments as ‘required’ (45 CFR 164.308); financial organizations are required to perform and maintain risk assessments under both SEC and FFIEC guidelines. On the international stage, the Basel II Accord mandates similar assessment, controls and management of operational risk.

Following a framework that has international recognition is a safety net for many organizations, whether they operate locally, regionally or across the globe. The inverse is also true: ignoring the practical wisdom found in this and similar standards leaves an organization in an indefensible position should a technology failure affect the continuity of operations.

A fact to consider if you are still uncertain whether your organization is at risk: even the best-run, most vigilant organizations experience technology outages on a regular basis.

Planning for an Outage

Not every outage needs to become a disaster. We live in an era where resilient engineering has become the norm for mission-critical systems and applications. MTDC vendors across the globe host mission-critical applications, and they partner with their customers to effectively manage the risk of outages.

Recently, AWS made the news as packets within its network were being dropped, causing some customers to lose effective processing capability. In September, some customers of The Go Daddy Group were affected by an outage. In June, customers of large and small MTDC vendors were at risk in the mid-Atlantic when severe storms disrupted both power and telecommunication lines.

The key point is that not all customers were affected. In conversation, MTDC vendors large and small report a trend across their customer base toward engineering resilient configurations to help ensure continuous operation.

Redundant power helps to ensure continuous operation when an outage occurs on one power feed; redundant networks help ensure alternate paths out of the datacenter when an outage occurs on one carrier. Storage and server replication help ensure that processing from an alternate location can seamlessly satisfy application demands when a primary location experiences a failure. In each example cited, an outage occurs and processing continues; redundant engineering becomes an appropriate technology investment based on the BIA and risk assessment. In these examples, organizations have planned for an outage effectively.
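The effect of redundancy can be illustrated with the standard formula for components in parallel, where service continues as long as at least one copy is up. The sketch below assumes that failures are independent, which is an optimistic simplification (a single storm can take out both feeds), and the 99.5% figure is a made-up per-component availability rather than a number from any vendor.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes failures are independent; correlated failures make the real
    figure lower than this estimate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical example: power feeds that are each up 99.5% of the time.
for n in (1, 2, 3):
    print(f"{n} feed(s): {parallel_availability(0.995, n):.6%}")
```

Under these assumptions, a second feed cuts expected unavailability from 0.5% to 0.0025%. The gain from each further copy shrinks quickly, which is why the BIA and risk assessment should drive how much redundancy is worth paying for.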

Service Level Agreements Are Not Engineering Tools

A common misconception is that service level agreements offered by vendors ensure system availability. By themselves, SLAs are not engineering tools that guarantee availability; rather, they stand as commercial agreements that provide for recourse when an outage occurs. Whether an SLA offers 99.5% or 100% availability is simply a measure of the business risk a vendor is willing to assume. That said, the higher the availability SLA, the more likely it is that the vendor has established procedures that improve operational availability.
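To make the SLA percentages concrete, the short sketch below converts an availability figure into the downtime it permits. It assumes a 365-day year by default; the measurement window is a parameter because the period over which a contract measures availability varies, and the function name is purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an availability SLA permits within the period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (1.0 - sla_percent / 100.0)


for sla in (99.5, 99.9, 99.99, 100.0):
    print(f"{sla}% SLA permits {allowed_downtime_hours(sla):.2f} hours per year")
```

A 99.5% SLA measured over a year permits roughly 43.8 hours of downtime before any recourse applies, while a 100% SLA permits none. Neither number says anything about how the vendor actually engineers for availability.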

It is incumbent upon MTDC vendors to expect questions concerning how they meet any SLA thresholds, and to anticipate buyers’ interest in understanding the underlying operational and business procedures that support high levels of availability.

Similarly, it is incumbent upon the customer, when contracting for MTDC services, to take the time to engineer reasonable configurations to meet its availability goals. A single colocation facility may be sufficient for an application that can tolerate occasional downtime and that may be protected by disaster-recovery procedures to restore operations within 12-24 hours. Mission-critical applications, which demand less downtime or have a goal of zero downtime, call at a minimum for multiple sites, engineered with appropriate levels of storage and server replication to permit continuous operation in the face of an outage affecting one location. For certain life-or-death or ‘bet-the-business’ applications, prudent risk mitigation may call for more than two processing locations to allow for the potential cascading effect of failures.

T1R Take

It’s a fact that both enterprise and MTDC systems will experience outages. When the news reports are announcing the outage, it doesn’t matter whether the root cause was Mother Nature or human error; what matters most at that time is whether each customer, in conjunction with its trusted vendors, has planned for failures with appropriate levels of resilience engineered into its systems.

Failure to plan for outages is planning to fail.”