Brian Wood Blog

By Caron Carlson in FierceCIO covering an article by Brandon Butler in Network World (also re-posted below).

Considering AWS for your cloud? Please read below and also check out AIS CTO Steve Wallace’s recent blog post.

Emphasis in red added by me.

Brian Wood, VP Marketing


Gartner: AWS, HP have worst cloud SLAs

The award for the “worst SLA” of any major cloud provider goes to Amazon (NASDAQ: AMZN) Web Services, according to Gartner analyst Lydia Leong, but HP’s (NYSE: HPQ) public cloud SLA is not necessarily better. Both providers have strict requirements for users in architecting their cloud systems if they want the SLAs to apply when service is disrupted, reports Brandon Butler at Network World.

“Amazon’s SLA gives enterprises heartburn,” Leong wrote in a recent blog post. “HP had the opportunity to do significantly better here, and hasn’t.”

In addition to the costly service architecture requirements, the SLAs for AWS and HP cloud services are unnecessarily complex and limited, she complained. For example, neither SLA covers block storage services.

For the AWS SLA to take effect, customers are required to build their systems so that applications run across a minimum of two of the provider’s data centers, known as availability zones. The HP SLA only goes into effect if all of its availability zones are down, which means customers have to build their applications to cover three or more availability zones.

The costs of running applications across more than one availability zone add up. What’s more, Leong wrote, the requirements add complexity to the systems. “Most people are reasonably familiar with the architectural patterns for two data centers; once you add a third and more, you’re further departing from people’s comfort zones, and all HP has to do is to decide they want to add another AZ in order to essentially force you to do another bit of storage replication if you want to have an SLA,” she wrote.

Because of the service architecture requirements, it isn’t likely that AWS or HP customers would be reimbursed for downtime in a meaningful way under the SLAs, Leong warned. However, other infrastructure-as-a-service providers offer considerably better SLAs.


Gartner: AWS, HP have worst cloud SLAs

Gartner analyst Lydia Leong says cloud market leader AWS also has one of the worst SLAs, but HP’s new offering is giving it a run for its money

Amazon Web Services, which Gartner recently named a market leader in infrastructure-as-a-service cloud computing, has the “dubious status of ‘worst SLA (service level agreement) of any major cloud provider,’” analyst Lydia Leong blogged today, but HP’s newly available public cloud service could be even worse.

HP launched the general availability of its HP Compute Cloud on Wednesday along with an SLA. Both AWS and HP impose strict guidelines on how users must architect their cloud systems for the SLAs to apply in the case of service disruptions, leading to increased costs for users.

AWS’s SLA, for example, requires customers to have their applications run across at least two availability zones (AZs), which are physically separate data centers that host the company’s cloud services. Both AZs must be unavailable for the SLA to kick in. HP’s SLA, Leong reports, only applies if customers cannot access any AZs. That means customers potentially have to architect their applications to span three or more AZs, each one imposing additional costs on the business. “Amazon’s SLA gives enterprises heartburn. HP had the opportunity to do significantly better here, and hasn’t. To me, it’s a toss-up which SLA is worse,” Leong writes.
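
The two applicability rules described above can be sketched in a few lines of Python. This is only an illustration of the article's summary, not the legal text of either SLA: the `ZoneStatus` type and both function names are hypothetical, and the HP rule reflects the initial report (Leong's follow-up post, reproduced later on this page, refines it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneStatus:
    """Hypothetical record for one availability zone (AZ) in a region."""
    name: str
    in_use: bool      # the customer runs instances in this AZ
    available: bool   # the AZ is currently reachable


def aws_sla_applies(zones):
    """Simplified reading of the AWS rule as the article describes it:
    the customer must run in at least two AZs, and at least two of the
    AZs they run in must be unavailable."""
    used = [z for z in zones if z.in_use]
    down = [z for z in used if not z.available]
    return len(used) >= 2 and len(down) >= 2


def hp_sla_applies(zones):
    """Simplified reading of the HP rule as first reported:
    every AZ in the region must be unavailable."""
    return len(zones) > 0 and all(not z.available for z in zones)


# Example: two of three AZs are down, and the customer runs in those two.
region = [
    ZoneStatus("az-1", in_use=True, available=False),
    ZoneStatus("az-2", in_use=True, available=False),
    ZoneStatus("az-3", in_use=False, available=True),
]
print(aws_sla_applies(region))
print(hp_sla_applies(region))
```

In this example the AWS-style rule is satisfied while the HP-style rule is not, because a third AZ is still reachable. That is the scenario behind Leong's point that HP customers may have to span three or more AZs.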

Cloud SLAs are an important topic, as recent outages from providers like AWS have shown. AWS has experienced three major outages in the past two years, including a recent one that took down sites such as Reddit, Imgur and Airbnb. Each of AWS’s outages has been limited in scope, however, and most have centered on the company’s Northern Virginia US-East region.

AWS’s policy of requiring users to run services across multiple AZs costs users more money than running applications in a single AZ. “Every AZ that a customer chooses to run in effectively imposes a cost,” Leong writes. HP’s SLA, which requires all of the AZs to be down before the SLA applies, leaves customers vulnerable, she says. “Most people are reasonably familiar with the architectural patterns for two data centers; once you add a third and more, you’re further departing from people’s comfort zones, and all HP has to do is to decide they want to add another AZ in order to essentially force you to do another bit of storage replication if you want to have an SLA.”

The SLA requirements basically render the agreements useless. “Customers should expect that the likelihood of a meaningful giveback is basically nil,” she says. If users are truly interested in protecting their systems and receiving financial compensation for downtime events, she recommends investigating cyber risk insurance, which will protect cloud-based assets. AWS has recently allowed insurance inspectors into its facilities to inspect its data centers for such insurance claims, she notes.

Strict service architecture requirements aren’t the only aspect of the SLAs Leong takes issue with. She also finds them unnecessarily complex, calling them “word salads,” and limited in scope. For example, both the AWS and HP SLAs cover virtual machine instances but not block storage services, which are popular features used by enterprise customers. AWS’s most recent outage specifically affected its Elastic Block Store (EBS) service, which is not covered by the SLA. “If the storage isn’t available, it doesn’t matter if the virtual machine is happily up and running — it can’t do anything useful,” Leong writes.

Leong does note that AWS has voluntarily refunded customers impacted by major downtime events even when the SLA did not require it.

Not all IaaS cloud SLAs are as bad as those of AWS and HP, she concludes. “The norm in the IaaS competition is actually strong SLAs with decent givebacks, that don’t require you to run in multiple data centers,” she writes. Dimension Data, for example, has a per-VM SLA with 100% uptime. That compares to a 99.95% uptime guarantee from AWS and HP, which only kicks in after at least five minutes of an outage. AWS and HP also cap the percentage of a customer’s bill that can be refunded for downtime, whereas Dimension Data will refund up to 100%.
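
To put the 99.95% figure in perspective, the short Python sketch below converts an uptime percentage into the downtime it tolerates over a billing period. The function name and the 30-day and 365-day periods are assumptions chosen for illustration; the SLAs themselves define their own measurement periods, and the five-minute threshold and refund caps mentioned above are not modeled.

```python
def allowed_downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted by an uptime guarantee over a period.

    Illustrative arithmetic only; real SLAs define their own measurement
    windows and exclusions.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# A 99.95% guarantee tolerates roughly 21.6 minutes in a 30-day month
# and roughly 262.8 minutes (about 4.4 hours) in a 365-day year.
print(round(allowed_downtime_minutes(99.95, 30), 1))
print(round(allowed_downtime_minutes(99.95, 365), 1))

# A 100% guarantee tolerates none.
print(allowed_downtime_minutes(100, 30))
```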

HP did not immediately provide a comment in response to Leong’s claims.


Here is an updated post by Gartner analyst Lydia Leong based on new info from HP:

Some clarifications on HP’s SLA

I corresponded with some members of the HP cloud team in email, and then colleagues and I spoke with HP on the phone, after my last blog post, “Cloud IaaS SLAs can be Meaningless.” HP provided some useful clarifications, which I’ll detail below, but I haven’t changed my fundamental opinion, although arguably the nuances make the HP SLA slightly better than the AWS SLA.

The most significant difference between the SLAs is that HP’s SLA is intended to cover a single-instance failure, where you can’t replace that single instance; AWS requires that all of your instances in at least two AZs be unavailable. HP requires that you try to re-launch that instance in a different AZ, but a failure of that launch attempt in any of the other AZs in the region will be considered downtime. You do not need to be running in two AZs all the time in order to get the SLA; for the purposes of the SLA clause requiring two AZs, the launch attempt into a second AZ counts.

HP begins counting downtime when, post-instance-failure, you make the launch API call that is destined to fail — downtime begins to accrue 6 minutes after you make that unsuccessful API call. (To be clear, the clock starts when you issue the API call, not when the call has actually failed, from what I understand.) When the downtime clock stops is unclear, though — it stops when the customer has managed to successfully re-launch a replacement instance, but there’s no clarity regarding the customer’s responsibility for retry intervals.
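
The downtime clock described above can be sketched as follows. This is a minimal illustration of the rule as Leong relates it, not HP's actual accounting: the function and constant names are hypothetical, and because the post notes that the end of the clock is unclear, the sketch simply assumes it stops at the first successful re-launch.

```python
from datetime import datetime, timedelta

# Per the description above, downtime starts accruing 6 minutes after
# the customer issues the launch API call that goes on to fail.
ACCRUAL_DELAY = timedelta(minutes=6)


def accrued_downtime(failed_launch_call_at, relaunch_succeeded_at):
    """Downtime credited under the rule as described (assumption: the
    clock stops at the first successful re-launch)."""
    clock_start = failed_launch_call_at + ACCRUAL_DELAY
    # No downtime accrues if the replacement came up within the delay.
    return max(relaunch_succeeded_at - clock_start, timedelta(0))


call = datetime(2012, 12, 6, 12, 0)

# Replacement instance launched 30 minutes after the failed call.
print(accrued_downtime(call, call + timedelta(minutes=30)))

# Replacement instance launched 4 minutes after the failed call.
print(accrued_downtime(call, call + timedelta(minutes=4)))
```

In the first case 24 minutes of downtime accrue; in the second, none do. The date used is arbitrary.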

(In discussion with HP, I raised the issue of this potentially resulting in customers hammering the control plane with requests in mass outages, along with intervals when the control plane might have degraded response and some calls succeed while others fail, etc. — i.e., the unclear determination of when downtime ends, and whether customers trying to fulfill SLA responsibilities contribute to making an outage worse. HP was unable to provide a clear answer to this, other than to discuss future plans for greater monitoring transparency, and automation.)

I’ve read an awful lot of SLAs over the years — cloud IaaS SLAs, as well as SLAs for a whole bunch of other types of services, cloud and non-cloud. The best SLAs are plain-language comprehensible. The best don’t even need examples for illustration, although it can be useful to illustrate anything more complicated. Both HP and AWS sin in this regard, and frankly, many providers who have good SLAs still force you through a tangle of verbiage to figure out what they intend. Moreover, most customers are fundamentally interested in solution SLAs — “is my stuff working”, regardless of what elements have failed. Even in the world of cloud-native architecture, this matters — one just has to look at the impact of EBS and ELB issues in previous AWS outages to see why.