Amazon S3 Still Limping & The Limits of Clouds

by bob on February 15, 2008 · 4 comments

in Editorial

It’s Mad-Kitty time … looks like Amazon S3 suffered a fairly substantial, system-wide outage for several hours this morning. Michael Krigsman has a good post on the topic.

As of this writing Amazon is stating that they are no longer experiencing system-wide outages, but there are still some reports of individual outages.

Let’s assume that they get this sorted out over the rest of the day – what does this really mean?

Move On, Move On, …

Just like an old movie (you know the scenes, where some horrible crime has been committed and there are bodies lying all over the street): the first police on the scene always implore the crowd to “move on, there’s nothing to see here,” blah blah.

That pretty much describes Amazon’s response so far … minimal response, hope nobody notices. Just look at some of these comments from S3 customers as they woke up to find themselves down:

Hopefully we will soon get som answers from Amazon…Hi, what is the dead line to fix this inssue, because i have many clients using the S3 service.

Unfortunately I think this is a 100% error rate for us – S3 is just inaccessible. Please keep us posted – when you have an ETA (or figured out the problem/fix) please relay it to us. Given the SLA, etc. we are using S3 in a production environment.

I have to add a major ME TOO here. My business is effectively closed right now because Amazon did something wrong. I’ll have to reconsider using the service now.

having an idea of the ETA to resume ‘normal’ service is essential.

this is really a severe blow to confidence in trusting AWS services.

It goes on like this for pages on pages – and that’s only one forum.

Nicholas Carr thinks that this is just a to-be-expected glitch, an understandable lack of maturity in a forming industry – “I would expect that Amazon will roll out additional tools for monitoring service status and alerting users about problems in fairly short order”. I’m sure their tools will get better, but I’m not quite so sanguine as Carr.

Talk Big, Don’t Do So Much

I think that there are structural reasons for Amazon and other cloud providers to not convey too much information – in a system as big and complex as S3, it becomes very hard for individuals to understand their actual service levels. As I blogged a few months ago when Amazon first offered an SLA for S3, it’s really an SLA without any teeth – mostly a feel-good marketing move, nothing more.

Bottom Line

While I agree with Carr and others that much of what is going on in the name of cloud computing is meaningful, useful, and will continue to grow as an option for the astute purveyor of software-based services, I absolutely take exception to the hazy “they’ll take good care of us someday” thinking that seems to underlie so many such discussions.

In an ironic twist of fate, Greg Olsen, founder and CTO of Coghead, posted yesterday on how the only smart thing to do was to absolutely rely on stuff like EC2 and S3 for your new service. He talked about being able to “deploy a complex, highly available and scalable multi-user software application” and stuff like that. This is certainly mainstream, conventional wisdom – yet it is so incomplete.

After all, when your service is down because the storage cloud has blown your bits to kingdom come, what are you going to tell your customer? How about your board of directors? “… but they PROMISED …”

Sounds pretty lame, doesn’t it?

{ 4 comments }

Anonymous February 15, 2008 at 2:59 pm

Everything has percentages of uptime. You can host on EC2/S3, or you can host on “your own” VPS (e.g., Slicehost), or you can host on your own equipment in a co-lo, or you can host on your own equipment in your own facility on a leased line.

All have uptime percentages. All have dependencies on third parties. The variables are what the uptime percentages are and how much control you have over them (e.g., invest in a second leased line to deal with outages from the first carrier).

After all, when your service is down because the storage cloud has blown your bits to kingdom come, what are you going to tell your customer?

The same thing you’d tell your customer if your VPS provider had an outage. Or if your co-lo provider lost power and ran out of diesel. Or if somebody spilled coffee on your locally-installed server and caused it to flame out. In other words, having it in the “storage cloud” doesn’t change anything, except percentages and control.
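The “percentages and control” point in the comment above can be made concrete with a little arithmetic. What follows is only a rough sketch: the availability figures are hypothetical, the function names are made up for illustration, and the formulas assume that failures are independent, which real outages often are not.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def availability_from_downtime(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_hours / period_hours


def serial(*availabilities):
    """Everything in the chain must be up (e.g. host AND storage)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(*availabilities):
    """Service is up if at least one independent alternative is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    # Hypothetical: one leased line at 99.5% vs. two independent lines.
    print(f"one line:  {redundant(0.995):.6f}")
    print(f"two lines: {redundant(0.995, 0.995):.6f}")
    # Hypothetical: an app that needs both a 99.9% host and a 99.9% store.
    print(f"host + store in series: {serial(0.999, 0.999):.6f}")
    # A single two-hour outage over a whole year.
    print(f"two hours down in a year: {availability_from_downtime(2):.6f}")
```

Adding a dependency in series always lowers the combined figure, while paying for an independent alternative raises it. That is the trade between percentages and control in numbers.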

darkuncle February 15, 2008 at 3:34 pm

ironic that one would have to consider adding redundant hosting and storage services, when S3/EC2 are themselves redundant, by nature … but as today’s outage shows, even relying on a single massively redundant service is, itself, a single point of failure. I’m not sure if there are any competitors in the S3/EC2 space, but if there are, I’d have some kind of hot standby systems available there, in the unlikely (?) event of a total Amazon outage. It would at least cover one’s due diligence requirements enough to avoid looking like an ass to one’s board/CEO/shareholders/customers/etc.

(that said, amazon’s failure to provide a complete and thorough breakdown of the technical details leading up to the outage, and what specifically was done to resolve it, is totally unacceptable. Not providing your customers with that level of explanation, immediately, is worse (IMO) than the outage itself. After all, outages are a fact of life – nothing is 100%. However, there is NO excuse for failing to come clean about exactly what caused it, and what was done to resolve it.)
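The hot-standby idea in this comment might look roughly like the sketch below. It assumes nothing about any real storage API: each provider is just a hypothetical function that takes a key and returns bytes, and `flaky_primary` and `standby` are stand-ins invented for the example.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured storage provider has failed."""


def fetch_with_fallback(key: str,
                        providers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each provider in order and return the first successful result."""
    errors = []
    for fetch in providers:
        try:
            return fetch(key)
        except Exception as exc:  # any failure means "try the next one"
            errors.append(exc)
    raise AllProvidersFailed(
        f"all {len(providers)} providers failed for {key!r}: {errors}"
    )


# Hypothetical stand-ins for a primary store that is down and a standby copy.
def flaky_primary(key: str) -> bytes:
    raise ConnectionError("primary store unavailable")


def standby(key: str) -> bytes:
    return b"copy of " + key.encode()


if __name__ == "__main__":
    print(fetch_with_fallback("logo.png", [flaky_primary, standby]))
```

The read path is the easy half. Keeping the standby copy in sync with the primary is where most of the cost and effort would go, and this sketch does not attempt it.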

Phil Easter February 15, 2008 at 7:40 pm

Last summer my 15 year old greeted me – “Dad! My cell phone is broke and I can’t text my friends!” “mmm.. this is serious. What did you use to do before I bought you the phone?” “I wasn’t able to text back then, dah!!”

Today’s uproar re: the Amazon S3 outage takes me back to that funny moment when my daughter finally got my point – that I enabled her to enjoy the world of texting. And, like many AWS bloggers today, she did not appreciate that I gave her this gift.

So, to put a big picture perspective on today’s outage – most of us start ups, if not for AWS, would have burned thru our angel and round A funds to replicate AWS before we would have hit the tipping point and had the luxury of telling our customers that “we are experiencing an outage.”

Looking back on my old school days of expensive networks, users running out of storage and the constant flow of cash to admin staff, I must admit to having a soft spot for the AWS team and service. In those days, a two hour outage was considered an opportunity for our users to chat with the cube neighbor or go down to the cafeteria for a donut. Fast forward to today’s demanding customers and an outage of minutes starts Armageddon. Now, imagine if by some miracle, these customers actually pay for the start up’s service.

Today, I welcomed the outage as it reinforced my need for AWS. How would my small team respond to an outage? We don’t have the talented staff nor the passion the AWS team has. We forget that Amazon is in the small group of visionary “start-ups” who helped get the net to where we are today.

Phil Easter
CTO/AirMe

Greg H February 16, 2008 at 7:18 am

OH jeez, show me a single shared online hosting service that has had 100% uptime and I’ll show you a herd of unicorns huddled around a pot of gold at the end of a rainbow. Their downtime was just two hours, not “several,” and that was the first time in a year that our assets weren’t accessible. For the cost of their service, that’s an extremely reasonable amount of downtime, but you seem like an extremely negative person that takes every problem as though it were the end of the world.

When my clients called to complain about their sites not working, I simply told them, “Amazon’s entire technical army is working as quickly as possible to resolve the issue and ensure that it doesn’t happen again. It may take a few hours to fix the problem, but they will find the problem and add additional checks to ensure that it doesn’t happen again.” I don’t have a single angry client.

I would’ve even said that to a “board of directors,” but if you get fired for a measly two hours of downtime, then they were looking for a reason to get rid of you anyway.
