As I have commented before, I am a big fan of Google's basic approach to scalable computing. There is much to like - enormous scale on commodity gear, rapid deployment of applications, and so on.
Yet it is by no means perfect.
In particular, there is a chronic level of failure in (at least) some of the flagship services that should not be acceptable in any modern-day offering, least of all one that is a standard part of many people's workflows.
For example, about a week ago I (an involuntary testing army of one!) had a day discombobulated by a series of failures in Google Apps. Now today Gmail is down (at least for me), and has been for over an hour. I was in the middle of sending an email when it quit, going to this screen:

which times out, then goes to a "waiting to retry" message, then occasionally goes to this:

and then back again to waiting. Waiting but not working.
It's important to note that Google Reader (my current favorite RSS reader) and Google Groups (likewise) have both been working all morning, at least for me.
Chronic Partial Failures are Typical
I have no idea how widespread either today's Gmail outages or last week's rolling app outages are or were. Even worse, I don't think our industry even has a good way to measure this phenomenon.
It's really a similar problem to the one in the power utility industry. Everybody pays attention to the widespread outages - for example, last winter my family, along with more than 500,000 of our closest friends, was without power for four days after an ice storm ... that was a startling, yet beautifully surreal experience in itself ... perhaps a post for another day! - but much less is said about the far more common partial failures.
These partial failures affect some customers for part of the time, perhaps for one operation that didn't work out, or perhaps stretching out for hours or even days. Unfortunately, I think this type of partial failure is typical in any type of scaled-out system.
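To make the measurement problem a bit more concrete, here is a minimal sketch of the kind of numbers I would like to see. It assumes a hypothetical log of (user, succeeded) pairs for some time window; the function name, the log format, and the three metrics are my own illustration, not anything Google or anyone else publishes.

```python
from collections import defaultdict


def partial_failure_summary(events):
    """Summarize partial failures for one time window.

    `events` is a hypothetical iterable of (user_id, succeeded) pairs,
    one pair per attempted operation.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for user_id, succeeded in events:
        attempts[user_id] += 1
        if not succeeded:
            failures[user_id] += 1

    if not attempts:
        raise ValueError("no events in this window")

    total_attempts = sum(attempts.values())
    total_failures = sum(failures.values())
    affected_users = [u for u in attempts if failures.get(u, 0) > 0]

    return {
        # The usual headline number: fraction of all operations that worked.
        "request_success_rate": (total_attempts - total_failures) / total_attempts,
        # Fraction of users who saw at least one failure.
        "users_affected_fraction": len(affected_users) / len(attempts),
        # Failure rate for the single worst-off user.
        "worst_user_failure_rate": max(
            failures.get(u, 0) / attempts[u] for u in attempts
        ),
    }


# Ten users, ten operations each; one user fails every single time.
sample = [(f"user{i}", i != 0) for i in range(10) for _ in range(10)]
summary = partial_failure_summary(sample)
```

In this made-up sample the overall success rate is 90%, which sounds tolerable, yet one user in ten is completely down. That gap between the aggregate number and the individual experience is exactly what a single uptime figure hides.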
It's Not OK
All too often these sorts of failures (ESPECIALLY the partial ones) are waved off with cavalier comments of "typical", or "what can you expect for free", or some other such garbage.
That might have been OK when this was all new, and everybody was just thinking about how cool it was to have an (actually) usable service out there in the cloud, and wasn't this all just great.
That was yesterday and this is today. It really isn't OK for SaaS services to work sometimes and not at other times ... even the free ones.
And what about the truly enterprise applications?
What Can We Do?
This is precisely why we have been working relentlessly for the past six years to create a simple computing world that simply works. Ensuring reliability and scale at the architectural level (without requiring developers and operations folks to do a bunch of stuff each time) is absolutely essential to raising the bar on what we can all expect from scaled-out systems.
Do we need better metrics? Yes. Do we need SLAs with teeth? Absolutely. But we need more, much more - we need to deploy true fabric-like architectures, especially those suitable for the enterprise, and we need them now.
Quick update: in the time that it took me to write this post Gmail is back up. At one level that's good, but at another it's not - not if the problem is ignored as if it doesn't exist.
Update #2 - three hours later - broken again! (this time harder) Groups and Reader still working mostly OK, though some transitory weirdness in Reader.