Scaling Digg … Shards and the DB

Nice article on how Digg has handled their aggressive growth to date - well worth a read. Digg is a handy site, and I'm definitely glad that they've been scaling fine so far.

Having said that, it's pretty clear that their current data layer is definitely not simple, to say the least ... and is on the verge of becoming unmanageably complex, with all of the headaches and fragility that implies. In particular,

Digg’s current architecture includes about 20 database servers, 30 Web servers, and a few search servers running Lucene; the balance operate as backup servers. All but one of the database servers run some version of MySQL 5. The transaction-heavy servers as well as the backup units use the InnoDB database engine, while the OLAP ones use MyISAM.

While the number of servers is not too bad, their variety and interdependent nature conjure up visions of many late-night admin sessions, planned outages, and too many (well, really one is too many!) beeper-induced, life-crunching unplanned outages. If you can keep those ravenous beasts generally in the cage the admin load might be ok ... maybe. Let that one bubble for a bit ...

In the meantime, let's take a couple of steps back and consider what can be done to fundamentally scale apps like Digg without stressing an increasingly fragile db infrastructure.

As Frank Sommers discusses in this post and Robert McIntosh contends here, much can be done for apps like Digg without even thinking of the database. While their comments start along the right path (even though I don't agree with all of their observations), they necessarily pull up short of what's possible. You may be wondering why I say "necessarily pull up short" ... and that's definitely a fair question.

"Necessarily pull up short" because they're not actually interested in scale? "Necessarily pull up short" because they actually have deep-seated db fixations? Not at all!

The reason that each of these guys pulls up a bit short is that they're working up from the infrastructure layer, which limits the data and application abstractions with which they can work. Not their fault, actually; we're just in need of more comprehensive data and application abstractions that can handle these types of scaling-friendly implementation structures without crushing the developer.

[Image: The Shards of Narsil]

Back to the Digg article. One of the more interesting items mentioned is the increasing use of db shards at Digg. A shard is a kind of cool sounding name for a conceptually simple idea - shards are created by separating a db into separate pieces, generally along the natural fissures of the data. Sometimes those separate pieces will be put into separate db instances, sometimes on separate machines, sometimes in separate facilities, and sometimes in some combination of all of the above. In general, creating shards is a step down the path of increasing the orthogonality of the stored data, which is a good thing since with independent data elements you have more of a chance for creating useful scale.
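To make the idea a bit more concrete, here is a minimal sketch of one way to split data along a fissure, assuming the fissure is a user id and that rows are assigned to shards by hashing that id. Nothing here comes from the Digg article: the `ShardRouter` class, the shard names, and the in-memory dicts standing in for separate db instances are all hypothetical.

```python
import hashlib


class ShardRouter:
    """Maps a key to one of a fixed list of shards (hypothetical helper)."""

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)
        # Stand-in for real db connections: one in-memory dict per shard.
        self.shards = {name: {} for name in self.shard_names}

    def shard_for(self, key):
        # Hash the key so the same key always lands on the same shard.
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shard_names)
        return self.shard_names[index]

    def put(self, key, row):
        self.shards[self.shard_for(key)][key] = row

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


# Assumed shard names, for illustration only.
router = ShardRouter(["users_0", "users_1", "users_2"])
router.put(42, {"name": "alice"})
print(router.shard_for(42), router.get(42))
```

Note that with a simple modulo scheme like this one, changing the number of shards remaps most keys, which is one of the operational headaches that tends to come along with manual sharding.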

At what cost, you may wonder? Well, as you formalize these fissures you tend to lose some of the nice operations that relational dbs have always provided. Like joins, for example. The thought of using a relational db without joins can give even the most confident app developer pause for thought ... kind of like trying to use the shards of this sword to defend yourself when it really matters. In addition, manual sharding tends to increase application complexity (more for the developer to do). In order to mitigate some of this complexity folks are starting to develop projects to straddle the fence - to give the developer the option of utilizing shards for storing objects while still implementing some of the operations (as often as possible) that would otherwise be lost with purely manual shards. A great example is the Hibernate Shards project.
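As a rough illustration of what losing joins means for the developer, here is a small sketch in which stories and users live on different shards, so the join that a single db would do for you has to be done by hand in the application. The data, the shard layout (users placed by `user_id % 2`), and the function names are all made up for the example; this is not how Digg or Hibernate Shards does it.

```python
# Hypothetical user shards; a user lives on shard (user_id % number of shards).
user_shards = [
    {2: "bob", 4: "dave"},     # shard 0: even user ids
    {1: "alice", 3: "carol"},  # shard 1: odd user ids
]

# Hypothetical story shards, split independently of the user shards.
story_shards = [
    [
        {"title": "Scaling MySQL", "user_id": 1},
        {"title": "Lucene tips", "user_id": 2},
    ],
    [
        {"title": "InnoDB vs MyISAM", "user_id": 3},
    ],
]


def find_user(user_id):
    """Look up a user name on whichever shard owns that user id."""
    shard = user_shards[user_id % len(user_shards)]
    return shard.get(user_id)


def stories_with_authors():
    """Application-side join: visit every story shard, then fetch each author."""
    rows = []
    for shard in story_shards:
        for story in shard:
            rows.append((story["title"], find_user(story["user_id"])))
    return sorted(rows)


for title, author in stories_with_authors():
    print(title, "by", author)
```

What would have been one join on a single db becomes a loop over every story shard plus one lookup per story, and that is before worrying about a shard being down or slow.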

So it is clear that a richer data layer abstraction is both necessary and in progress: one in which shards, caching of various forms, and other non-db optimizations for scaling the data layer can be used transparently (i.e. without added complexity). Even so, it will not be enough.

Why not?

These increased abstractions are not enough for two reasons. First, these increased abstractions generally do not solve the underlying operations issues. In fact, they may actually make them worse since app developers will start fragmenting data right and left, or maybe the need for scale may just force the fragmentation (think of the Digg example). Second, these increased abstractions ignore other areas of application complexity entirely, such as scaling and interaction of multiple threads, interoperation communication, process-level reliability, and so forth.

What is needed is a commensurate increase in the application abstraction, one which enables both the app developer and the operations folks to see an arbitrary collection of computers as one. One simple computing substrate that is both arbitrarily scalable and highly reliable.

That is the definition of an application fabric. And it is a perfect match for the good stuff that's happening in the data layer.

In upcoming posts I'll elaborate on what an application fabric is, how it aggressively reduces app complexity, and how that enables app architects and developers to make the most of advances in the data layer.



I'd like to build a scalable web app and I'm struggling with the 'scalability' part. So far, I haven't found a solution that is not complex. So I'd like to read about your 'application fabric'.

James, that's great ... that's just the sort of stuff fabrics were made for. Either send me some contact info (my address is bob at the domain appistry.com) and I'll make sure that you get the info you need, or sign up for one of the whitepapers on the website and indicate this area of interest.

Keep me posted (either directly or in the comments) on how your investigation goes.

Thx.

[...] Scaling Digg… Shards and DB: Another discussion on Digg here. [...]
