Google’s MapReduce and database dinosaurs

I recently attended the Google Conference on Scalability in Redmond, Washington, where I got a chance to sit in on several sessions that discussed creating distributed applications at 'web scale'.

There were two great sessions on MapReduce by Jeff Dean and Barry Brumitt from Google. The general purpose of MapReduce is to parallelize an operation (Map) across lots of distributed data and computers and then to aggregate the results (Reduce). The use of this algorithm is so pervasive at Google that they have created a general-purpose MapReduce infrastructure that allows them to use it for hundreds of applications. Everything from the creation of keyword search indexes of web content to the listing of every misspelling permutation of Britney Spears's name (which is ultimately used for the auto-complete feature in the Google toolbar) is created using MapReduce.
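To make the Map and Reduce steps concrete, here is a minimal single-machine sketch in Python that counts how often each spelling of a word appears across a handful of queries. It is only an illustration of the idea, not Google's infrastructure: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample queries are my own assumptions, and everything runs sequentially where a real system would spread the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def map_reduce(documents):
    # Hypothetical driver: in a cluster each document (or chunk of data)
    # would be mapped on a different machine; here we simply loop.
    emitted = []
    for document in documents:
        emitted.extend(map_phase(document))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up sample queries, for illustration only.
    queries = ["britney spears", "brittany spears", "britney speers"]
    print(map_reduce(queries))
```

Because each `map_phase` call looks at only one document and each `reduce_phase` call looks at only one key, both steps can be farmed out independently, which is what makes the pattern attractive for very large data sets.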

Another presentation that made a lasting impression on me was the one by Werner Vogels, the CTO of Amazon, and Swami Sivasubramanian. It focused on the challenges of building Dynamo, an infinitely scalable data store that Amazon uses. Werner started the presentation with a fairly controversial statement, 'Databases are Dinosaurs', and went on to explain that database technology, which really hasn't changed much in the last ~30 years, coupled with state management, availability and consistency, were the dominant factors limiting the scaling of any large system.

That really got me thinking about the number of customer examples I have seen lately where large traditional RDBMSs just can't query, correlate or ingest information fast enough to support the needs of the business. For the ingest problems, it often comes down to doing data transformations (ETL), which the fabric handles very well almost out of the box. For the query and correlate problems, I find it interesting that a lot of customers might find it acceptable to shorten the execution time of a process from days to hours by parallelizing the work with a fabric and some traditional RDBMS solution. But wouldn't it be interesting to think about moving massive amounts of data into the data fabric (Fabric Accessible Memory + Microsoft LINQ), taking an overnight process and completing it in near real time? I look forward to describing these solutions as they are completed…

We have been called the Googlization for the rest of us. Are you ready for near real time?

Until next time...

Mark

 
