ETech '07 Summary - Part 2 - MegaData

Here's the thing, we need a new kind of data store, a new kind of SQL, something that does for storing and querying large amounts of data what SQL did for normalized data.

Update: Follow-up here.

Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.

Examples in the wild

Here are some of the examples I've seen recently, most of which was brought together by my attendance at ETech:

Jeff Jonas
The work Jeff does is pretty amazing and you should read this paper to get an idea of the kinds of analysis he's doing. One thing you don't see in that paper, but I did see on one of his slides at ETech, was the limits you need to put on yourself when storing a billion rows in a database, and they included: no joins, no transactions, no stored procedures, and no triggers.
Joshua Schacter
Joshua has similar suggestions from his experience building no joins, no transactions, no autoincrement.
I have talked previously about BigTable, Google's column-based store with no transactions. Yahoo! has their own technology similar to BigTable.
Slides from Randy Shoup and Dan Pritchett show how eBay runs with referential integrity, joins, and sorting moved out of the database and into the application.

I'm detecting a pattern here.

Common Themes

Some common themes are emerging. If you want to scale to the petabyte level, or the billion requests a day, you need to be:

The data has to be distributed across multiple machines.
No joins, and no referential integrity, at least at the data store level.
No one said this explicily, but I presume there is a lot of de-normalization going on if you are avoiding joins.
No transactions

Those constraints represent something fundamentally different from a relational database.

Who would buy that?

If we build something like that who would use it? There are only so many Googles and eBays in the world, right?.

Thomas Watson: I think there is a world market for maybe five computers.

We are barely scratching the surface of data today. More and more people are going to be going through this, which is what I've been pointing out lately, for example, when I asked how many motors and things with IP addresses you had in your house. We're already averaging 12 IPs per household. When we counted motors we got around 80 to 100. There are going to be more and more devices with IP addresses in your house, and many of those devices will actually be generating data, continuously.

Here is just one simple example. How often is your electrical usage measured? Once a month? What if it was measured every second? Combine that the Jeff Jonas anaylysis/anonymization work and Jeff Hawkin's Hierarchical Temporal Memory and do you have something that might be able to predict energy consumption? Maybe pick out some abberations on the grid? Preemptively offer you discounts for peak periods?

I don't know what's possible, or even useful, when it comes to monitoring and distributing electricity. The only thing I do know is that there's a lot of potential for disruption here.

But let's not get sidetracked by the example, this isn't a call for green tech, or an intelligent grid. This is a call to think about the data collection and processing farms that are going to have to be built, and the kind of data store required, to use all that data.

There are obvious privacy, security, and social issues of all this data collection and this has to be thought through before building these systems. For just one example, think about the publicity when Al Gore's utility bill was made public. That was just a simple set of monthly measurements of his gas and electricity usage.

Think of the developers!

So what about the poor developers that need to develop on top this rather strange platform, how will they fare? From Bigtable: A Distributed Storage System for Structured Data

Given the unusual interface to Bigtable, an interest- ing question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demon- strates that our design works well in practice.

The only difference between today and two years ago when Adam Bosworth gave his talk Database Requirements in the Age of Scalable Services is that there's a lot more public knowledge about what the likes of Google and eBay are doing.

...we need a new kind of data store, a new kind of SQL, something that does for storing andquerying large amounts of data what SQL did for normalized data

How about RDF and SPARQL? Doesn't "Distributed, Joinless, De-Normalized, and Transcationless" pretty much describe RDF + SPARQL + distributed data sources from around the web? There are implementations capable of handling millions of triples (see,, or, for example).

Way back in the day, Bill de hÓra said:

I'm telling myself I'm using RDF+XML because I want to be able to pull data in from anywhere. That's true, but to be brutally honest I can't be bothered designing and maintaining yet another relational schema for yet another webapp - doing so is starting to make as much sense as designing my own filesystem or TP monitor. Life's too short, too short to be working on technology that can only possibly make sense when you're in dressed in combats and vans listening to Pearljam pretending it's still the nineties... there's a real wish to conduct oneself at a higher level of abstraction before complete dementia sets in. What's the point in designing tables for a webapp when an RDF-backed store will manage the data for you and RDF queries will come back as tabular data anyway?

Posted by Darren Chamberlain on 2007-04-05

"Think of the developers!"

To heck with 'em. I just want someone to figure out how to manage my own personal terabyte's worth: backup driven development.

Pilgrim had a much better rant a while back about backing up. It's hard to be sure you have everything backed up. Or worse you end up with duplicated files on each computer (gee thanks, itunes).

I really am no more than 6 months away from needing a personal datacenter. I've been putting it off for about 2 years. And I don't even take pictures of flowers; it's ridiculous. Sometimes I think I was better off with photonegatives in shoeboxes.

Posted by Bill de hÓra on 2007-04-05

"How about RDF and Sparql?"

Darren, you're right. RDF can be partitioned N ways to Sunday; it's massively scalable. The naive rdbms design for RDF happens to be a SmallTable - 3 columns and N rows, divisible at any point in the rowset. BigTable as I understand it, might be more usable tho'.

Since I wrote that rant you quoted, I've seen two problems, that have given me pause:

Posted by Bill de hÓra on 2007-04-05

Have you read Rohit Khare's dissertation? He uses the same electrical grid example.

Posted by Mark Baker on 2007-04-05

Millions of triples in an RDF-backed stores is still several orders of magnitude away from the requirements. Since RDF normalises everything to triples, you need many more triples than you do rows in a DB, so the stress test case is thousand times smaller than a giga-store. Not that you can't put an RDF front end on it, but RDF isn't itself a store technology, and a community that considers 10 million data points a stress test isn't likely to produce a giga-store any time soon. The problem with triples is you then have to do joins to create objects out of them (such as an address), and joins cost, especially when distributed. Even with everything in memory, the RDF stores linked to above seem slow (one would expect a simple query on a in-memory 10 million triple store not to take more than 100ms on modern hardware - use 32 bit hashs and GPU mask operations, and write the hashed form to a binary dump at disk access speed). The quoted store takes '15 minutes to load about 150MB' - it takes 5s to read 150MB from disk on my laptop into memory; any representation that adds a couple of orders of magnitude to IO isn't going to scale past physical memory limits. rdfs:subClassOf either turns a simple query into a recursive one, or adds many extra triples to the store. RDF isn't the answer to how to implement a giga-store; it just begs the question, and adds recursive joins to the requirements. (I've no argument against the niceness of a flexible schema, so I'd like something like RDF to work, but RDF doesn't seem to be heading in that direction very quickly) Pete

Posted by Pete Kirkham on 2007-04-05

Mark, unfortunately (at least for me) dereferencing the URL results in a 403 response. But the following, formatted for duplex printing, is (all 30 Mb of it!): The following related papers are also accessible (and more modest in size):

Posted by Paul Sandoz on 2007-04-06

"Joinless, De-Normalized, Transcationless"

From the architectures described by Google and eBay, this is clearly the case at the data store level, but in fact joins and transactions still exist conceptually at what is now considered the application layer. Transactions are accomplished via work flow and state machines and data joined by stitching data together in memory.

Joins and transactions could be lifted out of the vertically-scalable data store and into some new horizontally-scalable "join/transaction" layer in a general way, but it would involve a whole new set of concepts for the application programmer to consider.

The proposed join/transaction layer would need to stitch information together using the some of the techniques described as "Level 5 Routing" by Phil Windley a few years ago:

Level 5 Routing for Web Services

Interestingly, the kind of routers listed there map pretty closely to the principal motivators behind AOP... I'll bet the metadata used to describe the joins and transactions at this layer would look a lot like aspects, though perhaps not with that name applied.

Posted by Winter on 2007-04-06

Oops. That link to Rohit's dissertation should be working now.

Posted by Mark Baker on 2007-04-06

I've been working on a framework to solve problems like this using Amazon Web Services (EC2/S3) and SQLite. It's theoretically infinitely scalable and I just described its basic architecture in this post:

Would love to hear feedback on it :)

Posted by Matt Jaynes on 2007-04-07

Joe, We are planning to release a beta that will let people deploy bigtable style databases (scale-out indices) in about a month. The project site is:

The license will be the same as BerkeleyDB or MySQL -- GPL with a FOSS (Free and Open Source) exclusion.

There is nothing there right now -- I'm planning to import the code concurrent with the first beta release. -bryan

Posted by Bryan Thompson on 2007-04-07

" Millions of triples in an RDF-backed stores is still several orders of magnitude away from the requirements. Since RDF normalises everything to triples, you need many more triples than you do rows in a DB, so the stress test case is thousand times smaller than a giga-store"

Pete, my experience has been one order of magnitude (10^2) blowup when storing in RDF instead of domain models a la RDBMS, and it's the rule of thumb I tell others. The thing is, RDF data is probably easier to partition than RDBMS backed data, so the question is whether 10^2 is a real deterrent relative to storage costs and latency of queries/joins. And technically I suspect RDF data can be partitioned dynamically, like the way we move compute jobs around today.

I doubt it's as easy to split something like the canonical user or items tables in a Web2.0 app.

Posted by Bill de hÓra on 2007-04-08

Real rdf that describes real world things is going to end up organized as nodes that look like hubs - the hubs will have lots of connections hanging off them, and their structure will be repetitive, if most of the data is of the same kind. There are likely to be plenty of bnode hubs, as well.

Each hub looks a lot like a row in a table. Yes, you could partition the triples differently, but for a given application, you probably won't.

In this way of looking at things, an rdf graph is essentially a highly normalized relational database. Now we know that to get real-world performance, you usually have to denormalize a database. The implemented schema may not resemble the conceptual one very much at all.

Where am I going with this? The huge databases being discussed here are presumably highly denormalized to get performance. I don't see any reason why you couldn't have a denormalized implementation for an rdf dataset. You design these implementations with your specific computing needs in mind, and they are no longer general-purpose systems.

IOW, the use of huge, denormalized, table implementations can be orthogonal to the question of whether to "use rdf", if that's what you want. But you probably can't have them and also have general-purpose data stores at the same time. Just like it always has been.

Posted by Thomas B Passin on 2007-04-08

Thomas. WRT: "I don't see any reason why you couldn't have a denormalized implementation for an rdf dataset.". ARC is an RDF database that allows just that. You can partition your RDF space such that your application benefits. I"ve done something similar with my relational model by breaking out 'is-a' relations.

Posted by Chimezie on 2007-04-09

Dynamic partitioning of rdf data works fine. We will be releasing a module for scale-out triple/quad stores over the basic bigdata framework. -bryan

Posted by bryan thompson on 2007-04-09

Bosworth was thinking along these lines a while back: acm queue

Posted by Dan Creswell on 2007-04-10


Yes, the ITConversations link I posted predates that ACM article by six months.

Posted by Joe on 2007-04-10