ETech '07 Summary - Part 2 - MegaData
Here's the thing, we need a new kind of data store, a new kind of SQL, something that does for storing and querying large amounts of data what SQL did for normalized data.
Update: Follow-up here.
Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.
Examples in the wild
Here are some of the examples I've seen recently, most of which was brought together by my attendance at ETech:
- Jeff Jonas
- The work Jeff does is pretty amazing and you should read this paper to get an idea of the kinds of analysis he's doing. One thing you don't see in that paper, but I did see on one of his slides at ETech, was the limits you need to put on yourself when storing a billion rows in a database, and they included: no joins, no transactions, no stored procedures, and no triggers.
- Joshua Schacter
- Joshua has similar suggestions from his experience building del.icio.us: no joins, no transactions, no autoincrement.
- BigTable
- I have talked previously about BigTable, Google's column-based store with no transactions. Yahoo! has their own technology similar to BigTable.
- eBay
- Slides from Randy Shoup and Dan Pritchett show how eBay runs with referential integrity, joins, and sorting moved out of the database and into the application.
I'm detecting a pattern here.
Common Themes
Some common themes are emerging. If you want to scale to the petabyte level, or the billion requests a day, you need to be:
- Distributed
- The data has to be distributed across multiple machines.
- Joinless
- No joins, and no referential integrity, at least at the data store level.
- De-Normalized
- No one said this explicily, but I presume there is a lot of de-normalization going on if you are avoiding joins.
- Transcationless
- No transactions
Those constraints represent something fundamentally different from a relational database.
Who would buy that?
If we build something like that who would use it? There are only so many Googles and eBays in the world, right?.
Thomas Watson: I think there is a world market for maybe five computers.
We are barely scratching the surface of data today. More and more people are going to be going through this, which is what I've been pointing out lately, for example, when I asked how many motors and things with IP addresses you had in your house. We're already averaging 12 IPs per household. When we counted motors we got around 80 to 100. There are going to be more and more devices with IP addresses in your house, and many of those devices will actually be generating data, continuously.
Here is just one simple example. How often is your electrical usage measured? Once a month? What if it was measured every second? Combine that the Jeff Jonas anaylysis/anonymization work and Jeff Hawkin's Hierarchical Temporal Memory and do you have something that might be able to predict energy consumption? Maybe pick out some abberations on the grid? Preemptively offer you discounts for peak periods?
I don't know what's possible, or even useful, when it comes to monitoring and distributing electricity. The only thing I do know is that there's a lot of potential for disruption here.
But let's not get sidetracked by the example, this isn't a call for green tech, or an intelligent grid. This is a call to think about the data collection and processing farms that are going to have to be built, and the kind of data store required, to use all that data.
There are obvious privacy, security, and social issues of all this data collection and this has to be thought through before building these systems. For just one example, think about the publicity when Al Gore's utility bill was made public. That was just a simple set of monthly measurements of his gas and electricity usage.
Think of the developers!
So what about the poor developers that need to develop on top this rather strange platform, how will they fare? From Bigtable: A Distributed Storage System for Structured Data
Given the unusual interface to Bigtable, an interest- ing question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demon- strates that our design works well in practice.
The only difference between today and two years ago when Adam Bosworth gave his talk Database Requirements in the Age of Scalable Services is that there's a lot more public knowledge about what the likes of Google and eBay are doing.
[ 2007-04-11 ]
"Think of the developers!"
To heck with 'em. I just want someone to figure out how to manage my own personal terabyte's worth: backup driven development.
Pilgrim had a much better rant a while back about backing up. It's hard to be sure you have everything backed up. Or worse you end up with duplicated files on each computer (gee thanks, itunes).
I really am no more than 6 months away from needing a personal datacenter. I've been putting it off for about 2 years. And I don't even take pictures of flowers; it's ridiculous. Sometimes I think I was better off with photonegatives in shoeboxes.
Posted by Bill de hÓra on 2007-04-05
"How about RDF and Sparql?"
Darren, you're right. RDF can be partitioned N ways to Sunday; it's massively scalable. The naive rdbms design for RDF happens to be a SmallTable - 3 columns and N rows, divisible at any point in the rowset. BigTable as I understand it, might be more usable tho'.
Since I wrote that rant you quoted, I've seen two problems, that have given me pause:
- domain models. developers can't work easily with arbitrary xml, so arbitrary graphs are right out. It's much easier if everything is a row in a table or an object. I think this is part of the reason that rails and django are so productive; they're highly optimised for domain models. Raw RDF doesn't really do domains like that; you have to expend effort distilling triples into 'things'; witness how much angst WS marshalling causes; very few developers will give up on objects
- decentralised search. I think whoever figures out peered search will change everything. Right now it's all about downloading the web into a datacenter and distributing the index across a fast WAN. The Sparql people seem to be thinking along the same lines - everything will be behind some uber-endpoint a la Google. Doug Cutting (nutch, lucene) once said the reason decentralised search doesn't work is latency. That's the key; figure out a way to get people not to care about latency (eg by providing higher precision), and you'll disrupt search.
Posted by Bill de hÓra on 2007-04-05
Posted by Mark Baker on 2007-04-05
Posted by Pete Kirkham on 2007-04-05
Posted by Paul Sandoz on 2007-04-06
From the architectures described by Google and eBay, this is clearly the case at the data store level, but in fact joins and transactions still exist conceptually at what is now considered the application layer. Transactions are accomplished via work flow and state machines and data joined by stitching data together in memory.
Joins and transactions could be lifted out of the vertically-scalable data store and into some new horizontally-scalable "join/transaction" layer in a general way, but it would involve a whole new set of concepts for the application programmer to consider.
The proposed join/transaction layer would need to stitch information together using the some of the techniques described as "Level 5 Routing" by Phil Windley a few years ago:
Level 5 Routing for Web Services
Interestingly, the kind of routers listed there map pretty closely to the principal motivators behind AOP... I'll bet the metadata used to describe the joins and transactions at this layer would look a lot like aspects, though perhaps not with that name applied.
Posted by Winter on 2007-04-06
Posted by Mark Baker on 2007-04-06
Would love to hear feedback on it :)
Posted by Matt Jaynes on 2007-04-07
Posted by Bryan Thompson on 2007-04-07
" Millions of triples in an RDF-backed stores is still several orders of magnitude away from the requirements. Since RDF normalises everything to triples, you need many more triples than you do rows in a DB, so the stress test case is thousand times smaller than a giga-store"
Pete, my experience has been one order of magnitude (10^2) blowup when storing in RDF instead of domain models a la RDBMS, and it's the rule of thumb I tell others. The thing is, RDF data is probably easier to partition than RDBMS backed data, so the question is whether 10^2 is a real deterrent relative to storage costs and latency of queries/joins. And technically I suspect RDF data can be partitioned dynamically, like the way we move compute jobs around today.
I doubt it's as easy to split something like the canonical user or items tables in a Web2.0 app.
Posted by Bill de hÓra on 2007-04-08
Real rdf that describes real world things is going to end up organized as nodes that look like hubs - the hubs will have lots of connections hanging off them, and their structure will be repetitive, if most of the data is of the same kind. There are likely to be plenty of bnode hubs, as well.
Each hub looks a lot like a row in a table. Yes, you could partition the triples differently, but for a given application, you probably won't.
In this way of looking at things, an rdf graph is essentially a highly normalized relational database. Now we know that to get real-world performance, you usually have to denormalize a database. The implemented schema may not resemble the conceptual one very much at all.
Where am I going with this? The huge databases being discussed here are presumably highly denormalized to get performance. I don't see any reason why you couldn't have a denormalized implementation for an rdf dataset. You design these implementations with your specific computing needs in mind, and they are no longer general-purpose systems.
IOW, the use of huge, denormalized, table implementations can be orthogonal to the question of whether to "use rdf", if that's what you want. But you probably can't have them and also have general-purpose data stores at the same time. Just like it always has been.
Posted by Thomas B Passin on 2007-04-08
Posted by Chimezie on 2007-04-09
Posted by bryan thompson on 2007-04-09
Posted by Dan Creswell on 2007-04-10
Yes, the ITConversations link I posted predates that ACM article by six months.
Posted by Joe on 2007-04-10
Sorry, comments on this entry are closed. Thanks for participating.
How about RDF and SPARQL? Doesn't "Distributed, Joinless, De-Normalized, and Transcationless" pretty much describe RDF + SPARQL + distributed data sources from around the web? There are implementations capable of handling millions of triples (see http://www.ldodds.com/blog/archives/000216.html, http://morenews.blogspot.com/2004/02/kowari-101.html, or http://www.lassila.org/blog/archive/2006/03/bigger_datasets.html, for example).
Way back in the day, Bill de hÓra said:
Posted by Darren Chamberlain on 2007-04-05