Here's the thing, we need a new kind of data store, a new kind of SQL, something that does for storing and querying large amounts of data what SQL did for normalized data.
Update: Follow-up here.
Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.
Here are some of the examples I've seen recently, most of which was brought together by my attendance at ETech:
I'm detecting a pattern here.
Some common themes are emerging. If you want to scale to the petabyte level, or the billion requests a day, you need to be:
Those constraints represent something fundamentally different from a relational database.
If we build something like that who would use it? There are only so many Googles and eBays in the world, right?.
Thomas Watson: I think there is a world market for maybe five computers.
We are barely scratching the surface of data today. More and more people are going to be going through this, which is what I've been pointing out lately, for example, when I asked how many motors and things with IP addresses you had in your house. We're already averaging 12 IPs per household. When we counted motors we got around 80 to 100. There are going to be more and more devices with IP addresses in your house, and many of those devices will actually be generating data, continuously.
Here is just one simple example. How often is your electrical usage measured? Once a month? What if it was measured every second? Combine that the Jeff Jonas anaylysis/anonymization work and Jeff Hawkin's Hierarchical Temporal Memory and do you have something that might be able to predict energy consumption? Maybe pick out some abberations on the grid? Preemptively offer you discounts for peak periods?
I don't know what's possible, or even useful, when it comes to monitoring and distributing electricity. The only thing I do know is that there's a lot of potential for disruption here.
But let's not get sidetracked by the example, this isn't a call for green tech, or an intelligent grid. This is a call to think about the data collection and processing farms that are going to have to be built, and the kind of data store required, to use all that data.
There are obvious privacy, security, and social issues of all this data collection and this has to be thought through before building these systems. For just one example, think about the publicity when Al Gore's utility bill was made public. That was just a simple set of monthly measurements of his gas and electricity usage.
So what about the poor developers that need to develop on top this rather strange platform, how will they fare? From Bigtable: A Distributed Storage System for Structured Data
Given the unusual interface to Bigtable, an interest- ing question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demon- strates that our design works well in practice.
The only difference between today and two years ago when Adam Bosworth gave his talk Database Requirements in the Age of Scalable Services is that there's a lot more public knowledge about what the likes of Google and eBay are doing.
"Think of the developers!"
To heck with 'em. I just want someone to figure out how to manage my own personal terabyte's worth: backup driven development.
Pilgrim had a much better rant a while back about backing up. It's hard to be sure you have everything backed up. Or worse you end up with duplicated files on each computer (gee thanks, itunes).
I really am no more than 6 months away from needing a personal datacenter. I've been putting it off for about 2 years. And I don't even take pictures of flowers; it's ridiculous. Sometimes I think I was better off with photonegatives in shoeboxes.
Posted by Bill de hÓra on 2007-04-05
"How about RDF and Sparql?"
Darren, you're right. RDF can be partitioned N ways to Sunday; it's massively scalable. The naive rdbms design for RDF happens to be a SmallTable - 3 columns and N rows, divisible at any point in the rowset. BigTable as I understand it, might be more usable tho'.
Since I wrote that rant you quoted, I've seen two problems, that have given me pause:
Posted by Bill de hÓra on 2007-04-05
Posted by Mark Baker on 2007-04-05
Posted by Pete Kirkham on 2007-04-05
Posted by Paul Sandoz on 2007-04-06
Posted by Winter on 2007-04-06
Posted by Mark Baker on 2007-04-06
Posted by Matt Jaynes on 2007-04-07
Posted by Bryan Thompson on 2007-04-07
" Millions of triples in an RDF-backed stores is still several orders of magnitude away from the requirements. Since RDF normalises everything to triples, you need many more triples than you do rows in a DB, so the stress test case is thousand times smaller than a giga-store"
Pete, my experience has been one order of magnitude (10^2) blowup when storing in RDF instead of domain models a la RDBMS, and it's the rule of thumb I tell others. The thing is, RDF data is probably easier to partition than RDBMS backed data, so the question is whether 10^2 is a real deterrent relative to storage costs and latency of queries/joins. And technically I suspect RDF data can be partitioned dynamically, like the way we move compute jobs around today.
I doubt it's as easy to split something like the canonical user or items tables in a Web2.0 app.
Posted by Bill de hÓra on 2007-04-08
Real rdf that describes real world things is going to end up organized as nodes that look like hubs - the hubs will have lots of connections hanging off them, and their structure will be repetitive, if most of the data is of the same kind. There are likely to be plenty of bnode hubs, as well.
Each hub looks a lot like a row in a table. Yes, you could partition the triples differently, but for a given application, you probably won't.
In this way of looking at things, an rdf graph is essentially a highly normalized relational database. Now we know that to get real-world performance, you usually have to denormalize a database. The implemented schema may not resemble the conceptual one very much at all.
Where am I going with this? The huge databases being discussed here are presumably highly denormalized to get performance. I don't see any reason why you couldn't have a denormalized implementation for an rdf dataset. You design these implementations with your specific computing needs in mind, and they are no longer general-purpose systems.
IOW, the use of huge, denormalized, table implementations can be orthogonal to the question of whether to "use rdf", if that's what you want. But you probably can't have them and also have general-purpose data stores at the same time. Just like it always has been.
Posted by Thomas B Passin on 2007-04-08
Posted by Chimezie on 2007-04-09
Posted by bryan thompson on 2007-04-09
Posted by Dan Creswell on 2007-04-10
Yes, the ITConversations link I posted predates that ACM article by six months.
Posted by Joe on 2007-04-10
2007-04-05
How about RDF and SPARQL? Doesn't "Distributed, Joinless, De-Normalized, and Transcationless" pretty much describe RDF + SPARQL + distributed data sources from around the web? There are implementations capable of handling millions of triples (see http://www.ldodds.com/blog/archives/000216.html, http://morenews.blogspot.com/2004/02/kowari-101.html, or http://www.lassila.org/blog/archive/2006/03/bigger_datasets.html, for example).
Way back in the day, Bill de hÓra said:
Posted by Darren Chamberlain on 2007-04-05