Joe Gregorio's writings (archives), projects and status updates.
Here's the thing: we need a new kind of data store, a new kind of SQL, something that does for storing and querying large amounts of data what SQL did for normalized data.
Update: Follow-up here.
Sure, you can store a lot of data in a relational database, but when I say large, I mean really large: a billion or more records. I know we need this because I keep seeing people build it.
Here are some of the examples I've seen recently, most of which were brought together by my attendance at ETech:
I'm detecting a pattern here.
Some common themes are emerging. If you want to scale to the petabyte level, or the billion requests a day, you need to be:
Those constraints represent something fundamentally different from a relational database.
If we build something like that, who would use it? There are only so many Googles and eBays in the world, right?
Thomas Watson: I think there is a world market for maybe five computers.
We are barely scratching the surface of data today. More and more people are going to be going through this, which is what I've been pointing out lately, for example, when I asked how many motors and things with IP addresses you had in your house. We're already averaging 12 IPs per household. When we counted motors we got around 80 to 100. There are going to be more and more devices with IP addresses in your house, and many of those devices will actually be generating data, continuously.
Here is just one simple example. How often is your electrical usage measured? Once a month? What if it was measured every second? Combine that with Jeff Jonas's analysis/anonymization work and Jeff Hawkins's Hierarchical Temporal Memory, and do you have something that might be able to predict energy consumption? Maybe pick out some aberrations on the grid? Preemptively offer you discounts for peak periods?
I don't know what's possible, or even useful, when it comes to monitoring and distributing electricity. The only thing I do know is that there's a lot of potential for disruption here.
But let's not get sidetracked by the example; this isn't a call for green tech, or an intelligent grid. This is a call to think about the data collection and processing farms that are going to have to be built, and the kind of data store required, to use all that data.
There are obvious privacy, security, and social issues with all this data collection, and they have to be thought through before building these systems. For just one example, think about the publicity when Al Gore's utility bill was made public. That was just a simple set of monthly measurements of his gas and electricity usage.
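To put a rough number on how quickly that kind of data piles up, here is a back-of-envelope sketch in Python. The function names and the one-reading-per-second rate are assumptions made purely for illustration, not figures from any real utility.

```python
# Back-of-envelope: how fast does per-second metering pile up records?
# Assumption (hypothetical): each household emits one reading per second.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def readings_per_year(households: int, readings_per_second: float = 1.0) -> int:
    """Number of meter readings produced in a 365-day year."""
    return int(households * readings_per_second * SECONDS_PER_DAY * 365)


def households_to_reach(target_records: int, readings_per_second: float = 1.0) -> int:
    """Smallest number of households that produce target_records in one year."""
    per_household = readings_per_year(1, readings_per_second)
    return -(-target_records // per_household)  # ceiling division


print(readings_per_year(1))                # 31536000
print(households_to_reach(1_000_000_000))  # 32
```

Under those assumptions a single household produces about 31.5 million readings a year, so a few dozen households are enough to pass the billion-record mark, compared with 12 readings a year from a monthly meter read.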
So what about the poor developers that need to develop on top of this rather strange platform? How will they fare? From Bigtable: A Distributed Storage System for Structured Data:
Given the unusual interface to Bigtable, an interesting question is how difficult it has been for our users to adapt to using it. New users are sometimes uncertain of how to best use the Bigtable interface, particularly if they are accustomed to using relational databases that support general-purpose transactions. Nevertheless, the fact that many Google products successfully use Bigtable demonstrates that our design works well in practice.
The only difference between today and two years ago when Adam Bosworth gave his talk Database Requirements in the Age of Scalable Services is that there's a lot more public knowledge about what the likes of Google and eBay are doing.