Joe Gregorio's writings (archives), projects and status updates.
I posted a couple days ago about megadata data stores.
As an aside, I intentionally chose a really awful name, "megadata", since I know I'm bad at naming. Come on, I named one piece of software the Italian word for trash. I was hoping that someone would come up with a better name for me. Unfortunately, it appears that 'megadata' is sticking. I'm sorry.
Anyway, I got a lot of good comments, but for some reason the discussion veered off into RDF, which I don't see as providing a solution.
Let me draw a comparison to REST.
I spend a lot of time explaining REST to folks who are accustomed to using SOAP. One of the ways I explain the benefits of REST is that it 'gives more information to the network'. That is, the method and headers of a message tell if the request is safe, idempotent, or both, if the response is cacheable, etc. The visibility of that information allows intermediaries to act on that message, speeding things up. With SOAP, all you have is POST, and a POST of an 'application/soap+xml' body at that, which provides about as little information to the network as you possibly could with HTTP. That POST of a generic media type gives no indication if the request is safe, or idempotent, nor is there any indication of the cacheability of the response.
This is also the reason why you could never write a WSDL to REST converter: there just isn't enough information present in a WSDL file. There's no information on idempotence, cacheability, links, or etags. Only in the latest version of WSDL (2.0) can you indicate that an operation is safe.
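To make the 'more information to the network' point concrete, here is a rough sketch in Python of what an intermediary can work out from nothing but the method and headers. The function name `visible_to_network` and the `cacheable_hint` key are made up for illustration, and the cache check is a deliberate simplification (it only looks at a few Cache-Control directives), not a full implementation of HTTP caching rules.

```python
# Methods HTTP defines as safe (read-only) and idempotent (repeatable).
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}
IDEMPOTENT_METHODS = SAFE_METHODS | {"PUT", "DELETE"}


def visible_to_network(method, response_headers):
    """Return what an intermediary could infer without reading the body.

    Hypothetical helper: a simplification for illustration only.
    """
    method = method.upper()
    cache_control = response_headers.get("Cache-Control", "").lower()
    # Reduce "public, max-age=60" to the directive names {"public", "max-age"}.
    directives = {
        part.strip().split("=")[0]
        for part in cache_control.split(",")
        if part.strip()
    }
    explicitly_fresh = bool(directives & {"public", "max-age", "s-maxage"})
    return {
        "safe": method in SAFE_METHODS,
        "idempotent": method in IDEMPOTENT_METHODS,
        # Simplified: only GET/HEAD responses with an explicit directive count.
        "cacheable_hint": (
            method in {"GET", "HEAD"}
            and explicitly_fresh
            and "no-store" not in directives
        ),
    }


rest_style = visible_to_network("GET", {"Cache-Control": "public, max-age=60"})
soap_style = visible_to_network("POST", {"Content-Type": "application/soap+xml"})
```

Under these assumptions every entry in `rest_style` is true and every entry in `soap_style` is false, which is the whole complaint: the SOAP request may well be a harmless read, but nothing outside the body says so.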
A similar thing is happening with these megadata stores. When you put together a database schema you break all your data down into normal form: normalized tables all related to each other. What you don't have is a way to indicate which of those relations are the important ones. When you de-normalize, you are making a choice, declaring implicitly which relations are the most important, or at least most performant, because you are denormalizing around those relations. This is the parallel to REST. Just like using more methods and pushing information into the headers gives more information to the network, by denormalizing you are implicitly giving more information to the database, and that 'extra information' makes things run faster.
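Here is a small sketch of that choice, using Python's built-in sqlite3 module. The schema (authors, posts, posts_by_author) and the function names are invented for this example; the point is only that the denormalized table bakes one relation, author to posts, into the storage layout so that the lookup no longer needs a join.

```python
import sqlite3


def build_demo():
    """Create an in-memory database holding both layouts (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Normalized: every relation is equal, none is marked as important.
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author_id INTEGER REFERENCES authors(id),
            title TEXT
        );
        -- Denormalized around the author -> posts relation.
        CREATE TABLE posts_by_author (
            post_id INTEGER PRIMARY KEY,
            author_name TEXT,
            title TEXT
        );
    """)
    db.execute("INSERT INTO authors VALUES (1, 'alice')")
    db.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(1, 1, "first"), (2, 1, "second")],
    )
    # Copy the author's name onto every post row.
    db.execute("""
        INSERT INTO posts_by_author
        SELECT p.id, a.name, p.title
        FROM posts p JOIN authors a ON a.id = p.author_id
    """)
    return db


def titles_normalized(db, name):
    # Needs a join across two tables to answer the question.
    rows = db.execute(
        "SELECT p.title FROM posts p JOIN authors a ON a.id = p.author_id "
        "WHERE a.name = ? ORDER BY p.id",
        (name,),
    )
    return [row[0] for row in rows]


def titles_denormalized(db, name):
    # Same answer from a single table, no join.
    rows = db.execute(
        "SELECT title FROM posts_by_author WHERE author_name = ? ORDER BY post_id",
        (name,),
    )
    return [row[0] for row in rows]
```

Both functions return the same titles for the same author. The difference is that the second layout has quietly told the database which relation matters, at the cost of duplicating the author's name on every row.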
Of course, denormalization is a rather blunt instrument. Is there better information that could be supplied, information that would help a database spread storage and query load across hundreds, or even thousands, of machines?
So now you can see why going in the direction of RDF is the wrong way to move. We don't need new ways to slice data into finer pieces. In fact, we need quite the opposite: a way to remove, or at least give weights to, the relations we are already dividing along.
Still don't believe me that this is important and that people are out there working on this today? Check it out.