Megadata Follow-up

Joe Gregorio

I posted a couple days ago about megadata data stores.

As an aside, I intentionally chose a really awful name, "megadata", since I know I'm bad at naming. Come on, I named one piece of software the Italian word for trash. I was hoping that someone would come up with a better name for me. Unfortunately, it appears that 'megadata' is sticking. I'm sorry.

Anyway, I got a lot of good comments, but for some reason the discussion veered off into RDF, which I don't see as providing a solution.

Let me draw a comparison to REST.

I spend a lot of time explaining REST to folks that are accustomed to using SOAP. One of the ways I explain the benefits of REST is that it 'gives more information to the network'. That is, the method and headers of a message tell you whether the request is safe, idempotent, or both, whether the response is cacheable, etc. The visibility of that information allows intermediaries to act on the message, speeding things up. With SOAP, all you have is POST, and a POST of an 'application/soap+xml' body at that, which provides as little information to the network as you possibly could with HTTP. That POST of a generic media type gives no indication of whether the request is safe or idempotent, nor is there any indication of the cacheability of the response.
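
To make that concrete, here is a minimal sketch of the decision an intermediary gets to make using nothing but the method and the headers. The header values are made-up examples, and a real cache weighs more directives than this.

    SAFE_METHODS = {"GET", "HEAD"}  # safe methods: no side effects

    def may_cache(method, response_headers):
        # An intermediary can answer this without ever looking at,
        # or understanding, the message body.
        if method not in SAFE_METHODS:
            return False
        cache_control = response_headers.get("Cache-Control", "")
        return "no-store" not in cache_control and "private" not in cache_control

    # A RESTful GET: the method and headers advertise cacheability, so a cache can act.
    print(may_cache("GET", {"Cache-Control": "max-age=3600", "ETag": '"abc123"'}))   # True

    # A SOAP call: always a POST of application/soap+xml, so a cache learns nothing.
    print(may_cache("POST", {"Content-Type": "application/soap+xml"}))               # False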

This is also the reason why you could never write a WSDL-to-REST converter: there just isn't enough information present in a WSDL file. There's no information on idempotence, cacheability, links, or ETags. Only in the latest version of WSDL (2.0) can you indicate that an operation is safe.

A similar thing is happening with these megadata stores. When you put together a database schema you break all your data down into normal form: normalized tables, all related to each other. What you don't have is a way to indicate which of those relations are the important ones. When you de-normalize, you are making a choice, implicitly declaring which relations are the most important, or at least the most performance-critical, because you are denormalizing around those relations. This is the parallel to REST: just as using more methods and pushing information into the headers gives more information to the network, denormalizing implicitly gives more information to the database, and that 'extra information' makes things run faster.
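
Here's a toy sketch of that choice, with invented 'tables' and field names: embedding one relation inside another is exactly the act of telling the store which relation matters.

    # Normalized: users and posts kept separately, related by user_id.
    users = {1: {"name": "ann"}, 2: {"name": "bob"}}
    posts = [
        {"id": 10, "user_id": 1, "title": "hello"},
        {"id": 11, "user_id": 1, "title": "again"},
    ]

    # Answering "all posts for a user" means a join-like scan over posts.
    def posts_for(user_id):
        return [p for p in posts if p["user_id"] == user_id]

    # Denormalized: posts embedded directly in the user record. Storing the data
    # this way tells the store that the user->posts relation is the one that
    # matters, so each user's data can live on one machine and the common query
    # becomes a single lookup.
    users_denormalized = {
        1: {"name": "ann", "posts": [{"id": 10, "title": "hello"},
                                     {"id": 11, "title": "again"}]},
        2: {"name": "bob", "posts": []},
    }

    print(posts_for(1))
    print(users_denormalized[1]["posts"])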

Of course, denormalization is a rather blunt instrument. Is there better information that could be supplied, information that would help a database spread storage and query load across hundreds, or even thousands, of machines?

So now you can see why going in the direction of RDF is the wrong way to move. We don't need new ways to slice data into finer pieces; in fact, quite the opposite. We need a way to remove, or at least give weights to, the relations we are already dividing along.

Still don't believe me that this is important and that people are out there working on this today? Check it out.

"So now you can see why going in the direction of RDF is the wrong way to move."

I say this as a skeptic on semweb technology: it's more likely you've made a case for it. I'll try to explain why here.

"by denormalizing you are implicitly giving more information to the database, and that 'extra information' makes things run faster."

Denormalising works by removing relations, not qualifying them. The goal is to avoid the computation associated with joins. The issue is that to denormalise data effectively, you have to know something about how the data is being utilised. You therefore optimise for some dimensions rather than others, and, as you point out, some information is needed about which dimensions are the more important ones to denorm around.

Perhaps we can go top down: write smart analysers that dynamically denorm data based on usage patterns; indeed, database optimisation is an industry sector. But another, dumber, option is bottom up: avoid the initial structural 'typing' step and normalise where necessary.

To echo your post, let me draw a comparison to dynamic typing.

I spend a lot of time explaining languages like Python to folks that are accustomed to using Java. One of the ways I explain the benefits of Python is that it 'gives more degrees of freedom'. That is, while the static type system of a language like Java provides certain guarantees and optimisations, it restricts the plasticity of the code base, forcing developers to make more decisions ahead of time, and making changes in the light of new information and feedback more difficult.

A similar thing to dynamic typing happens with RDF-organised data. The data is structured into keys and values that can be combined and decoupled at will, just as you would manipulate dictionaries or slot frames in a programming language. Advance declarations of structural types, as with the upfront normalisation step for an RDBMS, aren't necessary. You identify the objects and make statements about them, but you don't need to state the types of the objects and structure them. Normalisation can be done later on, based on demand. As importantly, all key values are optional; their presence or absence does not hamper or crash query analysers or indexers. This allows a greater degree of freedom when it comes to indexing and partitioning data based on utilisation. It's also valuable for efficiently returning subsets from queries instead of entire rows or objects. Like dynamically typed languages, in RDF type decisions are deferred to the runtime.
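
To make that concrete, here's a minimal sketch, with invented names and URIs, of data held as a schema-less bag of triples and queried by pattern, much as you'd poke at a dictionary:

    # Every statement is a (subject, predicate, object) triple; no schema declared up front.
    triples = {
        ("http://example.org/joe", "name", "Joe"),
        ("http://example.org/joe", "wrote", "http://example.org/post/megadata"),
        ("http://example.org/post/megadata", "title", "Megadata Follow-up"),
        # Keys are optional: bill has no 'wrote' statement and nothing breaks.
        ("http://example.org/bill", "name", "Bill"),
    }

    def match(s=None, p=None, o=None):
        # Query by pattern; None acts as a wildcard.
        return [t for t in triples
                if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

    # "What did joe write?" Structure is imposed at query time, not at load time.
    print(match(s="http://example.org/joe", p="wrote"))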

RDF is not optimised for any particular usage, but can be optimised as needed. This is not the same approach as you advocate, since it comes at things bottom up, but ultimately the same operational results are available for large data sets. Another point is that because the base representation of RDF is uniformly a triple (a key value associated with a URL) or a quad (a key value associated with a URL plus another URL indicating context), it's suitable for the block-level optimisations and handling of sparseness that seem to be preferable with solutions like Bigtable.

Posted by Bill de hOra on 2007-04-11

Bill,

"Like dynamically typed languages, in RDF type decisions are deferred to the runtime."

I see the same type of strategies, and efficiencies, when dealing with "documents" over "data structures".

Posted by Joe on 2007-04-11

On the surface, it looks like RDF will scale maybe better than you think?

Posted by James Abley on 2007-04-12

Maybe I'm just clutching onto relational databases because I'm used to them, but I can't help but wonder if the best solution is something like "materialized views". The idea is to design a normalized database, with all the referential integrity included, just as you would for a small intranet project. Then you provide a number of queries (although it would probably have to be a limited subset of SQL) that are very important, and the software manages all the denormalization for you, guaranteeing that your data is safe. I had been thinking a bit along these lines before you made this post, too; nothing concrete though.
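
A toy sketch of what I mean, with invented names (a real implementation would of course live inside the database, not the application):

    orders = []              # the normalized "table", the source of truth
    sales_by_customer = {}   # the materialized view: the answer to the important query

    def insert_order(customer_id, amount):
        orders.append({"customer_id": customer_id, "amount": amount})
        # The system, not the application, keeps the view consistent on every write.
        sales_by_customer[customer_id] = sales_by_customer.get(customer_id, 0) + amount

    insert_order(1, 30.0)
    insert_order(1, 12.5)
    print(sales_by_customer[1])  # 42.5, read without touching `orders` at all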

Posted by Jeff on 2007-04-12

"when dealing with 'documents' over 'data structures'." A document (or entry) centric approach is absolutely scalable from pov of storage (I'll bet an 1812 backed twitter would not have the problems the Rails backed twitter currently does). So, no argument from me! There are other issues to deal with going document centric tho'. First is granularity; documents tend to get made from fragments or even fragments of other documents, so you end up doing a lot of slicing and dicing; it's tricky to figure what the atomic unit really is and it can introduce things like 'document skeletons'. I believe the problems raised here are often what drives people to want to use RDF description pixie dust inside CMSes. Microformats and purple numbers are a better known outcome caused by granularity issues. Or even, look at how the seaside framework moves away completely from pages towards components (aka sections). Second is layout - at the scales you're talking about, I'd guess that any non-synthetic organisation of documents might ensure the system never reached the scales you're talking about ;)

Posted by Bill de hOra on 2007-04-12

I've only skimmed Bill's comments (because he's usually right), but here's one way of approaching the problem:

Q. What's the distributed data system known to scale best?
A. The Web

Q. How does it represent data?
A. Links (binary relations) embedded in text blobs

Q. The Web is optimised for hypertext; how might the approach be generalised across domains?
A. Type the links, reduce the text

In practice this means giving URIs to all the significant entities and relationships (links). Or to put it yet another way, Just Use RDF.
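
A tiny sketch, with invented URIs, of what "type the links" amounts to: the Web's untyped hyperlink becomes a triple in which the relationship itself gets a URI.

    # The Web's untyped link: all you know is that two resources are related somehow;
    # the meaning lives in the surrounding blob of text.
    untyped_link = ("http://example.org/joe", "http://example.org/post/megadata")

    # The typed link: the relationship itself is named with a URI, so the data can be
    # merged and indexed without the prose around it.
    typed_link = ("http://example.org/joe",               # entity
                  "http://example.org/vocab#authorOf",    # the relationship gets a URI too
                  "http://example.org/post/megadata")     # entity

    print(typed_link[1])   # the relation is now data in its own right, not just anchor text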

Re. "We need a way to remove, or at least give weights to, the relations we are already dividing along." - well yes, that's what RDF Schema (/OWL) provides.

While I'm certainly not convinced an RDF model is ideal for local efficiencies, I suspect it gets a lot closer to optimal when trying to exploit a distributed environment like the Web for arbitrary data.

Bill's order-of-magnitude cost for RDF compared to an RDBMS did sound like a reasonable rule of thumb, but that was with Codd's model having a 30-year head start. Now that triples have had a few years, the native systems that are appearing sound much closer to par; check LargeTripleStores.

btw, if you haven't already, go watch TimBL's The Semantic Web of Data video (streamed, 8 mins 24 secs), the hand-waving master at work.

Posted by Danny on 2007-04-15
