Is there a Regex-able subset of XML?
Let me explain that question in some more detail, first by giving some background. I have been working with XML a lot recently in different contexts: in Python on server-side CGI scripts, in Windows via C#, and under DOS using the Borland 4.51 compiler (don't ask). I have come to a bold and brazen conclusion: XML Sucks.
I didn't always believe that XML sucks. I don't even really think it sucks today; I am just frustrated at the current crop of tools for manipulating XML. I have mentioned this before when talking about the DOM. The replies I got from that post were helpful but didn't get to the root of the problem. All the APIs I have looked at (SAX, DOM, pull, etc.) have problems. It was only recently that I came to realize that the problems in the APIs are just manifestations of fundamental problems with XML itself. As an example, look at the following XML documents:
<item xmlns:dc="http://purl.org/dc/elements/1.1/"> <title>MetaData</title> <dc:date>2003-01-12T00:18:05-05:00</dc:date> <link>http://bitworking.org/news/8</link> <description>Upon waking, the dinosaur...</description> </item>
<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" > <root:title>MetaData</root:title> <bc:date>2003-01-12T00:18:05-05:00</bc:date> <root:link>http://bitworking.org/news/8</root:link> <description>Upon waking, the dinosaur...</description> </root:item>
Both of these documents need to be treated exactly the same.
When I wrote the first version of Aggie I used the XML parser that comes with .Net. I was appalled to find aggregator writers that had used regular expressions to parse the RSS. How foolhardy was that? Look at the above examples and you can see why that strategy will fail: a pattern written against one choice of namespace prefix will not match a document that binds the same namespace to a different prefix. I sneered at the regex-users and moved on to bigger and better things.
Since then yours truly has gained more XML experience, and on many occasions I have found myself pining for the ability to do regular expression processing of XML. If only the pathologies of the above examples didn't exist, I could use a combination of XPath and regular expressions to perform XML manipulations that would be easier for me to implement, understand and maintain. Each time I had a craving for regexes I would suppress the urge and tell myself that it was the wrong thing to do. XML and regexes just don't mix.
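To make that failure concrete, here is a minimal Python sketch. It uses trimmed-down versions of the example documents (only the `title` and date elements), and the second one differs from the first only in renaming the `dc` prefix to `bc`. The helper names `date_by_regex` and `date_by_parser` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Trimmed-down versions of the example documents (an assumption for brevity).
doc_a = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<title>MetaData</title>"
    "<dc:date>2003-01-12T00:18:05-05:00</dc:date>"
    "</item>"
)
# Same data, but the Dublin Core namespace is bound to the prefix "bc".
doc_b = doc_a.replace("xmlns:dc=", "xmlns:bc=").replace("dc:date", "bc:date")


def date_by_regex(text):
    """Naive approach: hard-code the prefix the author happened to use."""
    match = re.search(r"<dc:date>(.*?)</dc:date>", text)
    return match.group(1) if match else None


def date_by_parser(text):
    """Namespace-aware approach: look the element up by namespace URI."""
    node = ET.fromstring(text).find("{%s}date" % DC)
    return node.text if node is not None else None


print(date_by_regex(doc_a), date_by_regex(doc_b))
print(date_by_parser(doc_a), date_by_parser(doc_b))
```

The regex finds the date in the first document and nothing in the second, while the parser returns the same date for both.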
Today I reached the breaking point. The problem isn't with regular expressions, the problem is with XML. The pathologies in XML that preclude the use of regular expressions are just that, pathologies, and ones that need to be excised. Now there are plenty of attempts to subset XML, but all the ones I have reviewed are centered around building faster or simpler parsers. A noble goal, but I want to take that one step further and obviate the requirement of an XML parser. The DOM is there because of the pathologies. Subsetting XML and keeping the DOM is no solution at all. The point of the XML subset that I want is to enable regular-expression manipulation. Now I don't want to throw out the baby with the bathwater, so any subset of XML also needs to still support namespaces. Call me what you will, but I Expect More. The idea of declaring a subset is that if an XML document conforms to the subset then you get the same data out of it whether you process it with regular expressions or with an XML processor.
So what would such a subset look like? Off the top of my head here are the rules for such a subset:
- All namespace declarations must be done in the root element.
- Never a declaration for the "" namespace, i.e. if an element sits in the "" namespace then the element name will never have a namespace qualifier.
- No CDATA sections.
- No DTDs.
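As a rough sketch of what these rules buy you, the Python below checks a document against the four rules and then pulls an element's text out with a regular expression, using the prefix map read from the root element. Everything here is an assumption made for illustration: the function names (`subset_violations`, `prefix_for`, `element_text`) are invented, the patterns assume no `>` inside attribute values and no tag-like text inside comments or processing instructions, and only prefixed, attribute-free elements are handled.

```python
import re

# Approximate patterns (see the assumptions stated above).
START_TAG = re.compile(r"<([A-Za-z_][^\s<>/]*)([^<>]*)>")
XMLNS_ATTR = re.compile(r"""\sxmlns(:[^\s=]+)?\s*=\s*("[^"]*"|'[^']*')""")


def subset_violations(text):
    """Return a list of rule violations; an empty list means 'conforms'."""
    problems = []
    if "<![CDATA[" in text:
        problems.append("CDATA section")
    if "<!DOCTYPE" in text:
        problems.append("DTD")
    # The first start tag is taken to be the root element.
    for index, tag in enumerate(START_TAG.finditer(text)):
        for decl in XMLNS_ATTR.finditer(tag.group(2)):
            if decl.group(2)[1:-1] == "":
                problems.append('declaration for the "" namespace')
            if index > 0:
                problems.append("namespace declared outside the root element")
    return problems


def prefix_for(text, namespace):
    """Find the prefix the root element binds to the given namespace URI."""
    root = START_TAG.search(text)
    if root is None:
        return None
    for decl in XMLNS_ATTR.finditer(root.group(2)):
        if decl.group(1) and decl.group(2)[1:-1] == namespace:
            return decl.group(1)[1:]  # drop the leading ':'
    return None


def element_text(text, namespace, local_name):
    """Regex-only lookup of <prefix:local_name>...</prefix:local_name>."""
    prefix = prefix_for(text, namespace)
    if prefix is None:
        return None
    name = re.escape(prefix + ":" + local_name)
    match = re.search(r"<%s>(.*?)</%s>" % (name, name), text, re.S)
    return match.group(1) if match else None


DC = "http://purl.org/dc/elements/1.1/"
doc = (
    '<item xmlns:bc="http://purl.org/dc/elements/1.1/">'
    "<title>MetaData</title>"
    "<bc:date>2003-01-12T00:18:05-05:00</bc:date>"
    "</item>"
)
print(subset_violations(doc))
print(element_text(doc, DC, "date"))
```

Because all declarations live in the root element, one look at the root is enough to know which prefix to search for, and that is the whole trick.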
Hmm, that list is shorter than I thought it would be. I must be missing something. Either way, if you're up for a lot of reading on subsetting XML you can check out this thread on XML-DEV.