Is there a Regex-able subset of XML?
Let me explain that question in some more detail, first by giving some background. I have been working with XML a lot recently in different contexts, in Python on server-side CGI scripts, in Windows via C#, and under DOS using the Borland 4.51 compiler (don't ask). I have come to a bold and brazen conclusion: XML Sucks.
I didn't always believe that XML sucks. I don't even really think it sucks today; I am just frustrated at the current crop of tools for manipulating XML. I have mentioned this before when talking about the DOM. The replies I got from that post were helpful but didn't get to the root of the problem. All the APIs I have looked at (SAX, DOM, pull, etc.) have problems. It was only recently that I came to realize that the problems in the APIs are just manifestations of fundamental problems with XML itself. As an example of this, look at the following XML documents:
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<dc:date>2003-01-12T00:18:05-05:00</dc:date>
<link>http://bitworking.org/news/8</link>
<description>Upon waking, the dinosaur...</description>
</item>
<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
<root:title>MetaData</root:title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
<root:link>http://bitworking.org/news/8</root:link>
<description>Upon waking, the dinosaur...</description>
</root:item>
Both of these documents need to be treated exactly the same.
When I wrote the first version of Aggie I used the XML parser that comes with .Net. I was appalled to find aggregator writers that had used regular expressions to parse the RSS. How foolhardy was that? Look at the above examples and you can see why that strategy will fail. I sneered at the regex users and moved on to bigger and better things.
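To make that failure concrete, here is a minimal Python sketch. The two documents are trimmed-down variants of the ones above, and that is an assumption of the sketch: only the prefix differs (`dc` versus `bc`), and the `xmlns:root=""` binding is left out because Namespaces in XML 1.0 does not allow a prefix to be bound to the empty string. The function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Same data, same namespace URI, different prefix.
DOC_A = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<dc:date>2003-01-12T00:18:05-05:00</dc:date>
</item>"""

DOC_B = """<item xmlns:bc="http://purl.org/dc/elements/1.1/">
<title>MetaData</title>
<bc:date>2003-01-12T00:18:05-05:00</bc:date>
</item>"""


def date_by_regex(doc):
    """Naive approach: assumes the prefix is always literally 'dc'."""
    match = re.search(r"<dc:date>(.*?)</dc:date>", doc)
    return match.group(1) if match else None


def date_by_parser(doc):
    """Namespace-aware approach: matches on the namespace URI, not the prefix."""
    root = ET.fromstring(doc)
    return root.findtext("{%s}date" % DC_NS)
```

The regex finds the date in `DOC_A` and comes back empty-handed on `DOC_B`, while the parser returns the same value for both, which is exactly why the regex-only aggregators looked so fragile.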
Time passes.
More XML experience is gained by yours truly and on many occasions I have found myself pining for the ability to do regular expression processing of XML. If only the pathologies of the above examples didn't exist then I could use a combination of XPath and regular expressions to perform XML manipulations that would be easier for me to implement, understand and maintain. Each time I had a craving for regexes I would suppress the urge and tell myself that it was the wrong thing to do. XML and regexes just don't mix.
Today I reached the breaking point. The problem isn't with regular expressions; the problem is with XML. The pathologies in XML that preclude the use of regular expressions are just that, pathologies, and ones that need to be excised. Now there are plenty of attempts to subset XML, but all the ones I have reviewed are centered around building faster or simpler parsers. A noble goal, but I want to take that one step further and obviate the requirement of an XML parser. The DOM is there because of the pathologies. Subsetting XML and keeping the DOM is no solution at all. The point of the XML subset that I want is to enable regular-expression manipulation. Now I don't want to throw out the baby with the bathwater: any subset of XML still needs to support namespaces. Call me what you will, but I Expect More. The idea of declaring a subset is that if an XML document conforms to the subset then you get the same data out of it if you process it with regular expressions or if you process it with an XML processor.
So what would such a subset look like? Off the top of my head here are the rules for such a subset:
- All namespace declarations must be done in the root element.
- Never a declaration for the "" namespace, i.e. if an element sits in the "" namespace then the element name will never have a namespace qualifier.
- No CDATA sections.
- No DTDs.
Hmm, that list is shorter than I thought it would be. I must be missing something. Either way, if you're up for a lot of reading on subsetting XML you can check out this thread on XML-DEV.
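As a rough illustration of the four rules listed above, here is a Python sketch of a conformance check that itself uses nothing but regular expressions. The function name `subset_violations` and the two patterns are invented for this example. It also makes some loud assumptions: no attribute value contains `<` or `>`, and no comment contains anything that looks like a tag. Both would confuse the tag pattern.

```python
import re

# Matches a start tag or empty-element tag; group 2 is the raw attribute text.
# Assumption: attribute values never contain '<' or '>'.
START_TAG = re.compile(r"<([A-Za-z_][\w.\-]*(?::[\w.\-]+)?)((?:\s+[^<>]*?)?)\s*/?>")

# Matches xmlns="..." or xmlns:prefix="..."; group 2 is the namespace URI.
NS_DECL = re.compile(r"""(?<![\w:.\-])xmlns(?::[\w.\-]+)?\s*=\s*(["'])(.*?)\1""")


def subset_violations(doc):
    """Return a list of the subset rules that the document text breaks."""
    problems = []
    if "<![CDATA[" in doc:
        problems.append("CDATA section")
    if "<!DOCTYPE" in doc:
        problems.append("DTD")
    # The first start tag found is taken to be the root element.
    for index, tag in enumerate(START_TAG.finditer(doc)):
        for decl in NS_DECL.finditer(tag.group(2)):
            if decl.group(2) == "":
                problems.append('declaration for the "" namespace')
            if index > 0:
                problems.append("namespace declaration outside the root element")
    return problems
```

Under these assumptions the first example document earlier in the post passes with an empty list, and the second one is flagged for its `xmlns:root=""` declaration.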
Posted by Tony Mendoza on 2003-03-18
Posted by Howie on 2003-03-18
Posted by Pascal on 2003-03-18
Posted by anonymous on 2003-03-19
Posted by anonymous on 2003-03-22
Posted by anonymous on 2003-03-24
Posted by anonymous on 2003-03-27
Posted by Fabrizio Castrotorres on 2003-03-28
Posted by peufeu on 2003-04-16
ConciseXML solves the problems with XML. See http://www.concisexml.org
Water is a new Web services and all-purpose programming language that allows you to program in ConciseXML syntax.
Water is a dynamic language that unites the worlds of objects and Web Services in an elegant way. Water adheres to a "Learn Once, Use Everywhere" philosophy where data, logic, and presentation are manipulated and represented in a consistent way. Water is an All-Level Language because it can be used for both high-level and low-level tasks. Water was designed to support both object-oriented programming and functional programming.
Posted by Mike Plusch on 2003-09-25
The "push" model of SAX is backwards from the traditional methods of parser construction (recursive descent / yacc / lex / antlr / etc.) which use a "pull" model, in which you call a "GetNextToken" function to get information about the next token in the stream. Instead of the high level events being signaled by callbacks, the raw input stream is consumed via a callback.
Why is the pull model superior? Because one of the biggest hassles with callbacks is having to keep the state variables around from one callback to the next. One of my long-standing programming maxims is that "The most convenient state variable is generally the program counter". SAX requires us to throw away this convenience. For example, when parsing a recursive structure in SAX, instead of being able to use the naturally occurring stack structure of function calls, we are generally forced to implement our own stack mechanism.
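A small Python sketch of that hand-rolled stack, using the standard library's `xml.sax` module. The handler class and the `text_paths` helper are hypothetical names for this example; the point is only that `self.stack` has to be kept alive between callbacks because the program counter can't do the job.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Records (path, text) for every piece of non-blank character data."""

    def __init__(self):
        super().__init__()
        self.stack = []   # hand-maintained state: names of the open elements
        self.paths = []

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # Note: a parser may deliver one text node in several chunks;
        # this sketch does not try to stitch them back together.
        if content.strip():
            self.paths.append(("/".join(self.stack), content.strip()))


def text_paths(doc):
    """Parse an XML string and return the recorded (path, text) pairs."""
    handler = PathHandler()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.paths
```

Every bit of context the handler needs (where am I in the tree?) lives in instance variables rather than in local variables of a function that is still running.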
So in a typical pull-model parsing situation, you would have a simple for loop that calls NextToken(), and when it recognizes a particular element it calls a subroutine to handle that element and its children. That subroutine would contain a similar for loop that would continue to iterate until the matching end element was found. What could be simpler?
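Here is roughly what that loop could look like in Python, as a sketch rather than a reference implementation. It assumes the event iterator returned by `xml.etree.ElementTree.iterparse` is an acceptable stand-in for NextToken(); the functions `parse_element` and `parse_document` and the dictionary shape they return are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def parse_element(tokens, name):
    """Consume tokens up to the end tag matching an already-seen start tag.

    Nesting is tracked by the call stack: each child start tag triggers a
    recursive call, which returns once its own end tag has been consumed.
    """
    children = []
    for event, elem in tokens:           # each iteration pulls the next token
        if event == "start":
            children.append(parse_element(tokens, elem.tag))
        else:                            # "end" event for this element
            text = (elem.text or "").strip()
            return {"name": name, "text": text, "children": children}
    raise ValueError("input ended before </%s>" % name)


def parse_document(doc):
    """Parse an XML string into nested dictionaries, pull style."""
    source = io.BytesIO(doc.encode("utf-8"))
    tokens = ET.iterparse(source, events=("start", "end"))
    event, root = next(tokens)           # the NextToken() call for the root
    return parse_element(tokens, root.tag)
```

There is no explicit stack anywhere: the recursion depth of `parse_element` is the element depth, so the "state variable" really is the program counter.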
Now, certainly there are some things, such as entity resolution, that should continue to be done via callback. However, the parser could support the notion of optional callbacks - that is, when you initialize the parser you could specify which token types you want to handle via callback.
Another thing that the "pull" model brings is the possibility of applying Bison/ANTLR-like technology for the construction of "XML Parser Generators" using grammar rules. These types of systems generally require a "token stream" with some degree of lookahead. This is relatively trivial to implement in a pull model, but more challenging in a push model.
Posted by Talin on 2003-11-28
Posted by Sean McGrath on 2004-10-30
Posted by anonymous on 2003-03-18