Regex-able Xml

Joe Gregorio

Is there a Regex-able subset of XML?

Let me explain that question in some more detail, first by giving some background. I have been working with XML a lot recently in different contexts, in Python on server-side CGI scripts, in Windows via C#, and under DOS using the Borland 4.51 compiler (don't ask). I have come to a bold and brazen conclusion: XML Sucks.

I didn't always believe that XML sucks. I don't even really think it sucks today, I am just frustrated at the current crop of tools for manipulating XML. I have mentioned this before when talking about the DOM. The replies I got from that post we're helpful but didn't get to the root of the problem. All the APIs I have looked at, SAX, DOM, pull, etc, all have problems. It was only recently that I came to realize that the problems in the APIs are just manifestations of fundamental problems with XML itself. As an example of this look at the following XML documents:

<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>MetaData</title>
  <dc:date>2003-01-12T00:18:05-05:00</bc:date>

  <link>http://bitworking.org/news/8</link>
  <description>Upon waking, the dinosaur...</description>
</item>
<root:item xmlns:bc="http://purl.org/dc/elements/1.1/" xmlns:root="" >
  <root:title>MetaData</root:title>

  <bc:date>2003-01-12T00:18:05-05:00</bc:date>
  <root:link>http://bitworking.org/news/8</root:link>
  <description>Upon waking, the dinosaur...</description>
</root:item>

Both of these documents need to be treated exactly the same.

When I wrote the first version of Aggie I used the XML parser that comes with .Net. I was apalled to find aggregator writers that had used regular expressions to parse the RSS. How fool-hardy was that, look at the above examples and you can see why that strategy will fail. I sneered at the regex-users and moved on to bigger and better things.

Time passes.

More XML experience is gained by yours truly and on many occasions I have found myself pining for the ability to do regular expression processing of XML. If only the pathologies of the above examples didn't exist then I could use a combination of XPath and regular expressions to perform XML manipulations that would be easier for me to implement, understand and maintain. Each time I had a craving for regex's I would suppress the urge, tell myself that it was the wrong thing to do. XML and regexs just don't mix.

Today I reached the breaking point. The problem isn't with regular expressions, the problem is with XML. The pathologies in XML that preclude the use of regular expressions are just that, pathologies, and ones that need to be excised. Now there are plenty of attempts to subset XML, but all the ones I have reviewed are centered around building faster or simpler parsers. A noble goal but I want to take that one step further and obviate the requirement of an XML parser. The DOM is there because of the pathologies. Subsetting XML and keeping the DOM is no solution at all. The point of an XML subset that I want is to enable regular-expression manipulation. Now I don't want to throw out the baby with the bathwater and any subset of XML also needs to still support namespaces. Call me what you will, but I Expect More. The idea of declaring a subset is that if an XML document conforms to the subset then you get the same data out of it if you process it with regular expressions of if you process it with an XML processor.

So what would such a subset look like? Off the top of my head here are the rules for such a subset:

  1. All namespace declarations must be done in the root element.
  2. Never a declaration for the "" namespace. I.e. if an element sits the "" namespace then the element name will never have a namespace qualifier.
  3. No CDATA sections.
  4. No DTDs.

Hmm, that list is shorter than I thought it would be. I must be missing something. Either way, if you're up for a lot of reading on subsetting XML you can check out this thread on XML-DEV.

No matter how one changes the syntax or tweak the parsers nothing will change. The problem inherently in XML is that !! XML IS NOT SELF CONTAINED !!! To step into the GOLDEN AGE we need programming languages which are SELF CONTAINED. So the solution is to produce a language which can describe itself. And if someone wants weird syntactical elements for his own pleasure he can do so, describe them and then use them. No one else is affected cause the FOUNDATION is the ONE self describing language and every one else is able to parse its document.

Posted by anonymous on 2003-03-18

Hmmmm.... Self-Contained, self describing languages... LISP??

Posted by Tony Mendoza on 2003-03-18

Any language can be self-describing -- it depends on the programmer. For instance, a C function with a name of fetchAccountsForCustomerID() does what? what's it return, presumably? how about a Java method? Customer.accountsForCustomerIDWithConnection() gee, you can pretty much tell what the two arguments are to that method, cant you? ( hint for the clueless: customer id and a connection ). now look at objective-c, which borrows heavily from Smalltalk: [Customer accountsForCustomerID:(int)value withConnection:(Connection *)conn] Check http://www.toodarkpark.org/computers/objc/ for a rather interesting read at this little-known language. and write cleaner, more descriptive code, regardless of language.

Posted by Howie on 2003-03-18

Forget about XML! Parse XML into s-expressions and manipulate them in a Lisp dialect, preferably Common Lisp. Many of XML's problems have been solved in the Lisp world for decades. Just my ?0.02

Posted by Pascal on 2003-03-18

You are obviously not interested in validating documents. Unfortunately, many of us are. Your regexp scheme does not address this at all, much less anything having to do with XML Schema validation. XML wasn't designed for processors, it was designed for humans, that's why it is so verbose and full of "pathologies" as you call them. I have no trouble manipulating DOM and SAX, and I'm sure others feel the same way. At least we have a standard API to deal with it.

Posted by anonymous on 2003-03-19

I like subsets, but perl and other langs have full XML parsers, then.. the right thing can be to use then... or to use a subset of xml. Maybe the XML generated for 3 rules of robo...oops... 3 rules of xml.

Posted by anonymous on 2003-03-22

Regular expressions work on strings, where as XML is a tree language. Therefore, you will not be able to work with XML *in general* using regular expressions, only certain schemata. Dig out the old computer science textbook and look in the index for "Pumping lemma"...

Posted by anonymous on 2003-03-24

Yes, XML sucks in the real world. I don't want to write gobs of code just to do a simple search and replace of a character within a data element buried in some XML hierarchy being used to transfer data between databases for example. One line of sed becomes pages of code, 5 minutes becomes an entire day. In many cases, XML sucks in the real world, your subset sounds fine with me, I have thought about just such a thing in the past.

Posted by anonymous on 2003-03-27

I agree that working with XML directly is a big pain in today's IT environment. However I think that is due to the way people think about dealing with XML. XML is really just 2 things 1. A way to store a document in a consistent logical way that can be checked and verified 2. A way to transport a document from one system to another in a protocol/machine independent way The missing piece is to convert the XML directly to an object. The folks doing massive amounts of JDBC/ADO.NET code have the same problems as the people parsing XML. What people really want is just a simple object model they can manipulate That's why I like JDO. I dont want to worry about how the data is stored, I just want to get the object and do what I need to do. What is needed is a XML to JDO mapping (at least for the Java camp) to make things easier to handle. I don't think its a problem inherent with XML, its a problem with the way people deal with XML. Yes the translation process eats up tons of memory while u create the DOM, yada, yada but in my view this is the elegant way to solve the problem.

Posted by Fabrizio Castrotorres on 2003-03-28

What you need is YaML !

Posted by peufeu on 2003-04-16

ConciseXML solves the problems with
XML.  See http://www.concisexml.org

Water is a new Web services and all-purpose programming language that allows you to program in ConciseXML syntax.

Water is a dynamic language that unites the worlds of objects and Web Services in an elegant way. Water adheres to a "Learn Once, Use Everywhere" philosophy where data, logic, and presentation are manipulated and represented in a consistent way. Water is an All-Level Language because it can be used for both high-level and low-level tasks. Water was designed to support both object-oriented programming and functional programming.

Posted by Mike Plusch on 2003-09-25

The "push" model of SAX is backwards from the traditional methods of parser construction (recursive descent / yacc / lexx / antlr / etc.) which use a "pull" model, in which you call a "GetNextToken" function to get information about the next token in the stream. Instead of the high level events being signaled by callbacks, the raw input stream is consumed via a callback.

Why is the pull model superior? Because one of the biggest hassles with callbacks is having to keep the state variables around from one callback to the next. One of my long-standing programming maxims is that "The most convenient state variable is generally the program counter". SAX requires us to throw away this convenience. For example, when parsing a recursive structure in SAX, instead of being able to use the naturally-occuring stack structure of function calls, we are generally forced to implement our own stack mechanism.

So in a typical pull-model parsing situation, you would have a simple for loop that calls NextToken(), and when it recognizes a particular element it calls a subroutine to handle that element and its children; That subroutine would contain a similar for loop that would continue to iterate until the matching end element was found. What could be simpler?

Now, certainly there are some things, such as entity resolution, that should continue to be done via callback. However, the parser could support the notion of optional callbacks - that is, when you initialize the parser you could specify which token types you want to handle via callback.

Another thing that the "pull" model brings is the possibility of applying Bison/ANTLR-like technology for the contstuction of "XML Parser Generators" using grammar rules. These types of systems generally require a "token stream" with some degree of lookahead. Relatively trivial to implement in a pull model, but more challenging in a push model.

Posted by Talin on 2003-11-28

For a regexable subset of XML, you might find RIG 2 of interest.

Posted by Sean McGrath on 2004-10-30

comments powered by Disqus