be conservative in what you do, be liberal in what you accept from others. [Section 2.10 - RFC 793]
This has been bandied about as a founding precept of the internet for years. I don't think it's ever been more controversial than when it meets XML.
Mark Pilgrim has been pushing the idea that there are no exceptions to Postel's Law and that every feed aggregator needs to support his vision of a liberal XML parser, using the above phrase as his justification. I'm going to approach this very carefully: first by agreeing with Mark that some feed aggregators could support a liberal parsing mode, but then by carefully delineating how that has almost nothing to do with exceptions to Postel's Law.
First, I agree that an aggregator should have liberal parsing. The end user doesn't and shouldn't care that there is anything called XML, nor that it could or couldn't be 'well-formed' or 'valid', both of which are rigorously defined terms for XML. They just want to catch up on their news, and anything that slows them down will just force them to switch to another aggregator. This covers the 'be liberal in what you accept from others', but it completely ignores the 'be conservative in what you do' part of the phrase. It's a two-part phrase that describes the characteristics of a robust system by laying out the responsibilities of both the client and the server. Go and read RFC 793 and look at how the robustness principle is applied; never is the client admonished to accept invalid data. Here is a lengthier wording from RFC 760:
In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior. That is, it should be careful to send well-formed datagrams, but should accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).
And RFC 2360 is even clearer (emphasis mine):
Or establish restrictions on what a protocol transmits, but be able to deal with every conceivable error received. Caution is urged in applying this approach in standards track protocols. It has in the past lead to conflicts between vendors when interoperability fails. The sender accuses the receiver of failing to be liberal enough, and the receiver accuses the sender of not being conservative enough. Therefore, the author is obligated to provide extensive detail on send and receive behavior.
To avoid any confusion between the two, recommend that standard authors specify send and receive behavior separately. The description of reception will require the most detailing. For implementations are expected to continue operating regardless of error received. Therefore, the actions taken to achieve that result, need to be laid out in the protocol specification. Standard authors should concern themselves on achieving a level of cooperation that limits network disruption, not just how to survive on the network. The appearance of undefined information or conditions must not cause a network or host failure. This requires specification on how to attempt acceptance of most of the packets. Two approaches are available, either using as much of the packet's content as possible, or invoking error procedures. The author should specify a dividing line on when to take which approach.
Never is it stated that you should accept absolutely everything under the sun, only that you should accept as much as possible and then have your protocol delineate clearly how invalid data is to be handled. We'll talk later about how XML covers that. Note also that the wording is in terms of vendor interoperability based on a specification, not the acceptance of random dreck hauled in off the street.
Mark's perceptions may be a bit colored. Two years ago, when we both started to get into aggregators and syndication, the quality of feeds was much worse than it is now. Upwards of 10% of all feeds were not even well-formed XML, never mind conforming to some notion of validity. With the expected consolidation of blogging software that we all knew would happen, and competition among real vendors making real money, the quality of the feeds has jumped and it's honestly been a good six months since I've seen a feed that wasn't valid XML.
Mark states: "There are no exceptions to Postel's Law. Anyone who tries to tell you differently is probably a client-side developer who wants the entire world to change so that their life might be 0.00001% easier. The world doesn't work that way."
We've already looked at the fact that Mark is ignoring a full half of Postel's Law, and at the fact that it was not an admonishment to accept invalid data, but let's set that aside for now and dig further into this idea that there are no exceptions to Postel's Law in the sense he is using it.
For example, are there ultra-liberal image viewers? Do you constantly run into JPEG and GIF files that won't display in your browser? How about an ultra-liberal PDF parser? Anyone seen a pop-up ad for an ultra-liberal MP3 player?
No? That's surprising, because according to Mark there are no exceptions to Postel's Law, and if we follow that logic to the end then there must be a market for more liberal consumers of every format.
Why don't you need an ultra-liberal MP3 player? Because MP3 is a binary format. No one is sitting up late at night hand-encoding a song. They are using tools to create the data and tools to consume the data.
Now you could throw up your hands and complain that I'm not comparing apples to apples since I am putting XML up against binary formats.
Mark states: "Various people have tried to mandate this principle out of existence, some going so far as to claim that Postel's Law should not apply to XML, because (apparently) the three letters X, M, and L are a magical combination that signal a glorious revolution that somehow overturns the fundamental principles of interoperability."
Well, the letters X, M, and L aren't magical, but the XML format does occupy a rather odd middle ground between data, text, and presentation. As usual the world isn't simple, and the classification of file formats isn't simple either. It actually breaks up into several axes. The first axis is text versus binary. JPEG is a binary format, while many unix configuration files like /etc/passwd are text files. The distinction here is a gradation, though, and not as clear-cut as you might expect; for example, unicode files in either UTF-8 or UTF-16 are somewhere between text and binary.
The second axis is data versus language. On the extreme data side is a JPEG file, which is just a rigid description of the bits that make up the image. On the other extreme, an HTML file contains written text mixed in with markup that suggests how it is to be displayed.
          text
            ^
            |
data <------+------> language
            |
            v
          binary
On this chart, where does our JPEG file fit? In the lower left corner. Same for both GIF and MP3. Plain text files, like my ascii art above, sit in the top right corner, while /etc/passwd would sit in the top left corner.
Here is where things start to get tricky. Where does HTML sit? Because it can be encoded in UTF-8 or UTF-16 it is somewhere between text and binary, but it is only used for presentation purposes, i.e. the primary consumers of HTML are browsers, which just display the HTML to the user, which puts it strongly to the language side of the data/language axis.
Where does XML sit? Here is the real flaw in Mark's argument: he has a myopic view of XML as only being used for syndication. But XML is used for much, much more. It is used like HTML in XHTML, for markup in syndication, for the configuration files of many applications, and for streaming data between systems at large corporations. Let's just take two samples from that list, XHTML and a single financial transaction going between two banks. XHTML, like HTML, sits far to the language side of the data/language axis. On the other hand, the financial transaction sits firmly on the data side. Does this context make a difference? Let me put it this way: do you want your payroll deposit consumed by an ultra-liberal parser that 'fixes' errors in the XML as it consumes it? I didn't think so. But would you even care if a browser presented you a web page that was missing a closing tag on a div? Again, I didn't think so. Context is important, and that's why blanket statements like "There are no exceptions to Postel's Law" are wrong, aside from misrepresenting Postel's Law and ignoring the other half of it.
The other problem with blanket statements like that about XML is that XML has several levels of strictness. The first level is well-formedness, which means the document conforms to the basic rules that every XML file has to meet. An additional level of strictness beyond just well-formed is valid, which means that it is valid against some restriction on its structure, which could come from a DTD, an XML Schema, or a RELAX NG schema.
Given these two levels of strictness and differing contexts, you need to decide how strict you want to be. For example, with an XHTML document I would consider only well-formedness as a restriction for consumption. On the other hand, at the very least I want my bank transaction validated against a schema, or two, or three. The context, and whether the information sits closer to data or language, determines how strict the consumer should be.
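To make the two levels concrete, here is a minimal sketch in Python; it uses the third-party lxml library purely for illustration, and the schema file name is a placeholder. Any conforming parser and validator would do:

from lxml import etree  # third-party library, not part of the stdlib

def is_well_formed(data):
    # The lower bound: does the document parse at all?
    try:
        etree.fromstring(data)
        return True
    except etree.XMLSyntaxError:
        return False

def is_valid(data, schema_file):
    # The stricter bar: well-formed and valid against an XML Schema.
    schema = etree.XMLSchema(etree.parse(schema_file))
    try:
        doc = etree.fromstring(data)
    except etree.XMLSyntaxError:
        return False
    return schema.validate(doc)

An XHTML page might only need the first check; the bank transaction should clear the second, against whatever schema, or schemas, the two banks have agreed on.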
So let's get back to syndication. Where would a feed be on the chart?
Here is where edge cases come in, as a feed contains both data and language elements. The feed contains some content clearly on the language side of the data/language axis, but also contains items such as the publication date and the permalink which are clearly on the data side of the axis. This is why I agree with Mark: since a feed contains information on the language side of the data/language axis, it should be consumed with a liberal parser. But I argue that syndication in general is an edge case, and that strict edicts on how to consume XML, either liberal or strict, are inappropriate given the range of applications over which XML is used.
The context in which you are 'consuming' the feed is also important in determining if you should do liberal parsing. Ensuring that the feed is always well-formed XML is important because you don't know what it is being consumed for. The feed might be consumed by an aggregator, but could as easily be destined for transformation via XSLT or searched via XPath statements.
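As a sketch of what 'be conservative in what you do' can look like on the publishing side, here is a small check, in Python with only the standard library, that a feed producer could run before pushing a feed out the door. The function name is purely illustrative:

from xml.dom import minidom
from xml.parsers.expat import ExpatError

def safe_to_publish(feed_bytes):
    # Refuse to publish a feed that is not well-formed XML, since we
    # don't know whether it will meet an aggregator, an XSLT processor,
    # or an XPath query on the other end.
    try:
        minidom.parseString(feed_bytes)
        return True
    except ExpatError:
        return False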
The first thing to define is what I consider to be a liberal parser. First, I am only looking for well-formed feeds, and if the feed is not well-formed then there are two fixes you can try to make it well-formed. The two main types of errors I have seen in feeds, and these cover 90% of the problem feeds in existence, are character encoding issues and the inclusion of undefined character entities. Both of these are fairly easy to scan for and 'fix' before re-submitting the file to an XML parser to consume. As I stated before, though, the number of problem feeds is dropping and the need for liberal parsing will disappear over time.
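To give a feel for how little machinery those two 'fixes' require, here is a rough sketch in Python using only the standard library. The entity table is a tiny sample, the fallback to windows-1252 is my assumption about the most common encoding mistake, and a real implementation (like Mark's feed parser) does considerably more:

import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Named HTML entities that are not defined in XML (only &amp; &lt; &gt;
# &quot; and &apos; are built in). A tiny sample for illustration.
ENTITY_FIXES = {'&nbsp;': '&#160;', '&copy;': '&#169;', '&mdash;': '&#8212;'}

def liberal_parse(data):
    # data is the raw bytes of the feed. First see if it is already
    # well-formed and parse it as-is.
    try:
        return minidom.parseString(data)
    except (ExpatError, ValueError):
        pass
    # Fix 1: character encoding issues. Assume the bytes are really
    # windows-1252 (a common culprit) and re-declare them as utf-8.
    text = data.decode('windows-1252', 'replace')
    text = re.sub(r'encoding="[^"]*"', 'encoding="utf-8"', text, count=1)
    # Fix 2: undefined character entities become numeric references.
    for bad, good in ENTITY_FIXES.items():
        text = text.replace(bad, good)
    # Re-submit to the real XML parser; if it still fails, give up.
    return minidom.parseString(text.encode('utf-8'))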
The biggest problem with liberal parsing is: who gets to decide what's liberal enough? Is Mark Pilgrim's feed parser the gold standard of liberalness? Should we fix character encodings? How about character entities? Missing tags? How about if all the brackets were replaced with parentheses? This was one of the great insights of the XML specification: it gave a very clearly defined lower limit on what constitutes an XML document.
Let's look at a feed with a couple missing tags:
<feed>
  <link href="http://example.com" rel="alternate" type="text/html" />
  </entry>
</feed>
Now does that link element refer to the feed or the entry? How should I decide, besides passing it through Mark's feed parser? You wouldn't know, which is why XML has a lower bound on what's acceptable, well-formedness, and that lower bound keeps you out of the messy business of trying to guess what the above example is trying to say.
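For what it's worth, here is what a conforming parser, in this case Python's stdlib expat via minidom, does with that fragment; it refuses to guess and reports the mismatched tag instead:

from xml.dom import minidom
from xml.parsers.expat import ExpatError

broken = """<feed>
  <link href="http://example.com" rel="alternate" type="text/html" />
  </entry>
</feed>"""

try:
    minidom.parseString(broken)
except ExpatError as e:
    # expat points at the mismatched </entry> tag with a line and column,
    # rather than silently deciding which element the link belonged to.
    print("not well-formed:", e)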
In general, the statement "There are no exceptions to Postel's Law" is wrong in the context it was used in, gives the wrong impression that all the work falls to the consumer, and also gives the wrong impression that it advocates the consumption of invalid data. A little digging reveals the true intention of Postel's Law, which I believe is something there is no escape from: a guiding principle for implementors building both clients and servers, if you want interoperable implementations of a specification.
Tim Bray has an excellent write-up on this same subject. The short answer: if you can't generate a valid XML file, then you're a bozo.