Elliotte Rusty Harold

Apparently Google does not recognize XHTML, at least not when served as application/xhtml+xml. Try this search which should return exactly one hit pointing to an XHTML document. Notice that the file format is “unrecognized” and they offer to let you “View it as HTML”.

Ouch. I appear to be suffering the same fate. Oddly enough, Sam doesn't seem to have that problem. The difference between us may be in content negotiation. Here is my code for determining when to use application/xhtml+xml or text/html:

# Only serve XHTML to those clients that can understand it.
if serialization in matching:
        best = mimeparse.best_match(matching.keys(), environ.get('HTTP_ACCEPT', 
        (contenttype, serialization) = (best, match[best])
if serialization == 'xhtml' and environ.get('HTTP_USER_AGENT', '').find("MSIE") >= 0:
        (contenttype, serialization) = extensions['html']

Note the great bit of code for detecting and compensating for IE, which sends the equivalent of "*/*" in its Accept: header, yet doesn't handle XHTML.

The mimeparse module follows the algorithm given in RFC 2616 for matching the media type against the Accept: header, and it is covered in detail in my XML.com article, "Just use media types?".
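To make the failure mode concrete, here is a minimal, self-contained sketch of this style of negotiation. It is not the mimeparse implementation: `parse_accept`, `quality`, and `choose_content_type` are hypothetical names, and the matching is a simplified reading of the RFC 2616 rules (most specific media range wins, parameters other than q are ignored).

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def parse_accept(header):
    """Split an Accept header into (type, subtype, q) tuples."""
    ranges = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media = pieces[0].lower()
        if "/" not in media:
            continue  # skip empty or malformed entries
        mtype, subtype = media.split("/", 1)
        q = 1.0  # q defaults to 1 when absent
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 1.0
        ranges.append((mtype, subtype, q))
    return ranges


def quality(mime, header):
    """Return the q-value of the most specific range matching mime, or 0.0."""
    mtype, subtype = mime.lower().split("/", 1)
    best_rank, best_q = -1, 0.0
    for rtype, rsub, q in parse_accept(header):
        if (rtype, rsub) == (mtype, subtype):
            rank = 2  # exact match
        elif rtype == mtype and rsub == "*":
            rank = 1  # type/* wildcard
        elif (rtype, rsub) == ("*", "*"):
            rank = 0  # */* wildcard
        else:
            continue
        if rank > best_rank:
            best_rank, best_q = rank, q
    return best_q


def choose_content_type(accept, user_agent):
    """Serve XHTML when the client rates it at least as high as HTML."""
    if "MSIE" in user_agent:
        return HTML  # IE sends */* but cannot render XHTML
    q_xhtml = quality(XHTML, accept)
    if q_xhtml > 0 and q_xhtml >= quality(HTML, accept):
        return XHTML
    return HTML
```

With this sketch, any client other than IE that sends a bare `Accept: */*` is handed application/xhtml+xml, because the wildcard matches both types with equal quality.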

Compare this to Sam's detection algorithm, which is implemented in his .htaccess file:

RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} \.html$
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} !\.
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]

The difference is that Sam will serve up his content as application/xhtml+xml only if that media type appears explicitly in the Accept: header and is not followed by a quality parameter of zero (q=0, q=0.0, and so on). This neatly fixes the problem for IE and apparently also for the Google crawler. If I were a gambling man I'd wager that Google also sends Accept: */*, and just like IE, doesn't understand XHTML.
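For comparison, the two RewriteCond tests on HTTP_ACCEPT can be sketched in Python using the same regular expressions. The function name `explicitly_accepts_xhtml` is hypothetical, and the sketch covers only the Accept: checks, not the REQUEST_URI conditions.

```python
import re

# Same patterns as the RewriteCond lines; like mod_rewrite, match anywhere.
XHTML_LISTED = re.compile(r"application/xhtml\+xml")
XHTML_Q_ZERO = re.compile(r"application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)")


def explicitly_accepts_xhtml(accept):
    """True if XHTML is named in the header and not given a quality of zero."""
    listed = XHTML_LISTED.search(accept) is not None
    refused = XHTML_Q_ZERO.search(accept) is not None
    return listed and not refused
```

A bare `*/*` fails the first test, so such clients get text/html. A nonzero quality such as `application/xhtml+xml;q=0.5` does not match the second pattern, so that client still gets XHTML.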

I've warned against using content negotiation before and this is just another example of the kinds of problems it can cause.

At this point I will probably drop back to text/html for all my pages and explore other means of displaying SVG, or convert the SVG into PNGs on the fly.

I got rid of the content negotiation, too (before going public). Inline MathML/SVG is not worth a) losing incremental rendering in Firefox and b) coping with the differences between XHTML and HTML.

Posted by Jeff on 2007-03-08

Living and dying by Google. Sigh. I had a funky XSLT-based web site for my old .mac site: http://homepage.mac.com/pmuellr . But of course, Google couldn't index any of it. Really sad, because this is actually a pretty good way to add boilerplate HTML goop around the pages on your site. I suppose I could have fixed this with a lot of .htaccess magic, but then I'd lose some of the great features of the system anyway, like being able to view the files directly from disk without needing .htaccess magic to fix things.

Posted by Patrick Mueller on 2007-03-08

How about using local quality preferences to weigh in on the content negotiation?

Your application can be configured to serve pages with MIME types in the following order of highest preference descending:

  • text/html
  • application/xhtml+xml

If the browser requests a resource with an Accept header that matches both of these MIME types then you will always serve up the MIME type with the highest local quality, i.e. text/html.

Only if a browser explicitly requests application/xhtml+xml will they be served XHTML.
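A rough sketch of that idea, assuming the server multiplies the client's q-value by its own local quality for each type. The names `client_quality`, `negotiate`, and `SERVER_PREFERENCES`, and the 1.0 and 0.9 weights, are illustrative assumptions and not taken from any particular server.

```python
# Hypothetical local qualities, in descending order of server preference.
SERVER_PREFERENCES = [("text/html", 1.0), ("application/xhtml+xml", 0.9)]


def client_quality(mime, accept):
    """q-value the client assigns to mime; the most specific range wins."""
    mtype = mime.split("/", 1)[0]
    best = (-1, 0.0)  # (specificity, q)
    for item in accept.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_range = fields[0].lower()
        q = 1.0  # q defaults to 1 when absent
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 1.0
        if media_range == mime:
            rank = 2
        elif media_range == mtype + "/*":
            rank = 1
        elif media_range == "*/*":
            rank = 0
        else:
            continue
        best = max(best, (rank, q))
    return best[1]


def negotiate(accept):
    """Pick the type with the highest client q times local quality."""
    best_type, best_score = None, 0.0
    for mime, local_q in SERVER_PREFERENCES:
        score = client_quality(mime, accept) * local_q
        if score > best_score:  # strict, so ties go to the earlier entry
            best_type, best_score = mime, score
    return best_type
```

Here `negotiate("*/*")` yields text/html, while a client that lists only application/xhtml+xml gets XHTML.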

Posted by Noah Slater on 2007-03-08

I think that the biggest reason to avoid serving things as 'application/xhtml+xml' is that both Opera and Firefox have the most monumentally crappy way of handling malformed or invalid XML: they display a generic XML validation error with only the first three lines of the XHTML source, so unless the person experiencing the error understands XML well-formedness they'll probably never see the real server-side error that's causing the problem as they would in IE.

A defect report at bugzilla.mozilla.org reported this problem something like six years ago, but unfortunately every Mozilla developer who sees it reacts with something like "mneeeh, if you can't guarantee your app will always output well-formed XML you shouldn't publish as XHTML!" as if the apps they write, much less *any* web application, actually does that. Even web services, which have to return well-formed XML to function at all, output plain text error messages in all sorts of conditions, under both .NET and Java and every other platform I've seen.

All they'd have to do is display the full XHTML source along with the invalid XML error and it would be much more workable. It's really unfortunate because Firefox is otherwise pretty good at providing informative error messages.

Posted by ticktock256 on 2007-03-08

Hmm, another view might be "Google Considered Harmful, Again". Going back to your previous post: [[ If I hand you a URI only, and that URI supports conneg, then I get no control over which representation you retrieve. ]] That in itself actually sounds positive, in that the choice is entirely with the consumer. But there are problems with clients like those listed. Dunno, maybe provision of the different representations redundantly at separate URIs (example.xml, example.html...) along with linked references (rel="alternate") may help.

Posted by Danny on 2007-03-08

The problem with content negotiation is not with the content negotiation idea itself, but with how it's wrongly implemented in Internet Explorer and probably Googlebot. Sending 'Accept: */*' is just ridiculous. It's like saying the web won't evolve; "we won't have any more content types than what we have currently, but if new content types emerge, we will certainly understand them because our developers have a time machine and know everything that will ever happen anywhere on the web!" Unless what your application does is just dumbly dumping the content to a file system, you can be pretty damn sure labelling it with 'Accept: */*' is highly inappropriate and that it most definitely will break something.

Posted by Asbjørn Ulsberg on 2007-03-09