Character Encoding is Hard

Joe Gregorio

Character encoding is hard. Really. If I could point to one thing that causes feeds to be invalid more than anything else, it would be character encoding. When I first started working with RSS I was always surprised at the energy Bill Kearney put into character encoding. If there was one thing you could count on, it was that Bill would always jump into a conversation on character encoding. Two years later I am finally coming to that place, the place where I jump into any discussion on character encoding. I finally get it, and I finally see what grief it causes with XML, not just in RSS feeds but in other areas too. Don't believe me? Not even DMOZ can get character encoding right [via diveintomark].

You see, this is one of the things about XML: a conformant XML processor is only required to accept "utf-8" and "utf-16". So it's possible that an XML processor could reject "Shift_JIS" or "ISO-2022-JP". Who knows, there might even be an XML processor out there that rejects well-formed XML encoded in "utf-32". The more I learn about character encoding, the more I like "utf-8".
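
As a rough sketch of what settling on "utf-8" can look like in practice, the Python below parses a document in whatever encoding it declares and writes it back out as "utf-8". The helper name `reencode_as_utf8` is made up for illustration and is not part of any feed library. It also assumes the parser at hand understands the declared encoding, which is exactly what is not guaranteed for anything other than "utf-8" and "utf-16".

```python
import xml.etree.ElementTree as ET


def reencode_as_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: re-serialize an XML document as UTF-8.

    The parser reads the encoding from the XML declaration in `raw`.
    If it does not support that encoding, parsing raises an error,
    which is the failure mode described above.
    """
    root = ET.fromstring(raw)
    # With encoding="utf-8" ElementTree emits no declaration of its own,
    # so one is added here to make the encoding explicit.
    body = ET.tostring(root, encoding="utf-8")
    return b'<?xml version="1.0" encoding="utf-8"?>\n' + body


# A tiny document declared and encoded as ISO-8859-1.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<title>caf\u00e9</title>"
).encode("iso-8859-1")

utf8_doc = reencode_as_utf8(latin1_doc)
```

The "é" is one byte in the ISO-8859-1 input and two bytes in the "utf-8" output, but any conformant processor is required to accept the second form.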

Amen to that, Joe. I deal with character encoding daily with my work on ecto: http://www.kung-foo.tv/blog/archives/000818.php

Posted by Adriaan on 2004-03-24

Hey Joe

you might like this one

http://www.joelonsoftware.com/articles/Unicode.html

Posted by Karl on 2004-03-27
