The URI of a Weblog Entry

RSS 2.0 introduced the idea of a guid, an identifier that uniquely identifies an entry. When Sam called for a discussion of the essence of an entry, the permalink versus guid discussion began.

The discussion has settled down, and the consensus seems to be that each entry will be identified by a URI. If the URI has the scheme http: or https:, then the identifier is also considered a permalink, i.e. a link to the content on the web.
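As a rough illustration of that rule, here is a minimal sketch that decides whether an entry identifier also counts as a permalink. The function name is_permalink and the example identifiers are hypothetical, not part of any spec. It also assumes a modern Python and uses the standard library's urllib.parse.urlsplit to pull out the scheme.

```python
from urllib.parse import urlsplit

def is_permalink(identifier):
    """Return True if the identifier's scheme is http or https.

    Hypothetical helper: it only encodes the rule described above,
    that http: and https: identifiers double as permalinks.
    """
    # URI schemes are case-insensitive, so compare in lower case.
    scheme = urlsplit(identifier).scheme.lower()
    return scheme in ("http", "https")

# Made-up example identifiers:
print(is_permalink("http://example.org/archives/000123.html"))  # True
print(is_permalink("urn:foo:a123,456"))                          # False
```

Any other scheme still gives a valid identifier; it just makes no promise that the entry can be fetched from the web.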

This is a really good compromise: it gives people who want to supply permalinks the ability to do so, while giving tool vendors a wide range of options if they don't, or can't, supply a permalink. It is also forward-looking, allowing you to build RSS feeds that point into other networks, such as freenode.

Using URIs is good because it allows you to assign a unique identifier that could be a web address but, regardless, fits within a prescribed syntax. The best reading of the lot, for implementation details, was the URI specification, which includes a regular expression for parsing URIs:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

A little Python shows how easy it is to parse a URI, whether it is a web address or not:

>>> import re
>>> uri_regex = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
>>> uri_regex.search("http://www.ics.uci.edu/pub/ietf/uri/#Related").groups()
('http:', 'http', '//www.ics.uci.edu', 'www.ics.uci.edu', '/pub/ietf/uri/', 
    None, None, '#Related', 'Related')

Now, if you don't want to use an http URI, the best bet seems to be the URN, which can be parsed just as easily by the same regular expression:

>>> uri_regex.search("URN:foo:a123,456").groups()
('URN:', 'URN', None, None, 'foo:a123,456', None, None, None, None)
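The nine-element tuples above are a little hard to read, since only five of the groups carry the components of interest. In the URI specification's numbering, group 2 is the scheme, 4 the authority, 5 the path, 7 the query and 9 the fragment. The following is a minimal sketch that wraps the same regular expression and returns just those five pieces by name. The names UriParts and split_uri are hypothetical, chosen only for this example.

```python
import re
from collections import namedtuple

# The same regular expression as above, taken from the URI specification.
URI_REGEX = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical container for the five components.
UriParts = namedtuple("UriParts", "scheme authority path query fragment")

def split_uri(uri):
    """Split a URI into its five components; absent ones are None."""
    # Every part of the pattern is optional, so it matches any string.
    groups = URI_REGEX.match(uri).groups()
    # groups() is zero-indexed, so spec group N lives at index N - 1.
    return UriParts(groups[1], groups[3], groups[4], groups[6], groups[8])

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts.scheme, parts.authority, parts.fragment)
# http www.ics.uci.edu Related

print(split_uri("URN:foo:a123,456").path)
# foo:a123,456
```

Note that the expression returns the scheme exactly as written ('URN' here), so a caller comparing schemes would want to lower-case it first.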

Further reading