What have I gotten myself into?
When I first started looking at URI templates I was surprised no one had written a specification for them yet. It seemed so simple: "just" add {name} to the URI and then substitute a value at a later time. After bashing my head against the wall for a couple of weeks, here is a synopsis of the character encoding issues involved in doing URI and IRI Templates.
We have several open issues:
While this is a long post, I will only cover the issues involved in #1.
My overarching goal for URI Templates, and I believe this is necessary to make them a success, is to keep them simple by being opinionated, as Sam described it.
First let's dispel the notion that you can come up with the perfect URI-Template to URI translation mechanism that will always produce a valid URI regardless of the scheme. That last part, "regardless of the scheme", is the crux of the problem. While RFC 3986 defines what a URI looks like, schemes may impose further restrictions. For example, while
tel:bitworking.org
matches the ABNF in RFC 3986, it is not a valid tel: URI, and it never will be.
We have two choices:
http: or mailto:. Just for reference, here is a set of example URIs from RFC 3986:
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
As an aside, it turns out that the regular expression given in Appendix B of RFC 3986 is capable of parsing up URI Templates, but only if the characters allowed in template variable names are restricted, and only if template variables are not allowed to span components.
Here is a Python implementation that uses that regular expression:
import re

# Regular expression from Appendix B of RFC 3986.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse_uri(uri):
    """Parses a URI using the regex given in Appendix B of RFC 3986.

    (scheme, authority, path, query, fragment) = parse_uri(uri)
    """
    groups = URI.match(uri).groups()
    return (groups[1], groups[3], groups[4], groups[6], groups[8])
And if we run that over the example URIs with templated parts added in:
print(parse_uri("http://{server}/rfc/rfc2396.txt"))
print(parse_uri("ftp://ftp.is.co.za/{dir}/rfc1808.txt"))
print(parse_uri("ldap://[2001:db8::7]/c={country}?objectClass?one"))
print(parse_uri("mailto:{addr}"))
print(parse_uri("news:comp.infosystems.www.servers.{server}"))
print(parse_uri("tel:+{number}"))
print(parse_uri("telnet://192.0.2.16:{port}/"))
print(parse_uri("urn:oasis:names:specification:docbook:dtd:{version}"))
We get:
('http', '{server}', '/rfc/rfc2396.txt', None, None)
('ftp', 'ftp.is.co.za', '/{dir}/rfc1808.txt', None, None)
('ldap', '[2001:db8::7]', '/c={country}', 'objectClass?one', None)
('mailto', None, '{addr}', None, None)
('news', None, 'comp.infosystems.www.servers.{server}', None, None)
('tel', None, '+{number}', None, None)
('telnet', '192.0.2.16:{port}', '/', None, None)
('urn', None, 'oasis:names:specification:docbook:dtd:{version}', None, None)
This is important because it makes it easy to parse up a URI Template if we want to impose different escaping requirements on different components.
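To make that a little more concrete, here is a rough sketch of how you might find out which component each template variable lands in. The function name variables_by_component and the restriction of variable names to letters, digits and underscores are my own assumptions for illustration, not anything from the draft, and it only works when a variable does not span components.

```python
import re

# Regular expression from Appendix B of RFC 3986.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

# Assumption: variable names are limited to letters, digits and underscores,
# so a name can never contain ':', '/', '?' or '#'.
VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")

COMPONENTS = ("scheme", "authority", "path", "query", "fragment")

def variables_by_component(template):
    """Map each template variable name to the URI component it appears in."""
    groups = URI.match(template).groups()
    parts = (groups[1], groups[3], groups[4], groups[6], groups[8])
    found = {}
    for component, part in zip(COMPONENTS, parts):
        if part:
            for name in VARIABLE.findall(part):
                found[name] = component
    return found

print(variables_by_component("http://{server}/rfc/{doc}?format={fmt}"))
# {'server': 'authority', 'doc': 'path', 'fmt': 'query'}
```

With a mapping like this in hand, an expander could pick a different escaping rule per component.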
Certain characters are going to have to be %-encoded to ensure that filling in a URI-Template doesn't destroy the structure of the URI. For both URIs and IRIs the 'reserved' set of characters are the ones that are going to cause trouble and need to be escaped.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Each part of an IRI has its own acceptable chars:
scheme = ALPHA / DIGIT / "+" / "-" / "."
iauthority = ipchar
ipath = ipchar / "/"
iquery = ipchar / iprivate / "/" / "?"
ifragment = ipchar / "/" / "?"
where:
ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"
The rules are the same for URIs, except drop all the 'i's off the beginning of the names, and drop iprivate.
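As a sketch only, the URI version of the simplified table above can be written down as sets of characters, which makes it easy to ask which characters of a value could not appear literally in a given component. The names ALLOWED and needs_escaping are hypothetical, the sets follow the simplified rules in this post rather than the full RFC 3986 grammar, and '%' is deliberately left out so that a bare percent sign is flagged too.

```python
import string

# unreserved and sub-delims as defined in RFC 3986.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")

# Assumption: these follow the simplified per-component rules listed above,
# with the 'i' prefixes dropped; the scheme is left out.
ALLOWED = {
    "authority": PCHAR,
    "path": PCHAR | set("/"),
    "query": PCHAR | set("/?"),
    "fragment": PCHAR | set("/?"),
}

def needs_escaping(component, value):
    """Return the characters of value that may not appear literally in component."""
    return sorted(set(value) - ALLOWED[component])

print(needs_escaping("authority", "a/b"))  # ['/']
print(needs_escaping("path", "a/b"))       # []
```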
So let's begin with a simple approach: how about escaping all the characters in 'reserved'? If we do, then you can't do this:
http://example.org?{fred}
fred="q=2"
http://example.org?q=2
That might seem too restrictive, so let's make that example concrete.
http://www.google.com/search?q={term}
term="Ben&Jerrys"
If reserved characters are escaped then the URI Template expands to:
http://www.google.com/search?q=Ben%26Jerrys
That search gives you the results you would expect. If reserved characters are NOT escaped then you get a very different search result:
http://www.google.com/search?q=Ben&Jerrys
And that does *not* give the expected results.
So let's always escape, right? Not so fast. If we always escape reserved characters we get
mailto:{address}
address="joe@bitworking.org"
expanding to
mailto:joe%40bitworking.org
which is not what you want to happen.
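Both outcomes are easy to reproduce. What follows is a minimal sketch of the "always escape" approach in Python 3, assuming variable names made of letters, digits and underscores; expand_escape_all is a made-up name. Note that urllib.parse.quote with safe="" escapes everything outside 'unreserved', which is a little more than just the 'reserved' set.

```python
import re
from urllib.parse import quote

# Assumption: variable names are letters, digits and underscores only.
VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")

def expand_escape_all(template, values):
    """Fill in a template, %-encoding every character outside 'unreserved'."""
    def substitute(match):
        return quote(values[match.group(1)], safe="")
    return VARIABLE.sub(substitute, template)

print(expand_escape_all("http://www.google.com/search?q={term}",
                        {"term": "Ben&Jerrys"}))
# http://www.google.com/search?q=Ben%26Jerrys

print(expand_escape_all("mailto:{address}",
                        {"address": "joe@bitworking.org"}))
# mailto:joe%40bitworking.org
```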
Like I said, we can't come up with something guaranteed to generate only valid URIs unless we restrict ourselves to a particular scheme, which isn't as useful as defining templates for all URIs. So what if we pick a subset of 'reserved' that does not get %-encoded? Can we pick a subset that produces the least surprising results? Here is my suggestion, to escape all the characters in 'reserved' except the following three:
'@' / ':' / '/'
The above subset seems to generate the 'least surprising' results:
http://bitworking.org/{path}
path="projects/httplib2/"
http://bitworking.org/projects/httplib2/
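Here is a rough sketch of that suggestion in Python 3, where the only change from escaping everything is that '@', ':' and '/' are passed through untouched. The name expand and the variable-name pattern are assumptions of mine for illustration.

```python
import re
from urllib.parse import quote

# Assumption: variable names are letters, digits and underscores only.
VARIABLE = re.compile(r"\{([A-Za-z0-9_]+)\}")

# The three reserved characters that are NOT %-encoded.
SAFE = "@:/"

def expand(template, values):
    """Fill in a template, leaving '@', ':' and '/' unescaped in the values."""
    def substitute(match):
        return quote(values[match.group(1)], safe=SAFE)
    return VARIABLE.sub(substitute, template)

print(expand("http://bitworking.org/{path}", {"path": "projects/httplib2/"}))
# http://bitworking.org/projects/httplib2/

print(expand("mailto:{address}", {"address": "joe@bitworking.org"}))
# mailto:joe@bitworking.org

print(expand("http://www.google.com/search?q={term}", {"term": "Ben&Jerrys"}))
# http://www.google.com/search?q=Ben%26Jerrys
```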
Like I said, it's not perfect:
http://{sub}.example.org/index.html
sub="a/b"
http://a/b.example.org/index.html
Which is clearly an invalid URI. So do we give special escaping rules for authority? That at least makes the results match the URI syntax, but for the HTTP scheme the string a%2Fb.example.org isn't a valid domain name. And don't even get me started on how this could go bad if you allowed template variables in the scheme:
{scheme}://bitworking.org
scheme="gopher"
gopher://bitworking.org
On the other hand, I could see useful applications:
http{ssl}://bitworking.org
ssl="s"
https://bitworking.org
So we have a few possibilities:
Just as another aside, I am no longer afraid of IRIs.
Let's start with IRIs since those are actually simpler, and let's also assume that we choose #1 of the options above:
Algorithm:
http://example.org/{blah}
Note that we could use the same algorithm for URI Templates as long as we add a fourth step:
Hopefully reading this has been as helpful for you as writing it has been for me, and some of the subtle issues in character handling that need to be more strictly specified in the next revision of the specification are clearer. I also posted this to the W3C URI mailing list so feel free to follow up there with any comments.
2006-12-22