URI and IRI Templates, Oy

Joe Gregorio

What have I gotten myself into?

When I first started looking at URI templates I was surprised no one had written a specification for them yet. It seemed so simple: "just" add {name} to the URI and then substitute a value at a later time. After bashing my head against the wall for a couple of weeks, here is a synopsis of the character encoding issues involved in doing URI and IRI Templates.

We have several open issues:

  1. Deciding which characters to escape.
  2. Reserving some character in template variable names for future use, à la ':' for XML namespaces.

While this is a long post, I will only cover the issues involved in #1.

My over-arching goal for URI Templates, and I believe this is necessary to make them a success, is to make them simple by being opinionated, as Sam described it.

Grounding

First let's dispel the notion that you can come up with the perfect URI-Template to URI translation mechanism that will always produce a valid URI regardless of the scheme. That last part, "regardless of the scheme", is the crux of the problem. While RFC 3986 defines what a URI looks like, schemes may impose further restrictions. For example, while

   tel:bitworking.org 

matches the ABNF in RFC 3986, it is not a valid tel: URI, and it never will be.
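
To make the distinction concrete, here is a minimal sketch that splits both strings with Python's generic URI parser and then applies a scheme-specific check. The TEL_GLOBAL pattern and the looks_like_global_tel helper are my own rough assumptions for illustration; they only approximate the "global number" form of tel: URIs and ignore local numbers entirely.

```python
import re
from urllib.parse import urlsplit

# Assumption: a rough approximation of a global tel: number, i.e. "+",
# then digits and visual separators, then optional ";parameters".
TEL_GLOBAL = re.compile(r"\+[0-9().-]*[0-9][0-9().-]*(;.*)?")

def looks_like_global_tel(uri):
    """Hypothetical scheme-specific check layered on the generic split."""
    parts = urlsplit(uri)
    return parts.scheme == "tel" and TEL_GLOBAL.fullmatch(parts.path) is not None

# Both strings split cleanly under the generic syntax...
print(urlsplit("tel:bitworking.org").path)
print(urlsplit("tel:+1-816-555-1212").path)

# ...but only one of them passes the scheme-specific check.
print(looks_like_global_tel("tel:bitworking.org"))
print(looks_like_global_tel("tel:+1-816-555-1212"))
```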

We have two choices:

  1. Define a mechanism that is only guaranteed to meet the URI syntax (i.e. RFC 3986), and thus potentially generate URIs that are invalid in some schemes.
  2. Restrict ourselves to URIs of a particular scheme such as http: or mailto:.

Just for reference, here is a set of example URIs from RFC 3986:

ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2

Serendipity

As an aside, it turns out that the regular expression given in Appendix B of RFC 3986 is capable of parsing up URI Templates, but only if the characters allowed in template variable names are restricted, and only if template variables are not allowed to span components.

Here is a Python implementation that uses that regular expression:


import re

URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse_uri(uri):
    """Parses a URI using the regex given in Appendix B of RFC 3986.

        (scheme, authority, path, query, fragment) = parse_uri(uri)
    """
    groups = URI.match(uri).groups()
    return (groups[1], groups[3], groups[4], groups[6], groups[8])

And if we run that over the example URIs with templated parts added in:


print(parse_uri("http://{server}/rfc/rfc2396.txt"))
print(parse_uri("ftp://ftp.is.co.za/{dir}/rfc1808.txt"))
print(parse_uri("ldap://[2001:db8::7]/c={country}?objectClass?one"))
print(parse_uri("mailto:{addr}"))
print(parse_uri("news:comp.infosystems.www.servers.{server}"))
print(parse_uri("tel:+{number}"))
print(parse_uri("telnet://192.0.2.16:{port}/"))
print(parse_uri("urn:oasis:names:specification:docbook:dtd:{version}"))

We get:


('http', '{server}', '/rfc/rfc2396.txt', None, None)
('ftp', 'ftp.is.co.za', '/{dir}/rfc1808.txt', None, None)
('ldap', '[2001:db8::7]', '/c={country}', 'objectClass?one', None)
('mailto', None, '{addr}', None, None)
('news', None, 'comp.infosystems.www.servers.{server}', None, None)
('tel', None, '+{number}', None, None)
('telnet', '192.0.2.16:{port}', '/', None, None)
('urn', None, 'oasis:names:specification:docbook:dtd:{version}', None, None)

This is important because it makes it easy to parse up a URI Template if we want to impose different escaping requirements on different components.
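
As a rough sketch of what that could look like, the function below reports which component each template variable landed in, which is the information you would need before picking per-component escaping rules. The names VARIABLE, COMPONENTS and variables_by_component are hypothetical, and the sketch assumes variable names never contain ':', '/', '?' or '#' and never span components.

```python
import re

# The regular expression from Appendix B of RFC 3986.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Assumption: a template variable is any run of non-brace characters in {}.
VARIABLE = re.compile(r"\{([^{}]+)\}")
COMPONENTS = ("scheme", "authority", "path", "query", "fragment")

def variables_by_component(template):
    """Map each template variable name to the component it appears in."""
    groups = URI.match(template).groups()
    parts = (groups[1], groups[3], groups[4], groups[6], groups[8])
    found = {}
    for component, text in zip(COMPONENTS, parts):
        # Components that are absent come back as None from the regex.
        for name in VARIABLE.findall(text or ""):
            found[name] = component
    return found

print(variables_by_component("telnet://192.0.2.16:{port}/"))
# {'port': 'authority'}
print(variables_by_component("http://www.google.com/search?q={term}"))
# {'term': 'query'}
```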

What to %-encode

Certain characters are going to have to be %-encoded to ensure that filling in a URI-Template doesn't destroy the structure of the URI. For both URIs and IRIs the 'reserved' set of characters are the ones that are going to cause trouble and need to be escaped.

   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Each part of an IRI has its own acceptable chars:

     scheme         = ALPHA / DIGIT / "+" / "-" / "." 
     iauthority     = ipchar
     ipath          = ipchar  / "/"
     iquery         = ipchar / iprivate / "/" / "?" 
     ifragment      = ipchar / "/" / "?" 

where:

     ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"

The rules are the same for URIs, except drop all the 'i's off the beginning of the names, and drop iprivate.

So let's begin with a simple approach: how about escaping all the characters in 'reserved'? If we do, then you can't do this:

   http://example.org?{fred}
   fred="q=2"

   http://example.org?q=2

That might seem too restrictive, so let's make that example concrete.

   http://www.google.com/search?q={term}
   term="Ben&Jerrys"

If reserved characters are escaped then the URI Template expands to:

   http://www.google.com/search?q=Ben%26Jerrys

That search gives you the results you would expect. If reserved characters are NOT escaped then you get a very different search result:

   http://www.google.com/search?q=Ben&Jerrys

And that does *not* give the expected results.
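
Here is a minimal sketch of the "escape everything in 'reserved'" approach using urllib.parse.quote. The function name expand_escaping_all is hypothetical, and note that quote with safe="" is slightly stricter than the rule described here: it escapes every character outside 'unreserved', not just the reserved ones.

```python
from urllib.parse import quote

def expand_escaping_all(template, values):
    """Substitute {name} with its value, %-encoding all reserved characters."""
    result = template
    for name, value in values.items():
        # safe="" means no reserved character is left unescaped.
        result = result.replace("{" + name + "}", quote(value, safe=""))
    return result

print(expand_escaping_all("http://www.google.com/search?q={term}",
                          {"term": "Ben&Jerrys"}))
# http://www.google.com/search?q=Ben%26Jerrys

print(expand_escaping_all("http://example.org?{fred}", {"fred": "q=2"}))
# http://example.org?q%3D2
```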

So let's always escape, right? Not so fast. If we always escape reserved characters we get

   mailto:{address}
   address="joe@bitworking.org"

expanding to

   mailto:joe%40bitworking.org

which is not what you want to happen.

Like I said, we can't come up with something guaranteed to generate only valid URIs unless we restrict ourselves to a particular scheme, which isn't as useful as defining templates for all URIs. So what if we pick a subset of 'reserved' that does not get %-encoded? Can we pick a subset that produces the least surprising results? Here is my suggestion: escape all the characters in 'reserved' except the following three:

  '@' / ':' / '/'

The above subset seems to generate the 'least surprising' results:

  • Our Ben&Jerrys query to Google still works.
  • The mailto: example works.
  • HTTP paths also work as expected:
       http://bitworking.org/{path}
       path="projects/httplib2/"
    
       http://bitworking.org/projects/httplib2/
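
The three cases above can be checked with a small sketch. It uses urllib.parse.quote with '@', ':' and '/' marked as safe; the function name expand_keeping_some is hypothetical, and as before quote also escapes characters that are neither reserved nor unreserved (spaces, for instance).

```python
from urllib.parse import quote

def expand_keeping_some(template, values, keep="@:/"):
    """Substitute {name}, %-encoding reserved characters except those in keep."""
    result = template
    for name, value in values.items():
        result = result.replace("{" + name + "}", quote(value, safe=keep))
    return result

print(expand_keeping_some("http://www.google.com/search?q={term}",
                          {"term": "Ben&Jerrys"}))
# http://www.google.com/search?q=Ben%26Jerrys

print(expand_keeping_some("mailto:{address}",
                          {"address": "joe@bitworking.org"}))
# mailto:joe@bitworking.org

print(expand_keeping_some("http://bitworking.org/{path}",
                          {"path": "projects/httplib2/"}))
# http://bitworking.org/projects/httplib2/
```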
    

Like I said, it's not perfect:

   http://{sub}.example.org/index.html
   sub="a/b"

   http://a/b.example.org/index.html

The result is clearly not the URI we meant: the '/' ends the authority at "a" and pushes the rest of the host name into the path. So do we give special escaping rules for authority? That at least keeps the value inside the authority component, but for the HTTP scheme the string a%2Fb.example.org isn't a valid domain name. And don't even get me started on how this could go bad if you allowed template variables in the scheme:

   {scheme}://bitworking.org
   scheme="gopher"

   gopher://bitworking.org

On the other hand, I could see useful applications:

   http{ssl}://bitworking.org
   ssl="s"

   https://bitworking.org

So we have a few possibilities:

  1. Escape all 'reserved' characters except @, :, and / across every component, realizing we may not end up with a valid URI.
  2. Escape all 'reserved' characters except @ and :, realizing that our 'path' example will then break since '/' will get escaped.
  3. Escape all 'reserved' characters except @, :, and /, but only allow template variables in path, query and fragment components.
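
Option 3 implies some way of rejecting templates that put variables in the scheme or authority. One possible check, sketched here with the Appendix B regular expression, is shown below; allowed_under_option_3 is a hypothetical name, and the sketch assumes braces only ever appear as template variable delimiters.

```python
import re

# The regular expression from Appendix B of RFC 3986.
URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def allowed_under_option_3(template):
    """True if no template variable appears in the scheme or authority."""
    groups = URI.match(template).groups()
    # Absent components are None, so fall back to the empty string.
    scheme, authority = groups[1] or "", groups[3] or ""
    return not any(ch in "{}" for ch in scheme + authority)

print(allowed_under_option_3("http://bitworking.org/{path}"))         # True
print(allowed_under_option_3("mailto:{address}"))                     # True
print(allowed_under_option_3("http://{sub}.example.org/index.html"))  # False
print(allowed_under_option_3("http{ssl}://bitworking.org"))           # False
```

Note that this rules out the http{ssl} trick along with the problematic cases, which is the price of option 3.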

IRIs

Just as another aside, I am no longer afraid of IRIs.

The Algorithm

Let's start with IRIs since those are actually simpler, and let's also assume that we choose #1 of the options above:

  1. Escape all 'reserved' characters except @, :, and / across every component, realizing we may not end up with a valid URI.

Algorithm:

  1. Start with an IRI Template (noting that URIs are also IRIs):
           http://example.org/{blah}
    
  2. Percent-encode every character in the values of the template variables that isn't in ( iprivate | iunreserved | '@' | ':' | '/' )
  3. Substitute variables with their values, which produces an IRI.
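
The three steps above can be sketched as follows. The ucschar and iprivate ranges are taken from the ABNF in RFC 3987; the function names are hypothetical, the variable syntax is assumed to be any run of non-brace characters inside {}, and encoding the escaped characters as UTF-8 octets is my assumption as well. A variable with no supplied value raises KeyError in this sketch.

```python
import re
import string

# ucschar and iprivate ranges from the ABNF in RFC 3987.
UCSCHAR = ([(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
           + [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
           + [(0xE1000, 0xEFFFD)])
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# The ASCII part of iunreserved, plus the three exceptions '@', ':' and '/'.
ASCII_KEEP = set(string.ascii_letters + string.digits + "-._~" + "@:/")

def is_kept(ch):
    """True if ch is in ( iprivate | iunreserved | '@' | ':' | '/' )."""
    if ch in ASCII_KEEP:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)

def encode_value(value):
    """Step 2: percent-encode (as UTF-8 octets) every character not kept."""
    out = []
    for ch in value:
        if is_kept(ch):
            out.append(ch)
        else:
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)

def expand_iri_template(template, values):
    """Step 3: substitute each {name} with its encoded value."""
    return re.sub(r"\{([^{}]+)\}",
                  lambda m: encode_value(values[m.group(1)]), template)

print(expand_iri_template("http://example.org/{blah}", {"blah": "Ben&Jerrys"}))
# http://example.org/Ben%26Jerrys
```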

Note that we could use the same algorithm for URI Templates as long as we add a fourth step:

  4. Convert the IRI to a URI following Section 3.1 of RFC 3987.
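
A simplified sketch of that conversion step is below. It only does the core of the mapping, percent-encoding each non-ASCII character as its UTF-8 octets; it assumes the IRI is already a normalized Unicode string and does not attempt the alternative IDNA conversion that RFC 3987 allows for host names. The name iri_to_uri is hypothetical.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character as its UTF-8 octets."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters are already legal or already escaped.
            out.append(ch)
        else:
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)

print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
# http://example.org/r%C3%A9sum%C3%A9
```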

Hopefully reading this has been as helpful for you as writing it has been for me, and has made clearer some of the subtle character-handling issues that need to be more strictly specified in the next revision of the specification. I also posted this to the W3C URI mailing list, so feel free to follow up there with any comments.
