BitWorking

1812 Technical Overview

The source for 1812, the code that now runs this blog is now available. Note that I don't expect you to use it, the code is highly customized to my needs, but there may be cool bits you want to borrow.

First the name: 1812. I listened to the 1812 Overture, repeatedly, while coding, so thus the name.

Store Format

The underlying datastore in 1812 is a flat file database that has one file per entry. The format is an RFC 2822 file with headers at the beginning followed by [crlf][crlf] and then the body. The body contains the summary and the content, in that order, separated by a single ^L character. Both the summary and content are well-formed XHTML - at least when stuffed into a div - and are serialized as utf-8. You should note that ^L is not a valid unicode XML character so we don't need to worry about those appearing in the summary or content.

Entries are sorted by both their published and their updated times. This is done by populating directories with sym-links to the entry files. The name of each sym-link is the published/updated time in ISO format. That means getting a list of entries in updated order is a matter of getting a listing of a directory and sorting the names in alphabetical order.

URIs

Another aspect of 1812 is the URI structure. I have wrestled with this and if any one issue has stalled a rollout of a new set of software for this blog then URI structure would have to be it. There's always a tension between Cool URIs that don't change and the proper generation of unique ids. All of my old URIs are of the form /news/[id] where the id was some 'slug'ified version of the title. While that works, you run into problems if you generate a second post with the same title. What do you do, start appending numeric counters? /news/stuff, /news/stuff-2, /news/stuff-3, etc. What if the title of my post is 'Shrek-2', does my second post on 'Shrek-2' go to 'Shrek-3' or 'Shrek-2-1'? You can see the rathole looming large. On the other hand, generating URIs completely off a database index leads to un-Cool, or at least search unfriendly, URIs: /news/1, /news/2, etc.

The solution was partly inspired by Amazon, which recently starting adding slugs to their book URIs. For example:

http://www.amazon.com/Gravitys-Rainbow-Thomas-Pynchon/dp/0140283382/

Note the search friendly "Gravitys-Rainbow-Thomas-Pynchon" slug in the path. The problem with the Amazon solution is that they don't enforce the slug, that is, you can change it to anything you want as the id code at the end of the URI is the only thing used to look up the item. The following URI leads to the same place:

http://www.amazon.com/I-could-say-nasty-stuff-here/dp/0140283382/

That's a little too loose and open to mischief, so 1812 takes a hybrid approach, with URIs of the form:

/news/[id]/[slug]

The code uses the id to look up the entry and then looks up the slug in the entry to make sure it matches. If it doesn't then it generates a 404. That gives me cool URIs, fast look up, and yet doesn't let anybody monkey with my URIs.

To keep my old URIs still working I allowed an entry to have a blank slug, a feature I used when importing my old entries into the new system.

Comments

Once you have a generic store that's easy to use, well, that's a hammer, and when you have a hammer everything starts to look like a nail. Comments are just such a case. For each entry in the main blog there is an associated collection of comments, i.e. one collection per entry. Collections are cheap.

Other bits

There are other bits I will cover later, but the whole thing is built using Robaccia (and thus WSGI), and Selector. I'm currently using both Kid and Cheetah as I haven't decided which templating package to use, and may end up using more than one in the end.

2007-01-02