Tim Bray On Search APIs

Tim Bray has posted a nice summary of what he thinks a search API should look like: On Search: Interfaces

Most of it I agree with, but I think there are advantages to having it a bit more RESTful.

I agree with his query interface, though I obviously have a bias towards the result set being expressed as an Atom feed.

The API for managing postings isn't as RESTful as I think it could be. In his proposal Tim uses an attribute 'op' that is used to indicate which action is to take place. It would be better if the verb were moved out of the body and into the HTTP method.

Initially, adding a new resource could still be done using a POST, but the response would be a status code 303 with a Location: header containing a URI. Here is an example 'add' request:

POST /cgi-bin/add.cgi HTTP/1.1
Host: 127.0.0.1:8085

<update href="http://example.com/herman">
 <posting word="call" wnum="0" />
 <posting word="me" wnum="1" />
</update>

And the response just refers to the URI that was just created:

HTTP/1.1 303 See Other
Content-Length: XXX
Location: http://127.0.0.1:8085/index/1
Content-type: text/plain

Entry created in the index.

The URI returned in the Location: header (http://127.0.0.1:8085/index/1) identifies that resource in the search engine. Do a GET on that URI to retrieve an XML document that describes the current state of that index entry. Do a PUT with an updated document to update the list of postings. And finally, an HTTP DELETE on that URI will remove the resource from the index.
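The lifecycle above can be sketched without any network code. This is only an illustration under stated assumptions: the class name `IndexStore`, the `handle` method, the status codes chosen for PUT and DELETE, and the use of a bare path (rather than an absolute URI) in the Location: header are all mine, not part of Tim's proposal.

```python
import xml.etree.ElementTree as ET


def _well_formed(body):
    """Return True if body is a non-empty, parseable XML string."""
    if not body:
        return False
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


class IndexStore:
    """Hypothetical in-memory stand-in for the search engine's index."""

    def __init__(self):
        self.entries = {}  # path -> XML document describing the postings
        self.next_id = 1

    def handle(self, method, path, body=None):
        """Dispatch on the HTTP method alone; return (status, headers, body)."""
        if method == "POST" and path == "/cgi-bin/add.cgi":
            if not _well_formed(body):
                return 400, {}, "Malformed XML."
            new_path = f"/index/{self.next_id}"
            self.next_id += 1
            self.entries[new_path] = body
            return 303, {"Location": new_path}, "Entry created in the index."
        if path not in self.entries:
            return 404, {}, "No such index entry."
        if method == "GET":
            return 200, {"Content-Type": "application/xml"}, self.entries[path]
        if method == "PUT":
            if not _well_formed(body):
                return 400, {}, "Malformed XML."
            self.entries[path] = body
            return 200, {}, "Entry updated."
        if method == "DELETE":
            del self.entries[path]
            return 204, {}, ""
        return 405, {"Allow": "GET, PUT, DELETE"}, "Method not allowed."


# Walk one entry through its whole life: create, read, update, remove.
store = IndexStore()
doc = '<update href="http://example.com/herman"><posting word="call" wnum="0" /></update>'
status, headers, _ = store.handle("POST", "/cgi-bin/add.cgi", doc)
entry = headers["Location"]
fetched = store.handle("GET", entry)
updated = store.handle("PUT", entry, doc.replace("call", "me"))
removed = store.handle("DELETE", entry)
gone = store.handle("GET", entry)
```

Note that nothing in `handle` inspects the body to decide which action to take; the method carries the verb, which is the whole point of the reformulation.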

How is this reformulation better? First, a GET is used to retrieve the current status of an index entry. Those responses can be optimized using caching, gzip, and ETags, thus reducing the bandwidth used. Secondly, it gives each entry in the index its own URI, which is a handy handle to have. Thirdly, since the state of an index entry is retrievable by a GET, it can be combined easily with other web services. Lastly, since DELETE is used to remove an entry from the index, proxies and other intermediaries along the way have an opportunity to remove the item from their caches. This last point is a benefit of uniform semantics: the intermediaries can take the appropriate action based on the HTTP method DELETE without having to be programmed to understand the particulars of the content being passed in the request body.
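To make the ETag point concrete, here is a minimal sketch of a conditional GET on an index entry. The function names and the hash-based ETag scheme are assumptions for illustration, and the comparison is simplified: it handles a single tag in If-None-Match, not a list or `*`.

```python
import hashlib
from typing import Optional


def make_etag(document: str) -> str:
    """Derive an ETag from the entry's content (one possible scheme)."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()[:16]
    return '"' + digest + '"'


def conditional_get(document: str, if_none_match: Optional[str] = None):
    """Return (status, headers, body) for a GET on an index entry."""
    etag = make_etag(document)
    if if_none_match == etag:
        # The client's cached copy is current: send no body at all.
        return 304, {"ETag": etag}, ""
    return 200, {"ETag": etag}, document


entry_doc = '<update href="http://example.com/herman"><posting word="call" wnum="0" /></update>'
first = conditional_get(entry_doc)
# The client replays the ETag it was given on the next request.
second = conditional_get(entry_doc, if_none_match=first[1]["ETag"])
# After the entry changes, the old ETag no longer matches.
third = conditional_get(entry_doc.replace("call", "me"), if_none_match=first[1]["ETag"])
```

The bandwidth saving comes from the second case: an unchanged entry costs a status line and a few headers rather than the full document.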

I agree emphatically, Joe. While having the interface to a search engine on the web is a great idea, the way Tim put 'op' inside the XML struck me as a red flag.

Posted by Vincent D Murphy on 2003-11-30

Hmm... I can see the point, but: First of all, if you're going to make the index status for some URI visible as a resource, why force external software to remember that name? I'd just do

http://searchgine.example.com/index-status?r=http://example.com/herman

I can certainly see doing GETs on that.

On the other hand, it does smell like optimizing for the uncommon case; direct GET access to the index status is not apt to be that interesting for most cases.

Also, I do think you need a verb for erasing postings.

Finally, just from the simplicity-of-understanding point of view: I think the resource, in the minds of the users, is the search engine itself. So POSTing to that resource is the most natural way to change its status. Lots more room for further discussion, but at the moment I'd lean to making the index status available as a resource, as you suggest, but running updates through POST to the search engine resource, not the index resource.

Posted by Tim Bray on 2003-11-30

Here's another simplification, going from Tim's URI format. Instead, just POST the postings as follows:

POST /edit.cgi?r=http://example.com/herman HTTP/1.1
Host: searchgine.example.com

In the above example, word zero was (re)set and word one was unset. This can be used to create, edit, or even delete the entry (by unsetting all set words). Of course, support for DELETE would likely make more sense for this purpose. As for GET, it could return a document that contained the "unset" postings, or could leave them out.

This all assumes that you can have only one posting per value of "wnum", which may be an invalid assumption.

Posted by Seairth Jacobs on 2003-11-30

Seairth: 'This all assumes that you can have only one posting per value of "wnum", which may be an invalid assumption.'

You could make that assumption, but it would close off a lot of interesting strategies. For example, I might want to record two or three postings for synonyms or variant spellings or whatever of the same word: consider quebec and québec. Of course, you can do that kind of synonym processing at search rather than index time, but it would be nice optionally not to have to.

This is why I think you need a delete-postings verb.

The more I think about it the less I am impressed with the notion of treating the indexing for a resource as another resource. The important resources offered by the search engine, the reason it exists, are search results. Anything that adds the slightest difficulty or complexity to doing a good job on that has to have huge value to be cost-effective.

Posted by Tim Bray on 2003-12-01

Actually, the format could be:

<update>
<posting wnum="2">
<word>ishmael</word>
<word>ishmal</word>
<word>ishmale</word>
</posting>
</update>

This way, you could list multiple words for a given wnum, but otherwise have the same capabilities as I described above.
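As a rough sketch of how an indexer might read this format, the following groups the words by wnum. The function name `parse_postings` and the returned dictionary shape are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def parse_postings(xml_text):
    """Return {wnum: [word, ...]} for an <update> document."""
    root = ET.fromstring(xml_text)
    postings = {}
    for posting in root.findall("posting"):
        wnum = int(posting.get("wnum"))
        words = [w.text for w in posting.findall("word")]
        # Merge rather than overwrite, in case a wnum appears more than once.
        postings.setdefault(wnum, []).extend(words)
    return postings


sample = """<update>
<posting wnum="2">
<word>ishmael</word>
<word>ishmal</word>
<word>ishmale</word>
</posting>
</update>"""
result = parse_postings(sample)
```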

Posted by Seairth on 2003-12-01