Recently I commented that a Bloom Filter would be a good choice for a resource representation in response to Roy's comments on Evan and Kellan's presentation at OSCON.
If you aren't familiar with Bloom filters, the description from Wikipedia is a good start:
The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.
So here's the motivating example from Evan and Kellan's presentation:
On July 21st, 2008, friendfeed crawled flickr 2.9 million times to get the latest photos of 45,754 users, of which 6,721 of that 45,754 potentially uploaded a photo.
Their conclusion is that polling sucks. Roy points out that polling sucks if you are polling the wrong kind of resource, in this case 45,754 individual feeds, as opposed to a composite resource which represents the updated status of all the users on Flickr, and he presents a solution which uses a GIF as a bit vector representation. My proposal is to use a Bloom filter, which has a different set of characteristics from the bit vector representation.
Now ideas are nice, but my note was nothing more than an admonishment to "just" use Bloom filters. Working code would be better, so I threw together a very simple Bloom filter implementation. It is simple because I fixed many of the parameters that go into a Bloom filter, fully realizing that a more general solution would have a more complex implementation, but all of the analysis I do will still apply.
In building this implementation I made a couple of assumptions:
I will presume that those 6,721 updates did not all come in the same hour, but were instead distributed over a full 24-hour period. If so, then a Bloom filter tuned to contain 3,000 keys with a false positive rate of 1% would be more than sufficient. We are going to have a simple system that creates a resource for each hour of the day representing the users that have uploaded photos in the past hour. We can use a simple URI scheme for each hour of the day:
http://example.org/2008/09/03/00
http://example.org/2008/09/03/01
...
http://example.org/2008/09/03/23
Each one of those URIs identifies a resource that contains a Bloom filter holding all the user IDs of the feeds that have been updated in that hour. That is, the keys of the Bloom filter are user IDs, and to use the system you would download the Bloom filter for the last hour and then test for membership of the IDs you are interested in. If you get a positive result for the membership test, then you would go off and retrieve that user's feed to get the latest entries. The URI layout will allow you to go back in time if you missed an hour.
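As a sketch of the client side, here is how you might compute the URI of the most recently completed hour's filter. The example.org host and date layout come from the scheme above; the helper name is mine:

```python
from datetime import datetime, timedelta

def latest_filter_uri(now=None):
    # Hypothetical helper: build the URI of the most recently
    # completed hour's Bloom filter under the scheme above.
    now = now or datetime.utcnow()
    last_hour = now - timedelta(hours=1)
    return last_hour.strftime("http://example.org/%Y/%m/%d/%H")
```

A client would fetch that resource once an hour, test its IDs of interest against the filter, and only then fetch the individual feeds that tested positive.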
We'll come back to the representation we'll use for the Bloom filter a little later.
At first glance, the description of the Bloom filter on Wikipedia makes implementation seem difficult, with its requirement to compute k hashes for each key you add to the filter. If we just want to add names, it seems we have to find a bunch of different hash functions to generate the k bit indexes, but the article describes a great simplification:
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k-1) to a hash function that takes an initial value; or add (or append) these values to the key.
So my solution is to use a single hash function, sha1, which generates a 160-bit hash, and then chop up the 160 bits into 7 hashes, each 20 bits long. If you need more than 160 bits you could use sha256 or sha512, or, barring that, apply sha1 to sequentially modified copies of the key:
sha1("0" + key)
sha1("1" + key)
...
sha1("9" + key)
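The slicing approach can be sketched in a few lines: treat the sha1 digest as one big integer and peel off seven 20-bit fields. The function name here is mine, for illustration:

```python
import hashlib

def sha1_slices(key):
    # Interpret the 160-bit sha1 digest as a single integer...
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    # ...and slice off seven 20-bit fields (0xFFFFF masks the low
    # 20 bits) to serve as seven "different" hash values.
    return [(digest >> (i * 20)) & 0xFFFFF for i in range(7)]
```

Each slice is a number between 0 and 2^20 - 1, which is more than enough range to index into the filter described below.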
When working with a Bloom filter you have control over several parameters that determine the behavior of the filter. The first parameter is p, the false positive rate, which I've set to 1%: if you test for any value that isn't in the set, there's a 1% chance the test will come back positive, falsely indicating that the key is in the set. The second parameter is n, the number of keys that will be in the set, which in this case I've set to 3,000. You can add fewer than 3,000 keys to this Bloom filter and the false positive rate will be lower, or add more than 3,000 keys and the false positive rate will be higher. The 1% false positive rate applies when there are exactly 3,000 keys in the filter.
We are now left with figuring out two more parameters for the Bloom filter. The parameter m is the number of bits in the filter, and k is the number of hash functions to use. The Wikipedia article gives a formula for calculating the value of k that minimizes the false positive rate, and a Bloom filter with a 1% error rate and an optimal value for k needs only about 9.6 bits per key, so we'll pick a nice round 30,000 for m. From the formula we determine that the optimal number of hash functions k is 7.
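Those numbers are easy to check with the standard Bloom filter formulas; this sketch just plugs in our n and p:

```python
import math

n = 3000   # expected number of keys in the filter
p = 0.01   # target false positive rate

# Optimal bits per key for error rate p: -ln(p) / (ln 2)^2,
# which works out to roughly 9.6 bits per key for p = 1%.
bits_per_key = -math.log(p) / math.log(2) ** 2

# With a round m = 30,000 bits and n = 3,000 keys, the optimal
# number of hash functions is k = (m / n) * ln 2 ~= 6.93,
# which rounds to 7.
m = 30000
k = (m / n) * math.log(2)
```

Note that 30,000 bits is slightly more than the optimal 9.6 × 3,000 ≈ 28,800, which only helps: a little extra room pushes the false positive rate slightly below 1%.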
The last hurdle is coming up with a data structure to store 30,000 bits, but that's easy in Python since long integers can be of any length. That makes setting bits in the filter rather trivial. If the hash function produces a number between 0 and 29,999, then the Python code to set that bit in the filter is:
filter |= 2 ** hash(key)
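Putting the hashing and the bit-setting together, a minimal fixed-parameter filter along these lines might look like the following sketch. The class and method names are illustrative, not the actual bloomnames API, and `1 << h` is just the shift form of `2 ** h`:

```python
import hashlib

class BloomSketch(object):
    """Illustrative fixed-parameter Bloom filter: m = 30,000 bits,
    k = 7 hashes sliced from a single sha1 digest."""

    def __init__(self):
        self.filter = 0  # a Python long holds all 30,000 bits

    def _hashes(self, key):
        # Slice the 160-bit digest into seven 20-bit values, then
        # reduce each mod 30,000 to get a bit index in the filter.
        digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
        return [((digest >> (i * 20)) & 0xFFFFF) % 30000 for i in range(7)]

    def add(self, key):
        for h in self._hashes(key):
            self.filter |= 1 << h  # same as filter |= 2 ** h

    def __contains__(self, key):
        # A key is "in" the filter only if all 7 of its bits are set.
        return all(self.filter & (1 << h) for h in self._hashes(key))
```

One small wrinkle the sketch glosses over: reducing a 20-bit slice mod 30,000 introduces a slight bias toward lower bit positions, which nudges the false positive rate up a hair but doesn't change the analysis.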
Using longs also makes coming up with a representation pretty easy: we can just use the hexadecimal representation of the long that we are using for the filter:
representation = hex(filter)
For the sake of extensibility we should stuff that hex string inside a JSON or XML document, but for now for the sake of exposition I'm going to use the raw hex string.
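For instance, a JSON envelope might carry the filter's parameters alongside the hex string so a client knows how to interpret it. Every field name here is invented for illustration, not part of any published format:

```python
import json

# Hypothetical envelope around the raw hex string; all field
# names are made up for this example.
envelope = json.dumps({
    "m": 30000,        # number of bits in the filter
    "k": 7,            # number of hash functions
    "hash": "sha1",    # how the k hashes are derived
    "filter": hex(0),  # the filter's hex string would go here
})
```

Shipping the parameters with the filter means clients don't have to hard-code m and k, which leaves room to retune them later without breaking anyone.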
And here is how simple it is to use:
>>> import bloomnames
>>> b = bloomnames.BloomNames()
>>> b.add("jcgregorio")
>>> b.add("barney")
>>> "fred" in b
False
>>> "jcgregorio" in b
True
>>> b.getfilter()
20000708816...675924992L
>>> hex(b.getfilter())
'0x8000000000....0000000L'
You will notice that with only two members in the filter the hexadecimal representation is mostly zeroes, since there will be at most 14 non-zero nibbles. In general, the fewer members you add to the filter, the better it compresses. If this representation is served up over HTTP and the client accepts gzip compression, then we get a huge savings:
>>> import zlib
>>> len(hex(b.getfilter()))
7188
>>> len(zlib.compress(hex(b.getfilter())))
87
Now compressibility was also one of the advantages of Roy's GIF-based solution, so this isn't unique to the Bloom filter representation, but the Bloom filter does have some other advantages. The first is that I don't have to publish the universal set, as you would for the bit vector implementation: you can take any string and test for its membership in this Bloom filter. It turns out that the formulas we used to calculate the number of bits m and the optimal number of hash functions k don't depend on the size of the universal set. If you wanted to protect the list of members, or at least not publish a directory of everyone that uses your service, you would prefer the Bloom filter based solution, since you aren't required to publish the universal set and the Bloom filter doesn't contain the list of keys in a discoverable format.
Another shocking realization is that the formulas also don't depend on the size of the keys you are storing in the filter. While in this example only short user names are being stored, you could just as easily store absolute URIs to the feeds, or even the text of entire books, in the Bloom filter and the calculations would remain the same. Where you pay for this is in the time spent calculating the hash of a key, which is much longer for a book than for a username.
Now the point of this isn't to set up a beauty contest between a bit vector and a Bloom filter for a representation. As always, the point is to think more about resource design and representation when building RESTful systems, and hopefully you will add both bit vectors and Bloom filters to your toolkit.