Bloom Filter Resources

Recently I commented that a Bloom Filter would be a good choice for a resource representation in response to Roy's comments on Evan and Kellan's presentation at OSCON.

If you aren't familiar with Bloom filters, the description from Wikipedia is a good start:

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

So here's the motivating example from Evan and Kellan's presentation:

On July 21st, 2008, friendfeed crawled flickr 2.9 million times to get the latest photos of 45,754 users, of which 6,721 of that 45,754 potentially uploaded a photo.

Their conclusion is that polling sucks. Roy points out that polling sucks if you are polling the wrong kind of resource, in this case 45,754 individual feeds, as opposed to a composite resource which represents the updated status of all the users on Flickr, and he presents a solution which uses a GIF as a bit vector representation. My proposal is to use a Bloom Filter which has a different set of characteristics over the bit vector representation.

Now ideas are nice, but my note was nothing more than an admonishment to "just" use bloom filters. Working code would be better, so I threw together a very simple Bloom Filter implementation. It is simple because I fixed many of the parameters that go into a Bloom Filter, fully realizing that a more general solution would have a more complex implementation, but all of the analysis I do will still apply.

In building this implementation I made a couple of assumptions:

I will presume that all of those 6,721 updates did not come in the same hour, and that they were distributed over a full 24 hour period. If so then a Bloom filter that was tuned to contain 3,000 keys with a false positive rate of 1% would be more than sufficient. We are going to have a simple system that creates a resource for each hour of the day that represents the users that have uploaded photos in the past hour. We can use a simple URI scheme for each hour of the day:

    http://example.org/2008/09/03/00
    http://example.org/2008/09/03/01
    ...
    http://example.org/2008/09/03/23

Each one of those URIs identifies a resource that contains a Bloom filter that contains all the user IDs of the feeds that have been updated in that hour. That is, the keys of the Bloom filters are user IDs and to use the system you would download the Bloom filter for the last hour and then test for membership of the IDs you are interested in. If you get a positive result for the membership test then you would go off and retrieve that users feed to get the latest entries. The URI layout will allow you to go back in time if you missed an hour.

We'll come back to the representation we'll use for the Bloom filter a little later.

Now reading the description of the Bloom Filter on Wikipedia at first glance makes it seem difficult with the need to create k hashes for each key you are adding to the filter. If we want to just add names it seems like we have to find a bunch of different hash functions to generate the k bit indexes, but the article contains information on a great simplification:

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k-1) to a hash function that takes an initial value; or add (or append) these values to the key.

So my solution is to use a hash function, sha1, which generates a hash 160 bits long, and then chop up the 160 bits into 7 hashes each 20 bits long. If you need to generate more bits than 160 you could use sha256 or sha512, or barring that you could use sha1 on a sequentially modified key value:

  sha1("0" + key)
  sha1("1" + key)
  ...
  sha1("9" + key)

When working with a Bloom filter you have control over several parameters that determine the behavior of the filter. The first parameter is p, which is the false positive rate, i.e. if you test for any value that isn't in the set there's a 1% chance the test will come back positive, indicating that the key is in the set. The second parameter is n which is the number of keys that will be in the set, in this case I've set that to 3,000. You can add less that 3,000 keys to this Bloom filter and the false positive rate will be lower, and add more than 3,000 keys and the false positive rate will be higher. The 1% false positive rate is for where there is exactly 3,000 keys in the filter.

We are now left with figuring out two more parameters for the Bloom filter. The parameter m is the number of bits in the filter, and k is the number of hash functions to use. If you read the Wikipedia article it gives a formula for calculating the optimal value of k that minimizes the false positive rate, and a Bloom filter with a 1% error rate and an optimal value for k only needs 9.6 bits per key, so we'll pick a nice round 30,000 for m. From the formula we determine that the optimal number of hash functions k is 7.

The last hurdle is coming up with a data structure to store 30,000 bits, but that's easy in Python since long integers can be of any length. That makes setting bits in the fitler rather trivial. If the hash function produces a number between 0 and 29,999 then the python code to set that bit in the filter is:

filter |= 2 ** hash(key)

Using longs also makes coming up with a representation pretty easy, we can just use the hexidecimal representation of the long that we are using for the filter:

representation = hex(filter)

For the sake of extensibility we should stuff that hex string inside a JSON or XML document, but for now for the sake of exposition I'm going to use the raw hex string.

Here is the project, and here is the highlighted source code for bloomnames.py.

And here is how simple it is to use:

>>> import bloomnames
>>> b = bloomnames.BloomNames()
>>> b.add("jcgregorio")
>>> b.add("barney")
>>> "fred" in b
False
>>> "jcgregorio" in b
True
>>> b.getfilter()
20000708816...675924992L
>>> hex(b.getfilter())
'0x8000000000....0000000L'

You will notice that with only two members in the fitler that the hexidecimal representation is mostly zeroes since there will be at most 14 non-zero nibbles. In general the fewer members you add to the filter the better it compresses. If this representation is served up over HTTP and the client accepts gzip compression then we get a huge savings:

>>> import zlib
>>> len(hex(b.getfilter()))
7188
>>> len(zlib.compress(hex(b.getfilter())))
87

Advantages

Now compressability was also one of the advantages of Roy's GIF-based solution, so this isn't unique to the Bloom filter represenation, but the Bloom filter does have some other advantages. The first is that I don't have to publish the universal set, like you would have to do for the bit vector implementation. You can take any string and test for it's membership in this Bloom filter. It turns out that the formulas we used to calculate the number of bits m and the optimal number of hash functions k don't rely on the size of the universal set. If you wanted to protect the list of members, or at least not publish a directory of everyone that used your service you would prefer the Bloom based solution since you aren't required to publish the universal set, and the Bloom filter doesn't contain the list of keys in a discoverable format.

Another shocking realization is that the formulas also don't rely on the size of the keys you are storing in the filter. While in this example only short user names are being stored, you could just as easily store absolute URIs to the feeds, or even the text of entire books into the bloom filter and the calculations would remain the same. Where you pay for this is in the amount of time you spend calculating the hash of a key, which would be much longer for a book over a username.

Summary

Now the point of this isn't to set up a beauty contest between a bit vector and Bloom filter for a representation. As always, the point is to think more about resource design and representation when building RESTful systems, and hopefully you will add both bit-vectors and Bloom fitlers to your toolkit.

Can you compare this design to FriendFeed's SUP protocol?

Posted by Steve Jones on 2008-10-19

Steve,

The storage for a single key with a 1% false positive rate in an optimal Bloom filter is 9.6 bits. Storage for a single key in SUP is roughly 8 bytes after compression. Because it uses md5 hashes it has some of the same benefits as a Bloom filter, such as not requiring the universal set to be exposed.

There are other advantages to Bloom filters in that you can control the false positive rate and Bloom filters can be sharded. That is, you can have multiple Bloom filters computed and if you want to create a filter for the union of all those sets then you just OR the filter values together.

All three solutions, SUP, Bloom filters, and bit-vectors have their pros and cons, but all three are vastly superior to polling large numbers of resources individually.

Posted by Joe Gregorio on 2008-10-19

OK, I see the size improvement. How do the approaches compare on latency?

Posted by Steve Jones on 2008-10-19

Very reminiscent of the Cache Digest protocol, which uses a bloom filter to build a representation of the contents of a Web cache, so that it can be exchanged with peers to give them insight into its probable contents, thereby increasing the chances of a sibling hit (as well as avoiding misses more often).

This has been used in Squid installations for several years, to very good effect. See their docs for more info.

To me, the really interesting thing here is the potential for implementing cache digests on non-cache origin servers to make their communication with Squid more efficient. And maybe advertising them through a site-wide metadata format…

What's old is new... thanks for stimulating the neurons (again) Joe!

Posted by Mark Nottingham on 2008-10-19

P.S. BTW, the digest is a resource, although it's currently hid behind a fake URI scheme. There's currently discussion on squid-dev to expose that and other internal resources (e.g., status and health info, as well as some control over the cache) through good old HTTP.

Posted by Mark Nottingham on 2008-10-19

A ruby implementation of BloomNames

You can still use BloomNames#add and BloomNames#contains, but those are just aliases to the more ruby-ish BloomNames#<< and BloomNames#include?. Oh, and you can get to the filter itself just from BloomNames#filter.

Posted by Jeff Hodges on 2008-10-19

Oh, and I just added a new BloomNames#inspect so that the huge @filter doesn't eat up screen real estate in an irb session. Stuck in a little info about how big the @filter number is, instead.

Posted by Jeff Hodges on 2008-10-19

This falls down if the requesting site doesn't know about all the potential things to look for in the bloom filter. If you just want a feed of (potentially unknown) things that map to some query or pattern, you need to resort to the XMPP solution or the callback based one I suggested.

Posted by joshua schachter on 2008-10-20

You'll probably find that SHA-1 is rarely a performance win in Bloom filter scenarios. While it has great diffusion properties it easily dominates all of the other performance costs. Kirsh and Mitzenmacher, whom I believe are referenced in the Wikipedia article, show that you can get nearly the same effect by generating a family of hashing functions from 2 hashing functions. Then you can use a fast lightweight dual hash function. This can be a huge win if performance is a consideration in exchange for a couple percent loss of space efficiency. I realize the principal draw of a Bloom filter is space not speed, but the time/space trade-off there is particularly attractive. Also, I describe a technique for growing a hybrid Bloom/linear hash table for doing distributed Bloom joins at http://comonad.com/reader/2008/linear-bloom-filters/

Posted by Edward Kmett on 2008-10-20

It's on the to do list to explain the nearly order of magnitude problems w/ Roy's math.

Unrelated to that, has anyone ever found a good implementation of a counting bloom filter? I've got one that I wrote to allow richer indexing of a class of machinetags, but it's under performing.

Posted by kellan on 2008-10-20

Edward, could you give some more specifics on what you'd consider a reasonable choice of hash functions in this case? I think I'm close to enlightenment, but need a little help.

Posted by Jeff Hodges on 2008-10-20

kellan: doesn't roy's solution force you to publish a list of all users? this seems problematic as well. i suppose you could add a userid to numeric id API. this forces one to use sequentially allocated userids, which is also bad.

Posted by joshua schachter on 2008-10-20

Bit of python to calculate optimal m and k from a desired error rate and number of entries:

def size_bloomfilter(capacity, error):
  m = math.ceil((capacity * math.log(error)) / math.log(1.0 / (math.pow(2.0, math.log(2.0)))))
  k = math.ceil(math.log(2.0) * m / capacity)
  return (int(m), int(k))

m, k = size_bloomfilter(3000, 0.01) # => 28756, 7

And same in Ruby:

def size_bloomfilter(capacity, error)
  m = ((capacity * Math.log(error)) / Math.log(1.0 / (2.0 ** Math.log(2.0)))).ceil
  k = (Math.log(2.0) * m / capacity).ceil
  [k,m]
end

This is translated from a library I wrote in C, which also supports backing the filter with mmap; seems to work quite nicely.

Also, I note Python and Ruby integers are immutable; using them like this would appear to cause a lot of copying. Could be a significant issue with larger filters. There's a BitSet class for Ruby somewhere, there should be similar for Python too.

Posted by Thomas Hurst on 2008-10-20

stdlib array (no bit, AFAIC, but array of integer is better than copying 30K integer in many cases. http://nightmare.com/software.html - npstruct

Posted by Jeremy Dunck on 2008-10-23