Roy Fielding comments on the OSCON presentation "Beyond REST? Building Data Services with XMPP PubSub" by Evan Henshaw-Plath and Kellan Elliott-McCrea.

Instead of a list of changed user ids or URIs, we can represent the state as a sparse bit array corresponding to all of Flickr’s users. I don’t know exactly how many users there are at Flickr, but let’s be generous and estimate it at one million. One million bits seems like a lot, but it is only 122kB in an uncompressed array. Considering that this array will only contain 1s when an update has occurred within the last minute, my guess is that it would average under 1kB per representation.

I can just imagine people reading "sparse bit array" and thinking that I must be talking about some optimal data structure that only chief scientists would think to use on the Web. I’m not. Any black-and-white GIF or PNG image is just a sparse bit array, and they have the nice side-effect of being easy to visualize.

I'll see your GIF and raise you a Bloom Filter. Not that it in anyway detracts from the point of the article, which is to put more thought into resource design before jumping off into a new protocol.

Thinking it through... so, you'd have a bloom filter that represented, say, all the users who have updated between noon yesterday and right now? Then anyone can fetch that and determine if any party they are interested in might need to be updated. You'd presumably be constructing many of these filters for different time windows, and the client would select the smallest window that includes the last time they updated.

Posted by Ian Bicking on 2008-09-11

Ian,

Yes, just like Roy's GIF example, but with the possibility of false positives, which you trade off for much smaller representation, which in the case of the 1000x1000 black and white GIF might be hard to beat, but might be applicable in other areas.

Posted by Joe Gregorio on 2008-09-11

Bloody bloom filters! I swear I saw one crawl behind one of the food trays at my nearby micro-kitchen!

Posted by Pierre Phaneuf on 2008-09-12

Hi

I had actually thought about using Bloom Filters as soon as I saw the article. But someone much smarter than me , said matter-of-factly, that if the array is really sparse (which we can safely assume if the new ones are at 1/2 hour intervals) then 2 level resources would be much easier , cleaner and probably more effective.

Bloom Filters is too complicated and goes away from the "Least power" TAG Finding. Bloom filters requires you to share too much info, like the hash functions you used etc. etc.

2 level would be something like:

first resource return would be in groups of 1000 users. If any one in the 1000 has changed then that bit is 1. if It is one, you retrieve another resource , again made of a 1000 bits which tells you which particular user has changed.

So for a 10 million users (which is a obscenely obscenely high number), if groups of 1000 , then the first resource would be of 10000 bits.


HTH

Posted by anon on 2008-09-12

Hi, Yes, Bloom Filters also came to my mind after reading this article. Actually the advantage of Bloom Filters would be that you would not need to share the numbering of the users. The numbering of users might be much more difficult to agree on, than for example the hash functions. One should have applied a patent for that, but now I think it's to late ;) Regards, Markus

Posted by Markus on 2008-09-16