Instead of a list of changed user ids or URIs, we can represent the state as a sparse bit array corresponding to all of Flickr’s users. I don’t know exactly how many users there are at Flickr, but let’s be generous and estimate it at one million. One million bits seems like a lot, but it is only 122kB in an uncompressed array. Considering that this array will only contain 1s when an update has occurred within the last minute, my guess is that it would average under 1kB per representation.
I can just imagine people reading "sparse bit array" and thinking that I must be talking about some optimal data structure that only chief scientists would think to use on the Web. I’m not. Any black-and-white GIF or PNG image is just a sparse bit array, and they have the nice side-effect of being easy to visualize.
I'll see your GIF and raise you a Bloom Filter. Not that it in anyway detracts from the point of the article, which is to put more thought into resource design before jumping off into a new protocol.
Yes, just like Roy's GIF example, but with the possibility of false positives, which you trade off for much smaller representation, which in the case of the 1000x1000 black and white GIF might be hard to beat, but might be applicable in other areas.
Posted by Joe Gregorio on 2008-09-11
Posted by Pierre Phaneuf on 2008-09-12
I had actually thought about using Bloom Filters as soon as I saw the article. But someone much smarter than me , said matter-of-factly, that if the array is really sparse (which we can safely assume if the new ones are at 1/2 hour intervals) then 2 level resources would be much easier , cleaner and probably more effective.
Bloom Filters is too complicated and goes away from the "Least power" TAG Finding. Bloom filters requires you to share too much info, like the hash functions you used etc. etc.
2 level would be something like:
first resource return would be in groups of 1000 users. If any one in the 1000 has changed then that bit is 1. if It is one, you retrieve another resource , again made of a 1000 bits which tells you which particular user has changed.
So for a 10 million users (which is a obscenely obscenely high number), if groups of 1000 , then the first resource would be of 10000 bits.
HTH
Posted by anon on 2008-09-12
Posted by Markus on 2008-09-16
Posted by Ian Bicking on 2008-09-11