As technology has found new use cases and the number of devices per person has grown, there has been an explosion of data generation. While this has been good for machine learning and the renaissance of neural networks, we have had to figure out how to store this deluge of data. The sheer volume of data being produced has pushed traditional data structures to their asymptotic limits, making them too slow for our current culture of impatience.
This is where probabilistic data structures save the day. They provide both low memory usage and fast access speeds; the trade-off they make to get this size and speed is accuracy. In this post we will examine the pros and cons, as well as the implementations, of two of the most popular: bloom filters and cuckoo filters.
A Little History
Bloom filters were created all the way back in 1970 by Burton Howard Bloom (hence the name), and their primary use is to test set membership. Their loss in accuracy comes from the possibility of false positives, but they guarantee no false negatives. So when checking membership, the filter can be thought of as returning ‘possibly in the set’ or ‘definitely not in the set.’ Another limitation of the original bloom filter is that elements can be added but never removed.
Later this restriction was removed by adding to the storage space required for each element in the filter. This variant is called the Counting Bloom Filter. Since 1970 there have been many of these variants researched, including the Scalable Bloom Filter proposed in 2007, which has the nice property of dynamically adapting its size.
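To make the membership semantics above concrete, here is a minimal bloom filter sketch. The sizes and the salted-SHA-256 hashing scheme are illustrative choices, not taken from any particular library:

```python
import hashlib

class BloomFilter:
    """A minimal bloom filter: k hash functions over a bit array of size m."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means 'definitely not in the set';
        # True means only 'possibly in the set'.
        return all(self.bits[pos] for pos in self._positions(item))
```

Note that `might_contain` can return a false positive when unrelated items happen to have set all k of an item's bit positions, but an added item always answers True: there are no false negatives.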
Cuckoo filters were developed in 2014, and are technically another variant of the Bloom Filters. The name Cuckoo comes from the underlying Cuckoo Hashtable. This hashtable gets its name from the way it removes items when inserting into already occupied buckets. This is similar to the way the cuckoo bird ‘inserts’ its eggs into occupied nests, where the hatchling then ejects the other eggs.
Now that we have had a biology lesson, let’s get back to the technical details. Cuckoo filters boast four main advantages over the standard bloom filters:
- They support the addition and deletion of items (though counting bloom filters also offer this)
- They are much simpler to implement than counting bloom filters
- For the same number of items stored, cuckoo filters have fewer false positives
- They have faster lookup speeds
The third advantage is especially true when the filter is ‘small’. The rest of the article will focus on the third and fourth claims.
For full implementations of the cuckoo hashtable and the cuckoo filter in Python, see this GitHub repo. Below is a rundown of the technical implementation found there.
A Side Note on Cuckoo Hashtables
First, to understand cuckoo filters we must understand cuckoo hashtables, and the best starting point is the chained hash table: a broad class of tables where colliding elements are placed into buckets attached to their insertion point.
The load factor of these tables is the number of elements (n) divided by the number of buckets (m). When n/m exceeds a constant threshold, the table size is doubled. This allows these hashtables to maintain an amortized O(1) insertion. In the worst case a lookup would be O(n), when all elements are in one bucket.
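A short sketch of this resizing policy may help. The doubling threshold here (resize when the load factor n/m would exceed 1) is an illustrative choice, since the post leaves the exact constant open:

```python
class ChainedHashTable:
    """A chained hash table that doubles when the load factor n/m exceeds 1."""

    def __init__(self, m=8):
        self.m = m                        # number of buckets
        self.n = 0                        # number of stored elements
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        if self.n + 1 > self.m:           # load factor would exceed 1: double
            self._resize(self.m * 2)
        self.buckets[hash(key) % self.m].append(key)
        self.n += 1

    def contains(self, key):
        # Worst case O(n): every element could hash into this one bucket.
        return key in self.buckets[hash(key) % self.m]

    def _resize(self, new_m):
        old = [k for bucket in self.buckets for k in bucket]
        self.m = new_m
        self.buckets = [[] for _ in range(new_m)]
        for k in old:
            self.buckets[hash(k) % new_m].append(k)
```

The occasional O(n) resize is what makes insertion amortized O(1) rather than worst-case O(1).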
Cuckoo hashing achieves a worst-case lookup of O(1), deletions in O(1), and insertions in amortized O(1). This is done by maintaining two tables of size m, each with its own hash function (h1 and h2). Every element x is stored either at h1(x) in the first table or at h2(x) in the second table.
This achieves O(1) for both lookup and deletion because only two positions must be checked for each element x. Insertions are trickier.
To insert an element x, start with table one. If slot h1(x) is empty, place x there and you are done. If not, place x there anyway and push the element that was occupying the slot to its slot in table two, possibly evicting another element in turn. Repeat this until no more elements are bouncing between tables.
Sometimes this may lead to a cycle. A cycle occurs when we revisit the same slot with the same element to insert. When this occurs we must rehash by selecting a new h1 and h2, and inserting all the elements into the new tables.
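The whole scheme can be sketched as below. Two simplifications are my own, not from the linked repo: random salts stand in for choosing new hash functions, and a bounded eviction loop stands in for explicit cycle detection (if the bound is exhausted, we treat it as a cycle and rehash):

```python
import random

class CuckooHashTable:
    """Two tables, two hash functions; each key lives in exactly one of its
    two candidate slots, so lookup and deletion inspect at most two slots."""

    def __init__(self, m=16):
        self.m = m
        self.tables = [[None] * m, [None] * m]
        self._reseed()

    def _reseed(self):
        # Fresh random salts stand in for picking a new h1 and h2.
        self.salts = (random.random(), random.random())

    def _h(self, i, key):
        return hash((self.salts[i], key)) % self.m

    def lookup(self, key):
        return (self.tables[0][self._h(0, key)] == key or
                self.tables[1][self._h(1, key)] == key)

    def delete(self, key):
        for i in (0, 1):
            if self.tables[i][self._h(i, key)] == key:
                self.tables[i][self._h(i, key)] = None
                return True
        return False

    def insert(self, key):
        for _ in range(self.m):           # bounded number of evictions
            for i in (0, 1):
                slot = self._h(i, key)
                if self.tables[i][slot] is None:
                    self.tables[i][slot] = key
                    return
                # 'Cuckoo' step: evict the occupant, then try to re-place
                # it in its slot in the other table.
                self.tables[i][slot], key = key, self.tables[i][slot]
        self._rehash()                    # assume a cycle: new hash functions
        self.insert(key)

    def _rehash(self):
        old = [k for t in self.tables for k in t if k is not None]
        self._reseed()
        self.tables = [[None] * self.m, [None] * self.m]
        for k in old:
            self.insert(k)
```

Lookup and deletion touch at most two slots, matching the O(1) bounds above; all the complexity lives in the eviction chain inside `insert`.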
A Cuckoo filter is a simple extension of the cuckoo hashtable. It consists of a hashtable with two new features:
- It uses a unique identifier called a ‘fingerprint’ obtained by hashing an item into a certain number of bits
- It also allows more than one item to be put in the same bucket; the number allowed is called the bucket size
For instance, a (2,4) cuckoo filter gives each item two candidate buckets, and each bucket in the hash table can store 4 fingerprints. In the paper describing the cuckoo filters, the authors found this specific setup (2,4) to be the most useful for a wide range of practical tasks.
Two other recommendations were made for practical applications of cuckoo filters:
- An interesting side effect of hashing ‘fingerprints’ is that they must be Ω(log(n)/b) bits long to achieve space efficiency, where n is the number of items and b is the bucket size.
- Another additional advancement on the performance of the cuckoo filter is to sort each of the buckets. This is called a Semi-Sorting Cuckoo Filter.
Below are three images that sum up the practicality of using Cuckoo filters. All images were taken from a talk given by the Cuckoo filter authors.
First we see that Cuckoo filters with semi-sorting take less space than bloom filters, at the same false positive rate. Bloom filters do not become more space efficient until almost the 10% false positive rate.
Second we see that Cuckoo filters are fast. Their lookup speeds are the fastest among the data structures that allow for insertion and deletion. The only structure close to competing is the Standard Bloom filter, which does not support deletion.
Finally, we see that insertion performance is the only metric where the cuckoo filter is not the clear winner, although at higher levels of table occupancy it becomes faster than the bloom filter implementations. The insertion difference is small compared to the space-efficiency and speed differences above.
It was interesting to investigate probabilistic data structures and read up on their history. They are tremendously useful to big companies struggling with this influx of data. Because of this it is a field full of research opportunities as new variants are coming out periodically.
I think it has been demonstrated that cuckoo filters are the winner for now, but I don’t think bloom filters or HyperLogLog structures are going away anytime soon.
Wikipedia https://en.wikipedia.org/wiki/Bloom_filter#CITEREFAlmeidaBaqueroPreguicaHutchison2007
Blog Post http://blog.fastforwardlabs.com/2016/11/23/probabilistic-data-structure-showdown-cuckoo.html
Cuckoo Filter https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
Counting Bloom Filter https://ieeexplore.ieee.org/document/851975/
Cuckoo Hashing http://www.it-c.dk/people/pagh/papers/cuckoo-jour.pdf
Stanford Hashing http://web.stanford.edu/class/archive/cs/cs166/cs166.1146/lectures/13/Small13.pdf
Cuckoo Slide Show http://slideplayer.com/slide/3953312/
Author’s Github https://github.com/efficient/cuckoofilter