Benchmarking Bloom Filters and Hash Functions in Go

Wed, Sep 4, 2013

discussion/comments

Update: @bradfitz commented that allocating a new slice during each add/check is probably not good for performance. And he is in fact correct. After removing the allocation for each add/check and doing a single slice make during New(), the performance increased by ~27%!! Here’s the gist containing the new ns/op results.

{% gist 6515577 %}


Another week, another “Go Learn” project. This time, in project #4, I implemented a Go package with several bloom filters and benchmarked them with various hash functions. (For previous projects, see #3, #2, #1.)

The goal of this project, for me at least, is to implement a pure Go package that doesn’t rely on wrappers to other langagues. There’s already quite a few bloom filters implemented in Go. But hey, in the name of learning, why not implement another one!

Bloom Filters

Wikipedia says,

A Bloom filter, conceived by Burton Howard Bloom in 1970 is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either “inside set (may be wrong)” or “definitely not in set”.

In my little project, I implemented the following three variants:

There are a ton more variants for bloom filters. You can raed all about them in this paper and this paper.

Hash Functions

To add an element, bloom filters hashes the element using k hashing functions, identifying k positions in the bit array, and setting those bits to 1. To check for an element, you essentially do the same thing (hash the element with k hash functions) and then check to see if all the bits at those positions are set. If all set, then most likely the element is available. If any of them are not set, then the element is definitely not avaiable. So bloom filters can have false positives, but not false negatives.

However, actually running k hash functions is quite expensive. So Kirsch and Mitzenmacher determined that by using a single hash function, but using the formula, *gi(x) = h1(x) + i * h2(x)*, to calculate k hash values, the performance is actually similar. So this is what I used here.

For the values h1 and h2, we used the first 4 bytes returned from each hash function for h1, and second 4 bytes for h2.

The following hash functions are used for this little experiement:

As a side note, MD5 and SHA1 return 128 bit hash values. Since we only use the first 64 bits, we throw away the last 64 bits.

Test Setup

The machine that ran these tests is a Macbook Air 10.8.4 1.8GHz Intel Core i5 4GB 1600MHz DDR3.

For the false positive test, I added all the words in /usr/share/dict/web2 (on my Mac) into each of the bloom filters. To check for false positives, I check for all the words in /usr/share/dict/web2a in each of the bloom filters. The two files should have completely different set of words (AFAICT).

  235886 /usr/share/dict/web2
   76205 /usr/share/dict/web2a

For each bloom filter, I ran tests using the following combinations:

for i in element_count[236886, 200000, 100000, 50000]
    for j in hash_functions[fnv64, crc64, murmur3-64, cityhash-64, md5, sha1]
        run test with hash_functions[j] and element_count[i]
    end
end

Element count in this is the initial size I set for the bloom filter. The bit array size, m, and number of hash values, k, are then calculated from there. I also set P (error probability) to 0.001 (0.1%) and p (fill ratio, or how full should the bit array be) to 0.5. The idea for the element count is to see how the bloom filters will perform when it has a high fill ratio.

For the scalable bloom filter test, I also needed to add another dimension since it uses either standard or partitioned bloom filter internally. So

for i in element_count[236886, 200000, 100000, 50000]
    for j in hash_functions[fnv64, crc64, murmur3-64, cityhash-64, md5, sha1]
        for k in bloom_filter[standard, partitioned]
            run test with hash_functions[j] and element_count[i] and bloom_filter[k]
        end
    end
end

For the performance test, I added all the words in web2 into the bloom filters. I continue to loop through the file until b.N (Go benchmark framework) is met. So some of the words will be re-added, which should not skew our test results since the operations are the same.

You can see the tests in the github repo. Just look for all the _test.go files.

Test Results

The following is a summary of the test results. You can also feel free to look at all the gory details.

Note: FR = Fill Ratio, FP = False Positive

For the spreadsheet embedded below, here are some observations

This chart shows the performance (ns/op) for adding elements to the bloom filters. Overall the performance is very similar for the different bloom filter implementations.

Feedback

Feel free to send me any feedback via github issues or on Hacker News.

References

During this process I referenced and leveraged some of these other projects, so thank you all!

discussion/comments

comments powered by Disqus