Benchmarking Integer Compression in Go

Fri, Oct 11, 2013

comments/feedback

Updated post: Go vs Java: Decoding Billions of Integers Per Second

tl;dr

Go Learn Project #6 - Integer Compression

It’s been 4 weeks since my last post and I have been BUSY! My day job has been busier than ever (in a good way) and has taken over many of my nights and weekends. However, I’ve not forgotten the “Go Learn” project and continued to tinker with Go as I find time.

For “Go Learn” project #6, I decided to port JavaFastPFOR over to Go. The JavaFastPFOR repo actually is a collection of integer encoding/decoding algorithms. However, the more interesting ones are the ones created by Daniel Lemire, including FastPFOR, BP32, BP128, etc. To borrow from the repo’s readme:

[JavaFastPFOR] is a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use less than 32 bits. These sort of arrays often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).

Some CODECs (“integrated codecs”) assume that the integers are in sorted orders. Most others do not.

Thank You, Daniel Lemire

As you may have noticed, this is the second Daniel Lemire project that I’ve ported over. The previous one is Bitmap Compression using EWAH in Go.

Danile Lemire is a computer science professor at TELUQ (Université du Québec) where he teaches primarily online. He specializes in Databases, Data Warehousing and OLAP, Recommender Systems and Collaborative Filtering, and Information Retrieval.

I won’t elaborate on how knowledgeable he is here because you can easily tell by reading his papers and blogs. I do want to mention how Daniel has been extremely helpful on my porting effort. He’s provided tremendous guidance and support, and went even as far as providing me access to one of his machines for running performance tests.

So thank you Daniel!

Decoding Billions of Integers per Second Through Vectorization

This project is inspired by Danile’s blog post, Fast integer compression: decoding billions of integers per second, and paper. As the paper states:

In many important applications—such as search engines and relational database systems—data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression

As part of this research, Daniel also made his code available in C++ and Java. The Go port is based on JavaFastPFOR.

However, because Go has no access to the SSE instruction sets (well, not without getting into C or assembly), I was not able to port over the SIMD versions of the algorithms.

The Port

The Go port is available on github. In this version, I’ve proted over six algorithms, including

I won’t go into details of how each of these algorithms work. If you are interested, I strongly encourage you to go read Daniel’s paper.

The Benchmarks

To benchmark these algorithms, I’ve created 5 sets of data.

Timestamps, IP addresses, and Latencies are essentially 3 columns from another dataset.

Along with benchmarking the integer encoding algorithms, I also benchmarked the Go implementations of Gzip, LZW and Snappy. Both Gzip and LZW are part of the Go standard library. Snappy is a separate project implemented by the Go developers.

The Results

All benchmarks are performed on a machine with Intel® Xeon® CPU E5-2609 0 @ 2.40GHz.

Compression Ratio (Lower is Better)

Compression Ratio

Compression ratio is measured in bits/integer. Before compression, the integers we are compressing are all 32-bit integers. This chart shows us how many bits are used for each integer after compression.

There are a few things you can observe from this chart:

Compression Speed (Higher is Better)

Compression Speed

Compression speed is measured in millions of integers per second (MiS). Note that:

Decompression Speed

Decompression Ratio

Decompression speed is measured in millions of integers per second (MiS). Note that:

The Conclusions

The Raw Result

The following is a Google spreadsheet that contains the raw result set.

comments/feedback

comments powered by Disqus