tl;dr

  • My Mutex vs CAS test results are very similar to the ones from the Disruptor whitepaper
  • Mutex is much slower than CAS (duh!)

Motivation

A while back I wrote a quick prototype ring buffer (Ring Buffer – Variable-Length, Low-Latency, Lock-Free, Disruptor-Style). It used atomic operations to check and increment counters, and it proved to be faster than Go channels under the right circumstances.

While researching that project, I also went through the Disruptor technical whitepaper several times to make sure I understood what's going on underneath. If you haven't gone through it, be sure to take a look.

One of the things that caught my attention is on page 2, where it discusses The Cost of Locks. On that page, it listed out the time it took to increment a counter 500 million times under different contention scenarios. Here’s the table from the paper:

Method                              Time (ms)   Multiples
Single thread                             300        1.00
Single thread with lock                10,000       33.33
Two threads with lock                 224,000      746.67
Single thread with CAS                  5,700       19.00
Two threads with CAS                   30,000      100.00
Single thread with volatile write       4,700       15.67

Setup

The paper didn’t specify the machine spec that the tests were run on, so I thought I run a similar set of tests on a couple of machines I have access to. The setup for my tests are:

  • I had two machines. One is a MacBook Pro with a 2.8 GHz Intel Core i7. The other is a SuperMicro server with an Intel® Xeon® CPU E5-2609 0 @ 2.40GHz.
  • I ran the tests using Go 1.3: go version go1.3 darwin/amd64 and go version go1.3 linux/amd64, respectively.
  • I ran the same set of tests using the command go test -cpu=1,2,3 -timeout 30m
  • I had Mutex tests for 1, 2, and 100 goroutines. For CAS, I had 1, 2, 1000, 100,000, and 1,000,000 goroutines. (I know, I should have added a 100-goroutine CAS test, but I didn't feel like re-running the tests. And I don't think the results will vary that much. :)
  • Both machines were fairly idle aside from this test.
  • Memory was 16 GB and 96 GB, respectively, although memory really didn't play a role in this test.

Results

Here are some interesting results:

  • The above chart illustrates the huge difference between no locks (column group 1, probably can’t really see anything), mutex locks (column groups 2, 3, 4), and CAS (rest of the column groups).
  • The blue column is using 1 CPU core, the red column is using 3 CPU cores.
  • The baseline is column group 1 (Test1GoroutineNoLock), which is assigned the value of 1. All other values are multiples of the baseline.

As we can see here, mutex is expensive. So I know that’s not earth-shattering news, but it’s still interesting to see.

  • For example, when using 1 CPU core, 2 goroutines using mutex is 93 times slower than 1 goroutine without any locks. And 2 goroutines using CAS is almost 40 times slower.
  • However, when using 3 CPU cores, 2 goroutines using mutex is now 250 times slower than 1 goroutine without locks. And 2 goroutines using CAS is 63 times slower.

We also see that the more goroutines contending on the mutex, the slower it gets. For example, 100 goroutines using mutex is 500 times slower than baseline when using 3 cores. However, when using CAS, even with 1 million goroutines, we are still doing pretty well (100 times slower than baseline).

The results when running on the SuperMicro server are less varied compared to the Core i7 processor: 100 goroutines using mutex is only 60 times slower than baseline, and 1 million goroutines using CAS is only 20 times slower than baseline. However, the SuperMicro server is much slower than the MBP.

The raw results are available below:

The actual Go test program is below:
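The actual test file was embedded here. For reference, a minimal sketch of what the mutex and CAS variants look like is below; the function names, the 2-goroutine split, and the constant are my own illustration, not the exact test code.

package caslock

import (
	"sync"
	"sync/atomic"
	"testing"
)

// 500 million increments, matching the Disruptor whitepaper's test.
const iterations = 500000000

// Sketch: 2 goroutines incrementing a shared counter under a mutex.
func Test2GoroutinesWithMutex(t *testing.T) {
	var counter int64
	var mu sync.Mutex
	var wg sync.WaitGroup

	for g := 0; g < 2; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations/2; i++ {
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	if counter != iterations {
		t.Fatalf("expected %d, got %d", iterations, counter)
	}
}

// Sketch: 2 goroutines incrementing a shared counter with CAS, retrying
// until the compare-and-swap succeeds.
func Test2GoroutinesWithCAS(t *testing.T) {
	var counter int64
	var wg sync.WaitGroup

	for g := 0; g < 2; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations/2; i++ {
				for {
					old := atomic.LoadInt64(&counter)
					if atomic.CompareAndSwapInt64(&counter, old, old+1) {
						break
					}
				}
			}
		}()
	}
	wg.Wait()

	if counter != iterations {
		t.Fatalf("expected %d, got %d", iterations, counter)
	}
}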

Comments/Feedbacks at Hacker News, Reddit

The past few evenings I've been working on a Go program that listens on a TCP socket, accepts new connections, processes some data, and then returns some data to the client. This is actually fairly simple in Go. The top of the net package documentation has some fairly simple examples of how to do that, and another example can be found about mid-way down the page. You can also find a TON of examples online that show you how to build a simple TCP listener.

However, one thing I noticed in all these examples is that none of them shows you how to gracefully shut down the TCP listener. Most of the examples expect the program to exit, so there's no need to clean up anything. However, if you have a long-running server and need, for whatever reason, to shut down the TCP listener, you will need to clean up after yourself. Otherwise you may leave a bunch of goroutines behind unintentionally.

Another reason I wanted a way to shut down the TCP listener is that I want to be able to start a listener in my tests, start up a bunch of clients, test some stuff, and then shut down the server afterwards. Then I can start another listener in another test for some other tests.

After some help from jhoto and foobaz on the #go-nuts IRC channel, I wrote the following example to demonstrate the graceful shutdown approach.

The basic idea is to leverage a quit channel to tell the Accept() goroutine that it's time to quit. Using a quit channel is a fairly common practice in Go. However, in this case the Accept() call blocks waiting for new connections, so closing the quit channel won't have any effect unless the goroutine actually checks it. To force Accept() to return from blocking, we can close the net.Listener.

The order of the operations matters somewhat. We want to first close the quit channel, then close the net.Listener. If we reverse the order, we will likely see a few more errors from the Accept() call.

The netgrace_test.go file below shows an example of how to use the quit channel to help gracefully shut down net.Listeners.
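The original netgrace_test.go was embedded as a gist here. The sketch below captures the same idea; the names and structure are mine, so they may differ from the gist.

package main

import (
	"log"
	"net"
	"sync"
)

// serve accepts connections until the listener is closed. The quit channel is
// how we tell an expected shutdown apart from a real Accept() error.
func serve(ln net.Listener, quit chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		conn, err := ln.Accept()
		if err != nil {
			select {
			case <-quit:
				// The listener was closed on purpose; exit quietly.
				return
			default:
				log.Println("accept error:", err)
				continue
			}
		}

		wg.Add(1)
		go func(c net.Conn) {
			defer wg.Done()
			defer c.Close()
			// ... read from and write to the connection ...
		}(conn)
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}

	quit := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go serve(ln, quit, &wg)

	// ... start clients, run tests, etc. ...

	// Shutdown: close the quit channel first, then the listener, so the
	// Accept() goroutine treats the resulting error as expected.
	close(quit)
	ln.Close()
	wg.Wait()
}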

Hopefully you will find this tip useful. You can find it as a gist as well.

Comments/Feedbacks at Hacker News, Reddit

Comments/feedbacks on Hacker News, Reddit

2013-12-04 Update #1: Read a very interesting thread on the golang-nuts list on the performance of interfaces. It seems like using interfaces really affects performance; in Joshua's test he saw a 3-4x performance difference. I decided to try this on the ring buffer implementation since I am currently using an interface. A quick test later, and it looks like NOT using an interface increased performance 2.4x. For now the code is in the "nointerface" branch.

Benchmark1ProducerAnd1Consumer-3         5000000               353 ns/op
Benchmark1ProducerAnd1ConsumerInBytes-3 10000000               147 ns/op

tl;dr

  • This project implements a low-latency lock-free ring buffer for variable length byte slices. It is modeled after the LMAX Disruptor, but not a direct port.
  • If you have only a single core, don’t use this. Use Go channels instead! In fact, unless you can spare as many cores as the number of producers and consumers, don’t use this ring buffer.
  • In fact, for MOST use cases, Go channels are a better approach. This ring buffer is really a specialized solution for very specific use cases.
  • The primary pattern of this ring buffer is single producer and multiple consumers, where the single producer puts bytes into the buffer, and each consumer processes ALL of the items in the buffer. (Other patterns can be implemented later but this is what's here now.)
  • This ring buffer is designed to deal with situations where we don't know the length of the byte slice beforehand. It will write the byte slice to the buffer across multiple slots in the ring if necessary.
  • The ring buffer currently employs a lock-free busy-wait strategy, where the producer and consumers will continue to loop until data is available. As such, it performs very well in a multi-core environment (almost twice as fast as Go channels) if you can spare 1 core per producer/consumer, but extremely poorly in a single-core environment (600 times worse compared to Go channels).
  • You can find a lot of information on the LMAX Disruptor. The resources I used include Trisha’s Disruptor blog series, LMAX Disruptor main page, Disruptor technical whitepaper, and Martin Fowler’s LMAX Architecture article.
  • You can also read more about circular buffer, producer–consumer problem.

Go Learn Project #7 – Ring Buffer

For the past several projects (#6, #5, #4), I’ve mostly been hacking bits, and optimizing them as much as possible for a single core.

For project #7, I decided to do something slightly different. This time we will create a ring buffer that can support variable length byte slices, and leverage multi-cores using multiple goroutines.

The primary target use case is for a producer to read bytes that require processing, such as JSON, CSV, and TSV strings, from sockets, ZMQ, files, etc., at a very high speed (millions of lines per second), and put these lines into a buffer so other consumers can process them.

A good example is when we import files with millions of lines of JSON objects. These JSON objects are read from the files, inserted into the buffer, and then unmarshalled into other structures.

The goal is to process millions of data items per second.

Ring Buffer

There are several ways to tackle this; a queue or ring buffer is usually a good data structure for the job. You can think of a ring buffer as just a special type of queue that wraps around on itself. As Wikipedia says,

A circular buffer, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.

So, implementation-wise, we can tackle this problem using a standard ring buffer. The most common way of implementing a standard ring buffer is to keep track of a head and a tail pointer, and keep putting items into the buffer but ensuring that head and tail don’t cross each other.

In most implementations, the pointers are mod'ed with the ring size to determine the next slot, and a mutex is used to ensure only one thread is modifying these pointers at a time.
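For illustration, a minimal sketch of that standard head/tail-plus-mutex approach (this is not the Disruptor-style buffer this post builds; all names here are my own) could look like this:

package ringbuffer

import (
	"errors"
	"sync"
)

// Ring is a minimal mutex-guarded ring buffer of byte slices.
type Ring struct {
	mu    sync.Mutex
	slots [][]byte
	head  uint64 // next slot to read
	tail  uint64 // next slot to write
}

func NewRing(size int) *Ring {
	return &Ring{slots: make([][]byte, size)}
}

// Put stores an item in the next slot, failing when the buffer is full.
func (r *Ring) Put(item []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.tail-r.head == uint64(len(r.slots)) {
		return errors.New("ring buffer full")
	}
	// The pointer is mod'ed with the ring size to find the actual slot.
	r.slots[r.tail%uint64(len(r.slots))] = item
	r.tail++
	return nil
}

// Get returns the oldest item, failing when the buffer is empty.
func (r *Ring) Get() ([]byte, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.head == r.tail {
		return nil, errors.New("ring buffer empty")
	}
	item := r.slots[r.head%uint64(len(r.slots))]
	r.head++
	return item, nil
}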

Go Channels

In Go, an idiomatic and fairly common way is to use buffered channels to solve this problem. It is effectively a queue where a producer puts data items into one end of the channel, and the consumer reads the data items on the other end of the channel. There are certainly pros and cons to this approach.

First, it is Go idiomatic! Need I say more… :)

This is probably the easiest approach since the Go channel is already a battle-tested data structure and it's readily available. An example is provided below in the examples section. Performance-wise it is actually not too bad: in a multi-core environment, it's about 60% of the performance of the ring buffer. However, in a single-core environment, it is MUCH faster. In fact, 1300 times faster than my ring buffer!

There is one major difference between using channels and this ring buffer. When the channel has multiple consumers, the data is multiplexed to the consumers, so each consumer will get only part of the data rather than going through all the data. This is illustrated by this playground example. The current design of the ring buffer allows multiple consumers to go through every item in the queue. An obvious workaround is to send the data items to multiple channels.

One downside with my channel approach is that I end up creating a lot of garbage over time that will need to be GC'ed; specifically, I am creating a new byte slice for each new data item. Again, there is a workaround for this: one can implement a leaky buffer. However, because we don't know how big the data items are beforehand, it's more difficult to preallocate the buffers up front.

There might actually be a way to implement the leaky buffer with a big preallocated slice. I may just do that as the next project. The goal is to see if we can avoid having to allocate individual byte slices and leverage CPU caching for the big buffer.
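For reference, the leaky-buffer pattern described in Effective Go keeps a free list of buffers on a buffered channel. A rough sketch is below; it only works cleanly when the buffer capacity is known up front, which is exactly the limitation mentioned above.

package leaky

// freeList holds reusable buffers. The channel's capacity bounds how many
// idle buffers we keep around.
var freeList = make(chan []byte, 128)

// getBuffer reuses a buffer from the free list if one is available,
// otherwise it allocates a new one.
func getBuffer() []byte {
	select {
	case b := <-freeList:
		return b[:0]
	default:
		return make([]byte, 0, 256) // assumed fixed capacity
	}
}

// putBuffer returns a buffer to the free list, or drops it (lets it "leak"
// to the garbage collector) when the free list is full.
func putBuffer(b []byte) {
	select {
	case freeList <- b:
	default:
	}
}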

Lock-Free Ring Buffer

The way that I’ve decided to tackle this problem is to model the ring buffer after the LMAX Disruptor. If you haven’t read Martin Fowler’s article on LMAX Architecture, at this time I would recommend that you stop and go read it first. After that, you should go read Trisha’s Disruptor blog series that explains in even more details how the Disruptor works.

One thing to keep in mind is that the Disruptor-style ring buffer has significant resource requirements: it needs N cores, where N is the number of producers and consumers, to be performant, and it will keep those cores busy by busy-waiting (looping). That's a huge downside. If you don't need this type of low-latency architecture, it's much better to stay with channels.

Performance Comparison

These benchmarks are performed on a MacBook Pro with a 2.8GHz Intel Core i7 processor (Haswell) and 16 GB 1600MHz DDR3 memory. Go version 1.2rc5.

The 2 channel consumers benchmark is a single producer sending to 2 channels, each channel consumed by a separate goroutine. So it is apples-to-apples compared to the ring buffer 2-consumers benchmark.

You can clearly see the requirement of 1 core per consumer/producer in the ring buffer implementation (first 6 lines). Without that, performance suffers greatly!

Also notice that the channel benchmark (last 6 lines) is faster on a single core than on multiple cores. This is probably due to your friendly cache at play.

$ go test -bench=. -run=xxx -cpu=1,2,3
PASS
Benchmark1ProducerAnd1Consumer     10000            339807 ns/op
Benchmark1ProducerAnd1Consumer-2        10000000               258 ns/op
Benchmark1ProducerAnd1Consumer-3        10000000               260 ns/op
Benchmark1ProducerAnd2Consumers    10000            341859 ns/op
Benchmark1ProducerAnd2Consumers-2         200000             14967 ns/op
Benchmark1ProducerAnd2Consumers-3        5000000               340 ns/op
BenchmarkChannels1Consumer      10000000               241 ns/op
BenchmarkChannels1Consumer-2     5000000               436 ns/op
BenchmarkChannels1Consumer-3     5000000               446 ns/op
BenchmarkChannels2Consumers      5000000               319 ns/op
BenchmarkChannels2Consumers-2    5000000               647 ns/op
BenchmarkChannels2Consumers-3    5000000               570 ns/op

Examples

Go Channel: 1 Producer and 1 Consumer

func BenchmarkChannels(b *testing.B) {
  dataSize := 256
  data := make([]byte, dataSize)
  for i := 0; i < dataSize; i++ {
      data[i] = byte(i % 256)
  }

  ch := make(chan []byte, 128)
  go func() {
      for i := 0; i < b.N; i++ {
          // To be fair, we want to make a copy of the data, otherwise we are just
          // sending the same slice header over and over. In the real-world, the
          // original data slice may get over-written by the next set of bytes.
          tmp := make([]byte, dataSize)
          copy(tmp, data)
          ch <- tmp
      }
  }()

  for i := 0; i < b.N; i++ {
      out := <-ch
      if !bytes.Equal(out, data) {
          b.Fatalf("bytes not the same")
      }
  }
}

Ring Buffer: 1 Producer and 1 Consumer

This test function creates a 256-slot ring buffer, with each slot being 128 bytes long. It also creates 1 producer and 1 consumer, where the producer will put the same byte slice into the buffer 10,000 times, and the consumer will read from the buffer and then make sure we read the correct byte slice.

func Test1ProducerAnd1ConsumerAgain(t *testing.T) {
  // Creates a new ring buffer that's 256 slots and each slot 128 bytes long.
  r, err := New(128, 256)
  if err != nil {
      t.Fatal(err)
  }

  // Gets a single producer from the ring buffer. If NewProducer() is called
  // the second time, an error will be returned.
  p, err := r.NewProducer()
  if err != nil {
      t.Fatal(err)
  }

  // Gets a single consumer from the ring buffer. You can call NewConsumer() multiple
  // times and get back a new consumer each time. The consumers are independent and will
  // go through the ring buffers separately. In other words, each consumer will have 
  // their own independent sequence tracker.
  c, err := r.NewConsumer()
  if err != nil {
      t.Fatal(err)
  }

  // We are going to write 10,000 items into the buffer.
  var count int64 = 10000

  // Let's prepare the data to write. It's just a basic byte slice that's 256 bytes long.
  dataSize := 256
  data := make([]byte, dataSize)
  for i := 0; i < dataSize; i++ {
      data[i] = byte(i % 256)
  }

  // Producer goroutine
  go func() {
      // Producer will put the same data slice into the buffer _count_ times
      for i := int64(0); i < count; i++ {
          if _, err := p.Put(data); err != nil {
              // Unfortunately we have an issue here. If the producer gets an error 
              // and exits, the consumer will continue to wait and not exit. In the
              // real-world, we need to notify all the consumers that there's been
              // an error and ensure they exit as well.
              t.Fatal(err)
          }
      }
  }()

  var total int64

  // Consumer goroutine
  
  // The consumer will also read from the buffer _count_ times
  for i := int64(0); i < count; i++ {
      if out, err := c.Get(); err != nil {
          t.Fatal(err)
      } else {
          // Check to see if the byte slice we got is the same as the original data
          if !bytes.Equal(out.([]byte), data) {
              t.Fatalf("bytes not the same")
          }

          total++
      }
  }

  // Check to make sure the count matches
  if total != count {
      t.Fatalf("Expected to have read %d items, got %d\n", count, total)
  }
}

Ring Buffer: 1 Producer and 2 Consumers

As mentioned before, the ring buffer supports multiple consumers. This example shows how you would create two consumers.
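The original example was embedded here. A trimmed sketch, reusing the API from the single-consumer test above, is below; since the consumer type isn't spelled out there, the method-value trick sidesteps naming it, and the extra imports (bytes, sync, testing) are assumed to come from the same test file.

func Test1ProducerAnd2Consumers(t *testing.T) {
	// Same setup as before: 256 slots, each 128 bytes long.
	r, err := New(128, 256)
	if err != nil {
		t.Fatal(err)
	}

	p, err := r.NewProducer()
	if err != nil {
		t.Fatal(err)
	}

	// NewConsumer() can be called multiple times; each consumer keeps its own
	// sequence tracker, so both will see every item the producer puts in.
	c1, err := r.NewConsumer()
	if err != nil {
		t.Fatal(err)
	}

	c2, err := r.NewConsumer()
	if err != nil {
		t.Fatal(err)
	}

	var count int64 = 10000
	data := []byte("hello, ring buffer")

	// Producer goroutine
	go func() {
		for i := int64(0); i < count; i++ {
			if _, err := p.Put(data); err != nil {
				return
			}
		}
	}()

	// Each consumer reads count items and verifies them.
	var wg sync.WaitGroup
	wg.Add(2)
	consume := func(get func() (interface{}, error)) {
		defer wg.Done()
		for i := int64(0); i < count; i++ {
			out, err := get()
			if err != nil {
				t.Error(err)
				return
			}
			if !bytes.Equal(out.([]byte), data) {
				t.Error("bytes not the same")
				return
			}
		}
	}
	go consume(c1.Get)
	go consume(c2.Get)
	wg.Wait()
}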

Conclusion

See tl;dr on top.

Comments/feedbacks on Hacker News, Reddit

Comments/feedbacks on reddit, hacker news

2013-11-17 Update #2: Tried another new trick. This time I updated the leading-bit-position function, when using the gccgo compiler, to use libgcc's __clzdi2 routine. This had the same effect as update #1, except it applies when gccgo is used. The performance increase ranged from 0% to 20% for encoding only. Thanks to dgryski on reddit and minux on the golang-nuts mailing list.

2013-11-16 Update #1: Tried a new trick, which is to use an assembly version of bitlen to calculate the leading bit position. See the section below on “Use Some Assembly”.

So, before you crucify me for benchmarking Go vs Java, let me just say that I am not trying to discredit Go. I like Go and will use it for more projects. I am simply trying to show the results as objectively as I can, and hope that the community can help me improve my skills as well as the performance of the libraries.

I don’t consider myself a Go expert. I’ve been using Go for a few months and have been documenting the projects as I go in this blog. I’ve learned a ton and have applied many of the things I learned in optimizing this project. However, I cannot claim that I have done everything possible, so the performance numbers “could” still be better.

I would like to ask for your help, if you can spare the time, to share some of your optimization tips and secrets. I would love to make this library even faster if possible.

tl;dr

The following chart shows how much (%) faster Java is compared to Go in decoding integers that are encoded using different codecs. It shows results from processing two different files. See below for more details.

  • Daniel Lemire’s Java version is anywhere from 12% to 180% faster than my Go version for decoding integers. I didn’t compare the C++ version but given that the C++ version has access to SIMD operations, it can be much faster.
  • I tried many different ways to optimize my Go code for this project, including using range to loop through slices, inlining simple functions, unrolling simple loops, unrolling even more loops, disabling bound checking (not generally recommended), compiling with gccgo, and using some assembly.
  • Using gccgo -O3 resulted in the highest performance. I tested using standard gc compiler, gc -B, gccgo, and gccgo -O3. The comparison above uses the gccgo -O3 numbers.
  • Using a range loop instead of unrolling a loop in one of the often-used functions, AND compiling with gccgo -O3, I was able to get within 6% of the Java version for Delta BP32 decoding. However, all of the other Go binaries suffered greatly.
  • This benchmark is purely a CPU benchmark. The test environment has enough memory to keep all the arrays in memory without causing swap, and there’s no disk IO involved in the actual encoding/decoding functions.

Project Review

A month ago I wrote about my “Go Learn” project #6: Benchmarking Integer Compression in Go (github). In that project I ported 6 different codecs for encoding and decoding 32-bit integers. Since then, I have ported a couple more codecs, cleaned up the directories, and performed a ton of profiling and optimization to increase performance.

There are now a total of 8 codecs available:

  • Binary Packing (BP32), FastPFOR, Variable Byte (varint) (top level directories)
    • Standard codec that encodes/decodes the integers as they are
  • Delta BP32, Delta FastPFOR (new), Delta Variable Byte (under delta/)
    • Encodes/decodes the deltas of the integers
    • These codecs generally produce much more compact representations if the integers are sorted
    • These codecs generally perform much faster, but there are some exceptions
  • ZigZag BP32, ZigZag FastPFOR (new) (under zigzag/)
    • Encodes/decodes the deltas of the integers, where the deltas themselves are encoded using Google's zigzag encoding

In addition, the benchmark program under benchmark/ is provided to let users easily test different integer lists and codecs.

Techniques Tried

I cannot say this enough: benchmark, profile, optimize, rinse, repeat. Go has made testing, benchmarking, and profiling extremely simple. You owe it to yourself to optimize your code using these tools. Previously I have written about how I was able to improve the cityhash Go implementation performance by 3-16X by Go profiling.

To optimize the integer encoding library, I followed the same techniques to profile each codec and tried as much as I could to optimize the hot spots.

Below are some of the optimizations I’ve tried. Some helped, some didn’t.

For-Range Through Slices

I learned this when Ian Taylor from Google (and others) helped me optimize one of the functions using range to loop through the slices instead of for i := 0; i < b; i++ {} loops. The for-range method can be 4-7 times faster than the other way. You can see the difference between BenchmarkOffset and BenchmarkRange here.

I also found that gccgo -O3 can do some really good optimizations with simple range loops. You can see the difference with this gist. When using gc, the standard Go compiler, BenchmarkRange (31.3 ns/op) is 56% slower than BenchmarkUnrolled (13.9 ns/op). However, the reverse is true when using gccgo -O3: BenchmarkUnrolled (8.92 ns/op) is 100% slower than BenchmarkRange (4.46 ns/op).

Side note: this set of benchmarks is courtesy of DisposaBoy and Athiwat in the #go-nuts IRC channel. Thanks for your help, guys.
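For illustration, the two styles being compared look roughly like this (a sketch in the spirit of that gist, not the gist itself):

package loops

import "testing"

var src = make([]int32, 32)
var dst = make([]int32, 32)

// Index-based loop: every dst[offset+i] and src[i] access carries its own
// bound check.
func BenchmarkOffset(b *testing.B) {
	for n := 0; n < b.N; n++ {
		for i := 0; i < 32; i++ {
			dst[0+i] = src[i] + 1
		}
	}
}

// Range-based loop over sub-slices: the compiler has an easier time proving
// the bounds once for the whole loop.
func BenchmarkRange(b *testing.B) {
	for n := 0; n < b.N; n++ {
		out := dst[0:32]
		for i, v := range src[0:32] {
			out[i] = v + 1
		}
	}
}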

Unroll Simple Loops

For some simple, known-size, loops, such as initializing a slice with the same initial non-zero value, unrolling the loop makes a big difference. The caveat is that gccgo -O3 does an amazing job of optimizing these simple range loops, so in that case unrolling the loop is actually slower.

As an example, the following function is unrolled as

func deltaunpack0(initoffset int32, in []int32, inpos int, out []int32, outpos int) {
    out[outpos+0] = initoffset
    out[outpos+1] = initoffset
    out[outpos+2] = initoffset
    .
    .
    .

It was originally written as

tmp := out[outpos:outpos+32]
for i := range tmp {
    tmp[i] = initoffset
}

When unrolled, AND using the standard gc compiler, we saw performance increase by almost 45% if this function is called often. However, as we mentioned above, when using gccgo -O3, the for-range loop is 33% faster than the unrolled method.

For now, I am keeping the unrolled version of the function.

Unroll Even More Loops

Given the success of unrolling the above simple loop, I thought I'd try unrolling even more loops to see if it helps.

It turns out performance actually suffered in some cases. I speculated that the reason may have to do with the bound checking when accessing slices. It turns out I might be right. After I disabled bound checking, performance increased when unrolling these loops. See below regarding disable bound checking.

Inline Simple Functions

There are several small functions that get called quite often in this encoding library. During profiling I see these functions at the top all the time.

I decided to try and inline these functions to see if the reduced function call overhead would help. In general I didn't see much performance improvement using this technique; the most I saw was a 1% increase, which I attribute to noise.

Disable Bound Checking (Generally NOT Recommended)

This integer encoding library operates on large slices of data. There's a TON of slice access by index. Knowing that every indexed slice access requires bound checking, I decided to try disabling bound checking by compiling the code with -gcflags -B.

Disabling the bound checking for the Go compiler didn’t help as much as I hoped. For ts.txt, disabling bound checking increased performance by 10% for Delta BP32 encoding only.

However, if I disabled bound checking AND unrolled even more loops, decoding performance increased anywhere from 10-40%; encoding performance didn't see much change. The following chart shows the performance increase (%) from the for-range loops to unrolling some additional loops when bound checking is disabled, using the standard gc compiler.

The question is, is it worth the removal of bound checking? For the raw results, I kept the numbers from the for-range loops.

Using gccgo

When I was looking around for Go optimization techniques, I saw several posts on StackOverflow as well as the Go mailing list suggesting people try gccgo as the compiler if they wanted more performance. So I thought I'd give it a shot as well. After downloading gcc 4.8.2 and letting it compile overnight, I was finally able to test it out.

Compiling with gccgo without any optimization flags actually saw performance drop by 50-60%. The best performance result was achieved when I compiled using gccgo -O3. The comparison to Java uses the numbers from that binary.

As mentioned above, gccgo -O3 seems to do a pretty amazing job of optimizing simple range loops. I was able to achieve 33% performance increase using for-range with gccgo -O3 instead of unrolling the simple loop. The final result was within 6% of the Java version for Delta BP32 decoding.

Use Some Assembly

The final trick I tried is to convert one of the often used functions to assembly language. This was suggested to me almost 6 weeks ago by Dave Andersen on the golang-nuts Google group. He suggested that I steal the bitLen function in the math/big/arith files. That’s exactly what I did.

The bitlen function returns the position of the most significant bit that's set to 1. It is most often called by encoding methods to determine how many bits are required to store an integer, so naturally one would expect only the encoding functions to be improved.
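In pure Go, the function computes something like this (a simple equivalent of the idea, not the math/big source or the assembly version):

// bitlen returns the number of bits needed to represent v, i.e. the 1-based
// position of the most significant set bit: bitlen(0) == 0, bitlen(1) == 1,
// bitlen(255) == 8.
func bitlen(v uint32) int {
	n := 0
	for v != 0 {
		n++
		v >>= 1
	}
	return n
}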

That’s exactly what happened. Using the new bitlen assembly function, I was able to improve encoding performance by anywhere from 3% to 40%, depending on the codec. The most significant improvement was saw when Delta FastPFOR encoding was applied on latency.txt. It consistently saw ~40% performance increase.

As such, the code has been updated to use the assembly version of the bitlen.

Benchmark Environment

The system I used to run the benchmarks was graciously provided by Dr. Daniel Lemire. Here are the CPU and memory information at the time I ran the benchmarks. As you can see, we have plenty of memory to load the large integer arrays and should cause no swap. (I had this issue running these tests on my little MBA with 4GB of memory. :)

No disk IO is involved in this benchmark.

OS

NAME="Ubuntu"
VERSION="12.10, Quantal Quetzal"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu quantal (12.10)"
VERSION_ID="12.10"

CPU

model name      : Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz
cpu MHz         : 3591.566
cache size      : 10240 KB

Memory

         total       used       free     shared    buffers     cached
Mem:      32872068   13548960   19323108          0     209416   11517476
-/+ buffers/cache:    1822068   31050000
Swap:     33476604          0   33476604

java

java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.10.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

go and gccgo

go version go1.2rc4 linux/amd64
gcc version 4.8.2 (GCC)

Files

There are two files used in this benchmark:

  • File 1: ts.txt contains 144285498 sorted integers. They are timestamps at 1 second precision. There are a lot of repeats as multiple events are recorded for that second.
  • File 2: latency.txt contains 144285498 unsorted integers. An example, lat.txt.gz, can be seen here.

Codecs

For this benchmark, I used 4 different codecs:

Because both BP32 and FastPFOR work on 128 integer blocks, the 58 remaining integers from the test files are encoded using Delta VariableByte and Variable Byte codecs, respectively. This is achieved using a Composition (Java, Go) codec.

The Java and Go versions of the codecs are almost identical, logic-wise, aside from language differences. These codecs operate on arrays (or slices in Go) of integers. There are a lot of bitwise and shift operations, and lots of loops.

Benchmark Results

The raw results from running the different Go compilers (and flags) and codecs are in the Google spreadsheet at the bottom of the post.

To be consistent, all the percentage numbers presented below are based on dividing the difference between the larger number (A) and the smaller number (B) by the smaller number (B).

(A – B) / B

So you can say that A is faster than B by X%.

Go Fastest

I compiled 4 different versions of Go binary for this benchmark:

  • benchmark.gc
    • This is built using the standard Go compiler, gc. The command is go build benchmark.o.
  • benchmark.gc-B
    • This is built using the standard Go compiler, gc, and I turned off bound checking for this version since the codecs deals with slices a lot. The command is go build -gcflags -B benchmark.o.
  • benchmark.gccgo
    • This is built using the gccgo compiler with no additional flags. The command is go build -compiler gccgo benchmark.o.
  • benchmark.gccgo-O3
    • This is built using the gccgo compiler with the -O3 flag. The command is go build -compiler gccgo -gccgoflags '-O3 -static' benchmark.o.

Surprisingly, disabling the bound checking for the Go compiler didn’t help as much as I hoped. For ts.txt, disabling bound checking increased performance by 10-40% for encoding only. For ts.txt decoding and latency.txt, it didn’t help at all.

Using the gccgo compiler with no flags had the worst performance. In general we saw that benchmark.gc (standard Go version) is about 110%-130% faster than the gccgo version.

Lastly, the gccgo-O3 version is the fastest. This is the version that’s compiled using -O3 flag for gccgo. We saw that the gccgo-O3 version is anywhere from 10% to 60% faster than the gc version.

Above is a chart that shows the decoding performance of the ts.txt file for the different binaries and codecs.

For the comparison with Java, I am using the gccgo-O3 numbers.

Bits Per Integer (Lower is Better)

As you can tell, when the integers are sorted (ts.txt), the compression ratio is VERY good. Delta BP32 achieved 0.25 bits per 32-bit integer, and Delta FastPFOR is even better at 0.13 bits per integer. This is also because there are a lot of repeats in ts.txt.

Because the timestamps are rather large numbers, e.g., 1375228800, when the non-delta codecs are used, they did not achieve very good compression ratio. We achieved ~31 bits per 32-bit integer using standard FastPFOR and BP32 codecs.

When the integers are NOT sorted, then we run into trouble. When delta codecs are used, a lot of deltas are negative numbers, which means the MSB for most of the deltas is 1. In this case, it’s actually better to use the standard codecs instead of the delta codecs. The standard codecs achieved ~24 bits per 32-bit integer, and the delta codecs were ~32 bits per 32-bit integer.

I also tested the latency file against the zigzag delta codecs and achieved ~25 bits per 32-bit integer, so it's not much better than the standard codecs. However, zigzag delta comes in extremely handy when the negative deltas are small in magnitude.
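For reference, zigzag encoding maps signed deltas to unsigned values so that numbers close to zero, negative or positive, stay small. A quick sketch of the transform:

// zigzagEncode maps 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// so small-magnitude deltas encode into few bits even when negative.
func zigzagEncode(n int32) uint32 {
	return uint32((n << 1) ^ (n >> 31)) // arithmetic shift spreads the sign bit
}

// zigzagDecode reverses the mapping.
func zigzagDecode(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}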

Java vs. Go

For this section, we are comparing the decoding speed between the fastest Go version and Java. As you saw at the beginning of this post, Daniel Lemire's Java version is anywhere from 12% to 180% faster than my Go version for decoding integers.

The following chart shows how much (%) faster Java is compared to Go in decoding integers that are encoded using different codecs. It shows results from processing two different files. See below for more details.

The following chart shows the decoding performance while processing ts.txt.

The following chart shows the decoding performance while processing latency.txt.

Raw Results

The raw results are in the following Google spreadsheet.

Comments/feedbacks on reddit, hacker news

Comments/Feedback on Hacker News, Reddit

2013-11-19 Update #1: After more profiling (see the top output in the "After modification" section), I've found that these 3 functions, unalignedLoad64, fetch64, and uint64InExpectedOrder, add up to quite a bit of execution time. I looked at the original cityhash implementation and realized that the combination of these functions is basically getting a LittleEndian uint64, which we can read by doing LittleEndian.Uint64(). So I updated fetch64 to do just that. Performance increased by ~20% just because of that. Also, the bloom filter test showed that cityhash is now faster than both FNV64 and CRC64.

Another month has gone by since the last post. This month has been extremely busy at work in an extraordinarily good way. We had a huge POC that went quite successfully at a large customer site. I also got a chance to visit Lisbon, Portugal as part of this POC. So things overall went pretty well.

However, since family and work pretty much occupied most of my waking hours over the past few weeks, I haven’t made much progress on the “Go Learn” projects. To keep myself going, I picked a smaller task over the weekend and decided to go back to “Go Learn Project #1”, my cityhash Go implementation.

Go Learn Project #1

When I first decided to learn Go, I struggled quite a bit to find a project that I could sink my teeth into. I am not sure if others are the same way, but for me to learn a new language, I have to have something meaningful to work on. I can't just write hello world programs or follow tutorials. I can read books and articles, but I will also procrastinate for weeks if I can't find a relevant project.

Luckily, I was thinking about creating a data generator at work and wanted to write that in Go. But in order to write the data generator in Go, I first have to have a cityhash implementation in Go because our backend (C/C++) is using cityhash.

Surprisingly, I looked around but couldn’t find any Go implementation of cityhash. I would have thunk that given Go and Cityhash are both from Google, some Googler would have already ported cityhash over to Go. But no such luck, or maybe I just didn’t look hard enough. In any case, I decided to port cityhash over to Go.

Porting an existing project in C over to Go has a big advantage in that I don’t have to invent any new data structures or algorithms. It will allow me to focus on learning the Go syntax and the standard libraries. Many many moons ago (before I converted to the dark side) I was pretty proficient in C and reading cityhash wasn’t too difficult, so porting cityhash over to Go should be relatively straightforward.

In any case, the porting process wasn’t too difficult, as it turned out. Overall I was able to do that over a weekend. I was also able to port the test program (city-test.cc) over as well (vim substitution FTW) to validate that my implementation was functionally correct.

Performance Sucked

I had always suspected that my Go cityhash implementation wasn't great performance-wise. At the time I hadn't learned how to benchmark or profile Go programs, so I didn't do a whole lot except ensure that the results were functionally correct. Also, my data generator at work was working fine, so I left the implementation as is.

In September, I implemented a Bloom Filter package which required a hash function as part of the implementation. For that package, I tested different hash functions to see how they affect the performance of the bloom filter. As you can see from those benchmarks, Cityhash is consistently 3x slower compared to the others. At the time I knew it was because of my implementation but didn't look into it further.

Profiling Go Cityhash

Since that first project, I have learned quite a bit more about benchmarking and profiling. So this weekend I finally took time to profile the Go implementation and found some interesting results. Now Go experts probably will read this and say “of course, you had no idea what you were doing.” And that would be true. I had no idea at the time. Hopefully this post will make up for it.

If you haven’t read this blog post on Go profiling, you should go read it now before continuing.

In any case, I wrote a short program to test cityhash with a big file. This way it can collect enough samples to tell me where the bottleneck is.

The original implementation took 45 seconds to hash a 1.1G file. Below is the cpu profile output.

duration = 45401991546ns
Total: 4295 samples
    2766  64.4%  64.4%     2766  64.4% runtime.memmove
     374   8.7%  73.1%      463  10.8% sweepspan
     259   6.0%  79.1%      383   8.9% MHeap_AllocLocked
     123   2.9%  82.0%      123   2.9% runtime.markspan
      95   2.2%  84.2%     1223  28.5% runtime.mallocgc
      74   1.7%  85.9%       74   1.7% runtime.MSpan_Init
      43   1.0%  86.9%     4234  98.6% github.com/reducedb/cityhash.unalignedLoad64
      40   0.9%  87.9%      595  13.9% runtime.MCache_Alloc
      32   0.7%  88.6%       32   0.7% runtime.markallocated
      31   0.7%  89.3%       56   1.3% MHeap_FreeLocked

As you can see, most of the time was spent copying memory (64.4%), and sweepspan (part of the GC) was also running quite often (8.7%).

Before this, I had no idea that there was that much memory being copied. So this is definitely interesting. I then looked at the graph of the profile data using the "web" command.

It shows clearly that unalignedLoad64, a function that loads a uint64 from the buffer, is causing most of the memmove. Technically, it's calling binary.Read(), which creates an array of 8 bytes and passes it to another function, which eventually calls runtime.copy to copy a few bytes of data from the original buffer into the array.

So now the reason for the large amount of time spent in memmove is clear. Basically, every time I call binary.Read(), it creates an 8-byte array. Up to 8 bytes of data are copied into it. Then the data in the array gets converted into a uint64. After that, the array is thrown away. And this is done over and over again for the whole 1.1G file, which means 1.1G of memory is being created in tiny 8-byte chunks, copied, and thrown away. It's no wonder the program is slow!

By this time, some of the readers are probably wondering why the heck I was using binary.Read() if I knew I would be reading a uint64 from a slice. And they would be right again. The only excuse I have is that I had no clue, and that was the first thing I found to work a few months back, so I just used it.

Modifying the Implementation

The change turned out to be relatively simple. Instead of using binary.Read(), I used LittleEndian.Uint64() to read the uint64. After the change, I ran the same program again.
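Roughly, the before and after look like this (a sketch; the helper names are mine and error handling is omitted):

package cityhash

import (
	"bytes"
	"encoding/binary"
)

// Before (roughly): binary.Read pulls the bytes through an io.Reader into a
// temporary value, which is where the copying and garbage came from.
func load64Old(p []byte) (result uint64) {
	binary.Read(bytes.NewReader(p[:8]), binary.LittleEndian, &result)
	return result
}

// After: decode the 8 bytes directly from the slice. No reader, no
// allocation, no copy.
func load64New(p []byte) uint64 {
	return binary.LittleEndian.Uint64(p)
}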

Here are the results from the post-change run. The time it took to hash the 1.1G file is only 2.8 seconds. That’s 16X faster than before the change. The “top” profile output is also a lot more reasonable.

duration = 2718339693ns
Total: 245 samples
      57  23.3%  23.3%       57  23.3% encoding/binary.littleEndian.Uint64
      50  20.4%  43.7%      227  92.7% github.com/reducedb/cityhash.CityHash128WithSeed
      40  16.3%  60.0%       97  39.6% github.com/reducedb/cityhash.unalignedLoad64
      35  14.3%  74.3%      146  59.6% github.com/reducedb/cityhash.fetch64
      28  11.4%  85.7%       28  11.4% github.com/reducedb/cityhash.uint64InExpectedOrder
      27  11.0%  96.7%      132  53.9% github.com/reducedb/cityhash.weakHashLen32WithSeeds_3
       8   3.3% 100.0%        8   3.3% github.com/reducedb/cityhash.weakHashLen32WithSeeds
       0   0.0% 100.0%        1   0.4% MHeap_AllocLarge
       0   0.0% 100.0%      227  92.7% _/Users/jian/Projects/cityhash_test.TestLatencyIntegers
       0   0.0% 100.0%      227  92.7% github.com/reducedb/cityhash.CityHash128

Also, nothing really jumps out when looking at the post-change profile data graph.

Bloom Filter Benchmarks

Now that the changes are in, I went back and re-ran some of the bloom filter benchmarks. They look a lot more reasonable as well. Below is a comparison of the Scalable Bloom Filter. The post-change run is almost 2.5x faster than the pre-change run. Also, the post-change number (1442 ns/op) is a lot closer to some of the other hash functions (~1100 ns/op).

Scalable Bloom Filter
---------------------
BenchmarkBloomCityHash   1000000              1442 ns/op (after cityhash change)
BenchmarkBloomCityHash   1000000              3375 ns/op (before cityhash change)

Summary

The Go authors have made it extremely simple to test, benchmark, and profile Go programs, so there's really no reason for anyone not to do that often. It helps you see how your program works and where the bottlenecks are. It can also help you identify surprises that you may not have thought of. A good example: in my case, I had no idea binary.Read() worked the way it does until I profiled my program.

Comments/Feedback on Hacker News, Reddit

comments/feedback

Updated post: Go vs Java: Decoding Billions of Integers Per Second

tl;dr

  • Source code
  • Let me start by saying that I am not that happy with the performance numbers I got. It probably has more to do with my familiarity and expertise with Go than anything else. But still…
  • Having said that, I am pleasantly surprised how much faster some of the integer compression algorithms are compared to standard compression algorithms such as gzip, LZW, and even snappy.
  • There’s no one size fits all solution. There’s always tradeoffs between compression ratio and compression/decompression speed. So depending on the type of integer data and how fast you want to encode/decode, you will need to choose the right solution.
  • Gzip does probably the best job in compression ratio but worst in terms of encoding/decoding performance.
  • Delta BP32 (IntegratedBinaryPacking in JavaFastPFOR) performs the best (ratio, encoding/decoding speed) for sorted integer arrays such as timestamps.
  • I would love it if someone can run these tests on a faster machine and send me the results. I would be happy to put them into the spreadsheet.

Go Learn Project #6 – Integer Compression

It’s been 4 weeks since my last post and I have been BUSY! My day job has been busier than ever (in a good way) and has taken over many of my nights and weekends. However, I’ve not forgotten the “Go Learn” project and continued to tinker with Go as I find time.

For “Go Learn” project #6, I decided to port JavaFastPFOR over to Go. The JavaFastPFOR repo actually is a collection of integer encoding/decoding algorithms. However, the more interesting ones are the ones created by Daniel Lemire, including FastPFOR, BP32, BP128, etc. To borrow from the repo’s readme:

[JavaFastPFOR] is a library to compress and uncompress arrays of integers very fast. The assumption is that most (but not all) values in your array use less than 32 bits. These sort of arrays often come up when using differential coding in databases and information retrieval (e.g., in inverted indexes or column stores).

Some CODECs (“integrated codecs”) assume that the integers are in sorted orders. Most others do not.

Thank You, Daniel Lemire

As you may have noticed, this is the second Daniel Lemire project that I’ve ported over. The previous one is Bitmap Compression using EWAH in Go.

Daniel Lemire is a computer science professor at TELUQ (Université du Québec) where he teaches primarily online. He specializes in Databases, Data Warehousing and OLAP, Recommender Systems and Collaborative Filtering, and Information Retrieval.

I won’t elaborate on how knowledgeable he is here because you can easily tell by reading his papers and blogs. I do want to mention how Daniel has been extremely helpful on my porting effort. He’s provided tremendous guidance and support, and went even as far as providing me access to one of his machines for running performance tests.

So thank you Daniel!

Decoding Billions of Integers per Second Through Vectorization

This project is inspired by Daniel's blog post, Fast integer compression: decoding billions of integers per second, and the accompanying paper. As the paper states:

In many important applications—such as search engines and relational database systems—data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression

As part of this research, Daniel also made his code available in C++ and Java. The Go port is based on JavaFastPFOR.

However, because Go has no access to the SSE instruction sets (well, not without getting into C or assembly), I was not able to port over the SIMD versions of the algorithms.

The Port

The Go port is available on github. In this version, I've ported over six algorithms, including

  • FastPFOR
  • BP32 (BinaryPacking in JavaFastPFOR)
  • Delta BP32 (IntegratedBinaryPacking in JavaFastPFOR)
  • ZigZag BP32 (BP32 with Delta encoding using ZigZag encoding method.)
  • VariableBytes
  • Delta VariableBytes (Integrated VariableBytes in JavaFastPFOR)

I won’t go into details of how each of these algorithms work. If you are interested, I strongly encourage you to go read Daniel’s paper.

The Benchmarks

To benchmark these algorithms, I’ve created 5 sets of data.

  • Clustered – This is a generated list of random and sorted integers based on the clustered model.
  • Uniform – This will generate a “uniform” list of sorted integers.
  • Timestamps – This is a list of timestamps in sorted order, extracted from another data set that’s mainly network monitoring data. The timestamps are epoch time (seconds since 1970). It has long runs of the same timestamp because the dataset contains many entries per second.
  • IP Addresses – This is a data set that contains a 32-bit integer representation of IPv4 addresses. This data set is NOT sorted.
    • In the real-world, one probably wouldn’t compress IP addresses like that. One would probably use dictionary encoding for the IPs, then encode the dictionary keys.
    • The dictionary keys will likely be much smaller numbers, which means it will compress fairly well.
  • Latencies – This list of integers represent latencies on the network. It is also unsorted.

Timestamps, IP addresses, and Latencies are essentially 3 columns from another dataset.

Along with benchmarking the integer encoding algorithms, I also benchmarked the Go implementations of Gzip, LZW and Snappy. Both Gzip and LZW are part of the Go standard library. Snappy is a separate project implemented by the Go developers.

The Results

All benchmarks are performed on a machine with Intel® Xeon® CPU E5-2609 0 @ 2.40GHz.

Compression Ratio (Lower is Better)

Compression Ratio

Compression ratio is measured in bits/integer. Before compression, the integers we are compressing are all 32-bit integers. This chart shows us how many bits are used for each integer after compression.

There are a few things you can observe from this chart:

  • Sorted integer lists ALWAYS perform better (ratio and speed) than unsorted integer lists.
  • Sometimes the compressed size is LARGER than the uncompressed size. This is because some of these algorithms use extra space to store encoding meta information.
  • Gzip, in general, has the best compression ratio.
  • Delta BP32 performs the best for sorted lists, but really has problems with unsorted lists.

Compression Speed (Higher is Better)

Compression Speed

Compression speed is measured in millions of integers per second (MiS). Note that:

  • Delta BP32 by far has the best compression speed across different data sets.
  • BP32 does a fairly decent job compressing as well, but its compression ratio is fairly poor.
  • Gzip and LZW perform the most poorly.

Decompression Speed

Decompression Speed

Decompression speed is measured in millions of integers per second (MiS). Note that:

  • Again, Delta BP32 does the best across different data sets, and gzip/LZW did the poorest.
  • BP32 decoded extremely fast for sorted timestamps that have large runs of the same timestamps.
  • FastPFOR seems to perform the most consistently for integer compression algorithms. (For compression as well.)

The Conclusions

  • For encoding sorted numbers, Delta BP32 has the best combination of compression ratio, and encoding/decoding speed.
  • For encoding unsorted numbers, Snappy seems like a good alternative.
  • For long term storage, it might be worth considering gzip. It provides the best compression ratio for data that may not be accessed for a long while.

The Raw Result

The following is a Google spreadsheet that contains the raw result set.

comments/feedback

comments/suggestions

tl;dr

  • If you are interested in database bitmap index, then it’s definitely worth learning more about bitmap compression.
  • EWAH is a very interesting way of compressing bitmaps and this article talks about my port of it to Go.
  • Sorry I think you should read the rest of the article. :)

The Real Deal

For “Go Learn” project #5, I’ve decided to continue implementing bit-related data structures. This time I decided to port a bitmap (not image) compression data structure to Go. Unlike bloom filters, which had quite a few implementations, I couldn’t really find any compressed bitmap implementations in Go. Most of them are in C/C++ or Java. (For previous projects please see previous blog posts.)

Btw, if you have any suggestions on what project I should do next, feel free to drop me a comment.

Bitmaps

Bitmap compression is often used for database bitmap indexing. To quote Wikipedia again:

A bitmap index is a special kind of database index that uses bitmaps. […]

Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. Bitmap indexes have a significant space and performance advantage over other structures for query of such data. Their drawback is they are less efficient than the traditional B-tree indexes for columns whose data is frequently updated: consequently, they are more often employed in read-only systems that are specialized for fast query – e.g., data warehouses, and generally unsuitable for online transaction processing applications. […]

Some researchers argue that Bitmap indexes are also useful for moderate or even high-cardinality data (e.g., unique-valued data) which is accessed in a read-only manner, and queries access multiple bitmap-indexed columns using the AND, OR or XOR operators extensively.

There are quite a few approaches to bitmap compression. You will find most of the work listed on the Wikipedia page. The two I found to be most interesting are CONCISE and Enhanced Word-Aligned Hybrid (EWAH). After reading through both papers and looking at their Java implementations, I decided to start with EWAH. There's no particular reason for starting with EWAH first, but Daniel Lemire's comment intrigued me: "Out of the bitmap formats that I have often tested, Concise is the most concise, but JavaEWAH is often faster. So there is a trade-off." I wanted to see how fast the Go implementation could be, so I figured I'd start with EWAH. I may still port CONCISE later. (Note: I am not the best Go programmer nor do I claim that my code is the most performant. This is a learning project so there will definitely be room for improvement.)

EWAH

Enhanced Word-Aligned Hybrid (EWAH) is a bitmap compression algorithm created by Daniel Lemire, Owen Kaser, and Kamel Aouiche. (paper)

EWAH uses a method called Run-Length Encoding to perform compression. RLE is extremely useful for datasets that tend to have long blocks of the same data. In the case of database bitmap indices, there may be long blocks of 1's or 0's, depending on the sparsity of the data. In cases where sparsity is low, compression can be extremely high. We will show some data points on this later.

EWAH uses two types of words where the first type is a 64-bit verbatim word. The second type of word is a marker word: the first bit indicates which clean word will follow, half the bits (32 bits) are used to store the number of clean words, and the rest of the bits (31 bits) are used to store the number of dirty words following the clean words. EWAH bitmaps begin with a marker word.

Clean (empty) words are either all 0's or all 1's. Dirty (literal) words are verbatim words that have a mix of 1's and 0's. Blocks of clean words can be compressed using RLE. Dirty words cannot be compressed, so they are represented verbatim.

3210987654321098765432109876543210987654321098765432109876543210
0000000000000000000000000000010000000000000000000000000000000100
0000000000000000000001000000000000000000000000000000000000000001
0000000000000000000000000000000000000000001000000000000000000100

For example, the first of the three 64-bit words above (the line of digits on top is just a bit-position ruler) is a marker word. Bits go from right to left, so the first bit being 0 means the clean/empty words all have their bits as 0. The next 32 bits indicate the number of clean words this marker is encoding; the value here is 10 in binary, i.e., 2, so this marker word encodes 2 clean words that are all 0's.

The final 31 bits (starting at position 33 from the right) indicate the number of literal words that will follow this marker word. In this case, there are also 2 dirty (literal) words following the marker word, shown as the second and third lines above.

As you can see, the overall bitmap contains 3 64-bit words. However, it encodes 4 64-bit words. So there’s a space saving of 25%.
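To make the layout concrete, here is a small sketch of how the three fields could be pulled out of a marker word, assuming exactly the bit layout described above (this is an illustration, not code from the Go port):

// decodeMarker splits a 64-bit EWAH marker word into its fields: bit 0 is the
// value of the clean words (all 0's or all 1's), bits 1-32 hold the number of
// clean words, and bits 33-63 hold the number of dirty (literal) words that
// follow the marker.
func decodeMarker(w uint64) (cleanBit bool, cleanWords uint32, dirtyWords uint32) {
	cleanBit = w&1 == 1
	cleanWords = uint32((w >> 1) & 0xFFFFFFFF)
	dirtyWords = uint32(w >> 33)
	return
}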

For more detailed information please read the paper. It’s definitely worth the read if you are trying to understand bitmap compression.

There are a few caveats to EWAH:

  1. There is no random access to any single bit, unlike a regular bit array. To access a random bit, you need to walk through the whole bitmap and find the word that contains the bit.
  2. You cannot set a bit that’s earlier than the last set bit. For example, if you have already set bit 100, you can not go back and set bit 99.

Technically you can solve both of these problems, but the cost of doing such operations will be extremely high.

Go Port

Most of my nights and weekends over the past couple of weeks have gone into the porting of JavaEWAH to Go. There’s also a C++ version of EWAH which I’ve also gone through. Code structure wise it’s very similar to the Java version. You can find the Go EWAH port on Github.

This is one of the more involved ports I've done to Go because it involved quite a few more classes and the logic required much more understanding. With previous projects like Cityhash and Bloom Filter, I could get away with either translating directly without complete understanding (in the case of Cityhash) or the logic was fairly simple and could be coded up pretty easily (bloom filters).

I wanted to be able to implement multiple bitmap compression algorithms down the road, so I started with a simple bitmap interface. It contains most of the basic bit operations such as Set, Get, And, Not, Or, Xor, AndNot, Cardinality, etc. I then went through the Java source code and ported everything that's required to implement these functions.

What this means is that my Go version is NOT a complete port of JavaEWAH. There are some java classes and methods I chose not to port as they weren’t required for satisfying the bit operations.

Initially I ported the classes over exactly as the Java version had them. This included classes such as RunningLengthWord, BufferedRunningLengthWord, EWAHIterator, and IteratingBufferedRunningLengthWord. These classes form the basis for iterating over the bitmap, and they are heavily used throughout the implementation of all the bit operations. Iterating over a bitmap usually involves multiple layers of nested iterators.

What I found is that my Go port using this same approach ran fairly slowly because of the multiple nested iterators. (Results later.) Now, to be completely honest and fair, this does not mean the Java version is also slow. It just means my Go version is slow. As I have learned over the past few projects, there are always ways to make things faster. I try to apply what I’ve learned in previous projects to the new ones, but that certainly doesn’t mean I always succeed.

Cursors

After porting the Java version to Go directly, I learned a ton about the structure of the algorithm and how the different bit operations are performed. Given that my Go version was running fairly slow, I decided to try a different approach. Instead of the nested iterator approach, I implemented a Cursor structure that basically collapsed RunningLengthWord, BufferedRunningLengthWord, IteratingBufferedRunningLengthWord, and EWAHIterator into a single structure.

Whenever I need to iterate over the bitmap, I create a cursor (you can call it an iterator if you like, but it’s just a single layer with no nesting) and use that to keep track of where I am. Then I loop over the bitmap and perform the necessary operations.
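As a rough idea of what the cursor carries, here is a sketch; the field names are my own for illustration, and the real struct tracks more state:

// Illustrative sketch of the cursor state collapsed from the nested-iterator
// classes; field names are hypothetical, not the repo's exact ones.
type cursor struct {
	markerIndex int64 // position of the current marker word in the buffer
	cleanBit    bool  // whether the clean words in this run are all 1's
	cleanLeft   int64 // clean (RLE) words remaining in the current run
	dirtyLeft   int64 // dirty (literal) words remaining after the clean run
	dirtyIndex  int64 // position of the next dirty word in the buffer
}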

In many cases during my benchmarks, the cursor-based implementation is 2-4x faster than the previous approach. I took great care to make sure the two implementations return exactly the same results. You can see some of that in my tests.

Get

The original Java implementation did not have a Get function, which returns the bit at any random location. This is because bitmap implementations such as CONCISE and EWAH are meant to be used with bitmap indices, and in most cases you don’t need to check a random bit. The common use case is to perform bit operations on two bitmaps, then find all the bits that are set.

Regardless, I wanted to have a Get function during my testing so I could check that the bits I set are indeed set correctly. So I did a quick Get implementation by looping over the bitmap. Daniel was gracious enough to convert my Go implementation back into Java and incorporate it into JavaEWAH.

However, walking the bitmap each and every time we want to check a bit is way too slow. In some of the use cases I have in mind, I tend to check bits sequentially. For example, I will check bits 100, 159, 302, 405, etc. The bits are almost always increasing.

So, taking a page from the skiplist implementation I did a few weeks back, I implemented Get with fingers. The finger concept is fairly simple: we keep a finger on the last checked word, and if the new bit to check is in or after the last checked word, we just move forward from there. If the new bit is BEFORE the current word, we restart from the beginning. The actual Get implementation keeps a getCursor as a finger to the last checked word.
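Here is a minimal, self-contained sketch of the finger idea. To keep it short, it walks an uncompressed word slice rather than EWAH runs, and the names are mine, not the repo’s:

// Illustrative finger sketch: remember where the last Get stopped and resume
// from there when the next bit is at or after that point. The real getCursor
// walks EWAH runs; this simplified version walks plain 64-bit words.
type finger struct {
	wordIndex int64 // index of the last word we stopped on
	startBit  int64 // bit position where that word begins
}

func get(words []uint64, f *finger, i int64) bool {
	if i < f.startBit { // bit is before the finger: restart from the beginning
		f.wordIndex, f.startBit = 0, 0
	}
	for f.startBit+64 <= i { // move the finger forward to the word holding bit i
		f.wordIndex++
		f.startBit += 64
	}
	return words[f.wordIndex]&(1<<uint(i-f.startBit)) != 0
}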

The result is quite an improvement for sequential checks:

  • Get with Finger and Cursor – 125 ns/op
  • Get without Finger but uses Cursor – 60542 ns/op
  • Get using the nested iterators approach – 822530 ns/op

Results

The final product of this project is available on Github. I ran quite a few benchmarks with different bitmap density (1-sparsity) as well as running bit operations with two bitmaps of varying density. Here are some points to note.

Density vs Space Saving

The whole reason for bitmap compression to exist is to save space. So here are some results from that test. Probably not surprisingly, the higher the density, the lower the space savings. The chart below shows how density and space savings correlate for a bitmap that has a cardinality of 10,000.

Space Saving Bit Density
-0.62%       50.39%
-0.04%        6.46%
32.68%        0.67%
91.61%        0.07%
99.15%        0.01%

My test data set is generated fairly uniformly. To get different densities, I generate a random number between 1 and N, where N is the number in the left-most column of the table, so on average the distance between bits is roughly N/2. This may not be the best real-world scenario, so take it with a grain of salt. But it does demonstrate how well EWAH can compress.

At the bottom, you will find the full list of bitmaps I generated for the tests. (Look at the bitmaps tab)

Cursor vs Nested Iterators

Another interesting benchmark I did was to test the performance of the cursor-based implementation against the nested-iterators implementation. I tested the two implementations using a wide variety of combinations, totaling 225 benchmarks per bit operation (AND, OR, XOR, ANDNOT).

The above gist is the shell script that generates the wrappers for calling a benchmark function in Go. Btw, I love the testing package for Go as it makes it really easy for me to create tests and benchmarks. However, not having the ability to run benchmarks with different parameters without creating wrappers like these is a huge pain in the butt.

The result is quite telling. The cursor-based implementation is 2-4 times faster than the nested-iterator version. See the “Performance Results” tab below.

  • AND/OR/XOR/ANDNOT 1 – This is the cursor-based implementation
  • AND/OR/XOR/ANDNOT 2 – This is the nested-iterator-based implementation

Embedded Spreadsheet


Update: @bradfitz commented that allocating a new slice during each add/check is probably not good for performance. And he is in fact correct. After removing the allocation for each add/check and doing a single slice make during New(), the performance increased by ~27%!! Here’s the gist containing the new ns/op results.


Another week, another “Go Learn” project. This time, in project #4, I implemented a Go package with several bloom filters and benchmarked them with various hash functions. (For previous projects, see #3, #2, #1.)

The goal of this project, for me at least, is to implement a pure Go package that doesn’t rely on wrappers to other languages. There are already quite a few bloom filters implemented in Go. But hey, in the name of learning, why not implement another one!

Bloom Filters

Wikipedia says,

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either “inside set (may be wrong)” or “definitely not in set”.

In my little project, I implemented the following three variants:

  • Standard – This is the classic bloom filter as described on Wikipedia.
  • Partitioned – This is a variant described by these papers. Basically, instead of having a single big bit array, a partitioned bloom filter breaks it into k partitions (or slices) such that each partition is m/k bits, where m is the total number of bits and k is the number of hash functions. Each hash function is then assigned to a single partition.
  • Scalable – This is yet another variant, described here. The idea is that the standard bloom filter requires that you know m, the size of the filter, a priori. This is not possible for use cases where data continues to come in without bound. So the scalable bloom filter uses multiple bloom filters, each with increasing k but decreasing P, where P is the desired error probability. This bloom filter also introduces r, the error tightening ratio, with 0 < r < 1.

There are a ton more variants of bloom filters. You can read all about them in this paper and this paper.

Hash Functions

To add an element, a bloom filter hashes the element using k hash functions, identifying k positions in the bit array, and sets those bits to 1. To check for an element, you essentially do the same thing (hash the element with k hash functions) and then check whether all the bits at those positions are set. If they are all set, then the element is most likely in the set. If any of them are not set, then the element is definitely not in the set. So bloom filters can have false positives, but not false negatives.

However, actually running k hash functions is quite expensive. Kirsch and Mitzenmacher showed that by computing just two base hash values and deriving the k values with the formula gi(x) = h1(x) + i * h2(x), the false positive behavior stays essentially the same. So this is what I used here.

For the values h1 and h2, I use the first 4 bytes returned from each hash function for h1, and the second 4 bytes for h2.
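In code, the idea looks roughly like this sketch using the standard library FNV-64 hash; the helper name and parameters are illustrative, not the package’s exact API:

// Sketch of the Kirsch-Mitzenmacher double hashing described above: one
// 64-bit hash is split into two 32-bit halves, and the k bit positions are
// derived as g_i(x) = h1(x) + i*h2(x). Names and signature are illustrative.
package main

import (
	"fmt"
	"hash/fnv"
)

func bitPositions(data []byte, k, m uint32) []uint32 {
	h := fnv.New64()
	h.Write(data)
	sum := h.Sum64()
	h1 := uint32(sum >> 32) // first 4 bytes of the hash
	h2 := uint32(sum)       // second 4 bytes of the hash
	pos := make([]uint32, k)
	for i := uint32(0); i < k; i++ {
		pos[i] = (h1 + i*h2) % m // reduce into the m-bit array
	}
	return pos
}

func main() {
	fmt.Println(bitPositions([]byte("hello"), 10, 3405140))
}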

The following hash functions are used for this little experiment:

  • FNV-64 – There’s a built-in Go package for this. So that’s what I used.
  • CRC-64 – Again, using the built-in Go package.
  • Murmur3-64 – There’s no built-in package so I used this one.
  • CityHash-64 – Again, no built-in so I am using the one I implemented.
  • MD5 – Using the built-in Go implementation.
  • SHA1 – Using the built-in Go implementation.

As a side note, MD5 returns a 128-bit hash value and SHA1 a 160-bit one. Since we only use the first 64 bits, we throw away the rest.

Test Setup

The machine that ran these tests is a Macbook Air (OS X 10.8.4, 1.8 GHz Intel Core i5, 4 GB 1600 MHz DDR3).

For the false positive test, I added all the words in /usr/share/dict/web2 (on my Mac) into each of the bloom filters. To check for false positives, I checked for all the words in /usr/share/dict/web2a in each of the bloom filters. The two files should contain completely different sets of words (AFAICT).

  235886 /usr/share/dict/web2
   76205 /usr/share/dict/web2a

For each bloom filter, I ran tests using the following combinations:

for i in element_count[236886, 200000, 100000, 50000]
    for j in hash_functions[fnv64, crc64, murmur3-64, cityhash-64, md5, sha1]
        run test with hash_functions[j] and element_count[i]
    end
end

Element count here is the initial size I set for the bloom filter. The bit array size, m, and the number of hash values, k, are then calculated from there. I also set P (error probability) to 0.001 (0.1%) and p (fill ratio, or how full the bit array should be) to 0.5. The idea behind the element counts is to see how the bloom filters perform when they have a high fill ratio.
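For reference, the classic sizing formulas give roughly the following values; this is a sketch of the math rather than the package’s exact derivation, which may round differently:

// Classic bloom filter sizing from n (expected elements) and P (error
// probability): m = -n*ln(P)/(ln 2)^2 bits and k = (m/n)*ln 2 hash values.
// Sketch only; the package's own derivation may differ in rounding.
package main

import (
	"fmt"
	"math"
)

func estimateParams(n uint, P float64) (m, k uint) {
	m = uint(math.Ceil(-float64(n) * math.Log(P) / (math.Ln2 * math.Ln2)))
	k = uint(math.Ceil(float64(m) / float64(n) * math.Ln2))
	return
}

func main() {
	// For 236886 elements and P = 0.001, this works out to roughly 3.4M bits and 10 hash values.
	fmt.Println(estimateParams(236886, 0.001))
}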

For the scalable bloom filter test, I also needed to add another dimension, since it uses either the standard or the partitioned bloom filter internally. So:

for i in element_count[236886, 200000, 100000, 50000]
    for j in hash_functions[fnv64, crc64, murmur3-64, cityhash-64, md5, sha1]
        for k in bloom_filter[standard, partitioned]
            run test with hash_functions[j] and element_count[i] and bloom_filter[k]
        end
    end
end

For the performance test, I added all the words in web2 into the bloom filters. I keep looping through the file until b.N (the iteration count from the Go benchmark framework) is met. So some of the words get re-added, which should not skew the results since the operations are the same.
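In benchmark form, that loop looks roughly like the sketch below; loadWords and the filter constructor are hypothetical stand-ins for the repo’s helpers:

// Sketch of the add benchmark: wrap around the dictionary until b.N adds are
// done. loadWords and New are hypothetical placeholders, not the repo's API.
func BenchmarkStandardAdd(b *testing.B) {
	words := loadWords("/usr/share/dict/web2") // hypothetical: read the file into [][]byte
	bf := New(236886)                          // hypothetical: a bloom filter sized for web2
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		bf.Add(words[i%len(words)]) // re-adding a word costs the same as adding a new one
	}
}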

You can see the tests in the github repo. Just look for all the _test.go files.

Test Results

The following is a summary of the test results. You can also feel free to look at all the gory details.

Note: FR = Fill Ratio, FP = False Positive

For the spreadsheet embedded below, here are some observations:

  • The MD5 Go implementation I used may be broken, or I am not using it correctly, or MD5 is just bad. You will see that the fill ratio is VERY low, 1-6%, so the false positive rate is very high (89+%).
  • The CityHash Go implementation seems very slow. Could just be my implementation (anyone want to venture some optimization?). But functionally it’s correct.
  • Both standard and partitioned bloom filters use the same number of bits and there’s not a huge difference in fill ratio and false positive rate. (Ignoring the MD5 anomaly.)
  • Predictably, as fill ratio goes up, so does the false positive rate for both standard and partitioned bloom filters.
  • The scalable bloom filter uses more bits, as it continues to add new base bloom filters when the estimated fill ratio reaches p, which is set to 0.5 (50%).
  • Probably not surprisingly, Scalable bloom filter maintains a fairly low false positive rate.
  • However, you will also notice that the scalable FP rate increases as the total number of base filters increases. This suggests that I may want to try a lower r (error tightening ratio). Currently I use 0.9, but maybe 0.8 is more appropriate.
  • Overall it seems FNV is probably good enough for most scenarios.
  • Also Scalable+Standard might be a good choice for anyone doing stream data processing.

This chart shows the performance (ns/op) for adding elements to the bloom filters. Overall the performance is very similar for the different bloom filter implementations.

Feedback

Feel free to send me any feedback via github issues or on Hacker News.

References

During this process I referenced and leveraged some of these other projects, so thank you all!

  • Referenced
    • willf/bloom – Go package implementing Bloom filters
    • bitly/dablooms – scaling, counting, bloom filter library
    • There’s also a bunch of papers, some of which I linked above.
  • Leveraged


Most of my nights and weekends in the past week have been immersed in my new skiplist library, even sacrificing some family time (yikes, not good!). It’s one of those projects I picked up in order to learn Go.

Wikipedia defines a skip list as

A skip list is a data structure for storing a sorted list of items using a hierarchy of linked lists that connect increasingly sparse subsequences of the items. These auxiliary lists allow item lookup with efficiency comparable to balanced binary search trees (that is, with number of probes proportional to log n instead of n).

You can imagine using a skiplist data structure wherever you would use a binary search tree.

This skiplist implementation also uses search fingers, based on William Pugh’s work, A Skiplist Cookbook. So the efficiency is O(log k) where k is the distance between the last searched/updated item and the current one.

Performance

These numbers are from my Macbook Air (OS X 10.8.4, 1.8 GHz Intel Core i5, 4 GB 1600 MHz DDR3).

Notice the fastest time is BenchmarkInsertTimeDescending. This is because the keys for that test are generated using time.Now().UnixNano(), which is always increasing. Since the sort order of the skiplist is descending, the new item is ALWAYS inserted at the front of the list. This happens to be the best case, O(1).

The next best time is BenchmarkInsertTimeAscending. This is still pretty good, but because the sort order is ascending, the new items are ALWAYS inserted at the end. This requires the skiplist to walk all the levels, so it takes a bit longer.

The other benchmarks should have the average O(log k) efficiency.

Example (Int)

You can see additional examples in the test file.

// Creating a new skiplist, using the built-in Less Than function as the comparator.
// There are also two other built-in comparators: BuiltinGreaterThan, BuiltinEqual
list := skiplist.New(skiplist.BuiltinLessThan)

// Inserting key, value pairs into the skiplist. The skiplist is sorted by key,
// using the comparator function to determine order
list.Insert(1,1)
list.Insert(1,2)
list.Insert(2,1)
list.Insert(2,2)
list.Insert(2,3)
list.Insert(2,4)
list.Insert(2,5)
list.Insert(1,3)
list.Insert(1,4)
list.Insert(1,5)

// Selecting items that have the key == 1. Select returns a Skiplist.Iterator
rIter, err := list.Select(1)

// Iterate through the list of items. Keys and Values are returned as interface{}, so you
// need to type assert them to your own type
for rIter.Next() {
  fmt.Println(rIter.Key().(int), rIter.Value().(int))
}

// Delete the items that match key. An iterator is returned with the list of deleted items.
rIter, err = list.Delete(1)

// You can also SelectRange or DeleteRange
rIter, err = list.SelectRange(1, 2)

rIter, err = list.DeleteRange(1, 2)

Builtin Comparators

There are three built-in comparator functions:

  • BuiltinLessThan: if you want to sort the skiplist in ascending order
  • BuiltinGreaterThan: if you want to sort the skiplist in descending order
  • BuiltinEqual: if you just want to compare for equality

Currently these built-in comparator functions work for all built-in Go types, including:

  • string
  • uint64, uint32, uint16, uint8, uint
  • int64, int32, int16, int8, int
  • float32, float64
  • uintptr


In one of the projects I am working on, we are trying to select a message encoding scheme. The use case is pretty simple. We have a server that accepts data records from different clients, and our goal is to support 100K messages per second per node. Our selection criteria are pretty simple as well:

  • Compactness: how compact is the post-encoding byte array
  • Performance: how fast is marshalling and unmarshalling of data

Out of the various encoding schemes out there, including BSON, MessagePack, ProtoBuf, Thrift, etc., we decided to test BSON and MessagePack due to their closeness to JSON (we use the JSON format quite a bit internally).

Here’s a very short and simple Go program we wrote to test the performance. The libraries I am using are the Go codec library and the Go BSON library. (I am sure one can argue that the library used affects the result, which I would agree with. However, these are the best libraries for Go AFAICT.)
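The core of the comparison looks roughly like the sketch below, assuming the import paths github.com/ugorji/go/codec and gopkg.in/mgo.v2/bson (the paths in the actual test program may differ); the real program wraps this in Go benchmarks and verifies round-trips:

// Sketch of encoding the same record with MessagePack (via the codec library)
// and BSON (via the mgo bson library) to compare output sizes. The Record
// fields are made up for illustration.
package main

import (
	"fmt"

	"github.com/ugorji/go/codec"
	"gopkg.in/mgo.v2/bson"
)

type Record struct {
	Timestamp int64
	Host      string
	Message   string
}

func main() {
	rec := Record{Timestamp: 1404000000, Host: "node1", Message: "hello"}

	// MessagePack encoding.
	var mh codec.MsgpackHandle
	var mp []byte
	if err := codec.NewEncoderBytes(&mp, &mh).Encode(rec); err != nil {
		panic(err)
	}

	// BSON encoding.
	bs, err := bson.Marshal(rec)
	if err != nil {
		panic(err)
	}

	fmt.Printf("msgpack: %d bytes, bson: %d bytes\n", len(mp), len(bs))
}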

The results are as follows:

  • For marshalling, BSON is about 15-20% slower than MsgPack.
  • For unmarshalling, BSON is about 300% slower than MsgPack.
  • Size-wise, BSON output is about 20% larger than MsgPack.

So looks like msgpack is the way to go!

Special thanks(!) to realrocker in #go-nuts and Ugorji for their help in cleaning up my code and helping me figure out my problems.

Some notes about the codec library:

  • According to Ugorji, decoding without a schema in the codec library uses the largest type for decoding, i.e. int64, uint64, float64. So if you don’t have a schema, the result will not pass a reflect.DeepEqual test unless you initially convert to those types. This is documented here
  • Ugorji also quickly fixed a bug where “codec should be able to decode into a byte slice from both msgpack str/raw format or msgpack bin format (new in finalized spec).” AWESOME response time!
  • To get string support (encode/decode strings), make sure RawToString is set to true (see below)
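As a sketch, enabling that on the handle looks like this (`encoded` stands in for the msgpack bytes being decoded; only the RawToString line is the point):

// Enable string round-trips when decoding msgpack with the codec library.
var mh codec.MsgpackHandle
mh.RawToString = true // decode msgpack raw/str data into Go strings instead of []byte

var out interface{}
err := codec.NewDecoderBytes(encoded, &mh).Decode(&out) // encoded: the msgpack bytes from earlier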
