2013-12-04 Update #1: Read a very interesting thread on golang-nuts list on the performance of interface. Seems like using interfaces really affects performance. In Joshua’s test he saw a 3-4x performance difference. I decided to try this on the ring buffer implementation since I am currently using interface. A quick test laster, looks like NOT using interface increased performance 2.4x. For now the code is in the “nointerface” branch.
- This project implements a low-latency lock-free ring buffer for variable length byte slices. It is modeled after the LMAX Disruptor, but not a direct port.
- If you have only a single core, don’t use this. Use Go channels instead! In fact, unless you can spare as many cores as the number of producers and consumers, don’t use this ring buffer.
- In fact, for MOST use cases, Go channel is a better approach. This ring buffer is really a specialized solution for very specific use cases.
- Primary pattern of this ring buffer is single producer and multiple consumer, where the single producer put bytes into the buffer, and each consumer will process ALL of the items in the buffer. (Other patterns can be implemented later but this is what’s here now.)
- This ring buffer is designed to deal with situations where we don’t know the length of the byte slice before hand. It will write the byte slice to the buffer across multiple slots in the ring if necessary.
- The ring buffer currently employs a lock-free busy-wait strategy, where the producer and consumers will continue to loop until data is available. As such, it performs very well in a multi-core environment (almost twice as fast as Go channels) if you can spare 1 core per produer/consumer, but extremely poorly in a single-core environment (600 times worse compare to Go channels).
- You can find a lot of information on the LMAX Disruptor. The resources I used include Trisha’s Disruptor blog series, LMAX Disruptor main page, Disruptor technical whitepaper, and Martin Fowler’s LMAX Architecture article.
- You can also read more about circular buffer, producer–consumer problem.
Go Learn Project #7 – Ring Buffer
For project #7, I decided to do something slightly different. This time we will create a ring buffer that can support variable length byte slices, and leverage multi-cores using multiple goroutines.
Primary target use case is for a producer to read bytes from sockets, ZMQ, files, etc that require process, like JSON, csv, and tsv strings, at a very high speed, such as millions of lines per second, and put these lines into a buffer so other consumers can process these.
A good example is when we import files with millions of lines of JSON object. These JSON objects are read from files, inserted into the buffer, and then they are unmarshalled into other structures.
The goal is to process millions of data items per second.
There are several ways to tackle this. A queue or ring buffer is usually a good data structure for this. You can think of a ring buffer is just a special type of queue that’s just contiguous by wrapping itself. As Wikipedia said,
A circular buffer, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.
So, implementation-wise, we can tackle this problem using a standard ring buffer. The most common way of implementing a standard ring buffer is to keep track of a head and a tail pointer, and keep putting items into the buffer but ensuring that head and tail don’t cross each other.
In most implementations, the pointers are mod’ed with the ring size to determine the next slot size. Also, mutex is used to ensure only one thread is modifying these pointers at the same time.
In Go, an idiomatic and fairly common way is to use buffered channels to solve this problem. It is effectively a queue where a producer puts data items into one end of the channel, and the consumer reads the data items on the other end of the channel. There are certainly pros and cons to to this approach.
First, it is Go idiomatic! Need I say more… :)
This is probably the easiest approach since Go Channel is already a battle-tested data structure and it’s readily available. An example is provided below in the examples section. Performance-wise it is actually not too bad. In a multi-core environment, it’s about 60% of the performance compare to the ring buffer. However, in a single-core environment, it is MUCH faster. In fact, 1300 times faster than my ring buffer!
There is one major difference between using channels vs this ring buffer. When the channel has multiple consumers, the data is multiplexed to the consumers. So each consumer will get only part of the data rather than going through all the data. This is illustrated by this play. The current design of the ring buffer allows multiple consumers to go through every item in the queue. A obvious workaround is to send the data items to multiple channels.
One down side with my channel approach is that I end up creating a lot of garbage over time and will need to be GC’ed. Specifically, I am creating a new byte slice for each new data item. Again, there is workaround for this. One can implement a leaky buffer. However, because we don’t know how big the data items are before hand, it’s more difficult to preallocate the buffers up front.
There might actually be a way to implement the leaky buffer with a big preallocated slice. I may just do that as the next project. The goal is to see if we can avoid having to allocate individual byte slices and leverage CPU caching for the big buffer.
Lock-Free Ring Buffer
The way that I’ve decided to tackle this problem is to model the ring buffer after the LMAX Disruptor. If you haven’t read Martin Fowler’s article on LMAX Architecture, at this time I would recommend that you stop and go read it first. After that, you should go read Trisha’s Disruptor blog series that explains in even more details how the Disruptor works.
One thing to keep in mind is that the Disruptor-style ring buffer has significant resource requirement, i.e., it requires N cores, where N is the number of producers and consumers, to be performant. And it will keep cores busy by busy waiting (looping). So huge downside. If you don’t need this type of low latency architecture, it’s much better to stay with channels.
These benchmarks are performed on a MacBook Pro with 2.8GHz Intel Core i7 procesor (Haswell), and 16 GB 1600MHz DDR3 memory. Go version 1.2rc5.
The 2 channel consumers benchmark is a single producer sending to 2 channels, each channel consumed by a separate goroutine. So it is apples-to-apples compare to the ring buffer 2 consumers benchmark.
You can see clearly the requirement of 1 core per consumer/producer in the ring buffer implementation (first 6 lines). Without that, performance suffer greately!
Also notice that the channel benchmark (last 6 lines) is faster for a single core than multi-cores. This is probably due to your friendly cache at play.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Go Channel: 1 Producer and 1 Consumer
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Ring Buffer: 1 Producer and 1 Consumer
This test function creates a 256-slot ring buffer, with each slot being 128 bytes long. It also creates 1 producer and 1 consumer, where the producer will put the same byte slice into the buffer 10,000 times, and the consumer will read from the buffer and then make sure we read the correct byte slice.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
Ring Buffer: 1 Producer and 2 Consumers
As mentioned before, the ring buffer supports multiple consumers. This example shows how you would create two consumers.
See tl;dr on top.