Parallelism For The Win!

Today I’d like to show you a real-life usage example of a Go web app. You know, real life, for real! Not some artificial, imagined problem. But let’s start from the beginning. I’m kinda pedantic when it comes to (graphic) design. I always wear clothes that match in a certain way, and I like things to fit together. That’s why some time ago I wrote a small web service which provides social network popularity statistics, plus a library for creating custom-looking social buttons. I just couldn’t stand Facebook or Twitter buttons that didn’t match the design of my website, with no way to do anything about it. Well, I used this stuff in a few places, e.g. on the website of one of my OSS projects. It was written in Ruby and worked quite fine, until I realized that Go could be ideal for writing it more nicely.

How does it work?

Before we dive into the code, it’s important to know what it should be doing. This time the spec is pretty simple: the app should get popularity information for a given link from the specified external web services. Turned into a use case, it would be something like:

@@@no-highlight
As a user
I want to specify a url to get stats for
And want to select social networks to get stats from
When I perform requests
Then I want to get a response in a reasonable time
And to get a map of social networks associated with points for the url 

In this post we will focus only on the concurrency-related stuff. Of course we are going to need some adapters for the various web services (Tweets, Facebook posts, Google+ shares, etc.). We will skip them here; let’s just imagine them working.

Ruby (lack of) concurrency

Let’s talk about Ruby first… In theory, Ruby 1.9 comes with real threads, but they are still limited by the Global Interpreter Lock, very much like in Python. Long story short, Ruby (and Python) just can’t deal with truly parallel applications. Lots of Rubyists defend this approach with the argument that the GIL makes single-threaded apps faster and makes multi-threaded apps easier to write. You know what… that’s total bullshit. Multi-threaded apps are supposed to be fast and scalable, not necessarily easy to write, while single-threaded applications should slowly fade into the past. We live in a time when multi-core machines are our daily bread; moreover, they’re a requirement for running high-load services. So Ruby, move your ass into the 21st century! You should learn from how Go deals with concurrency.

Concurrency… go, go, go!

Yeah, go is really the best word to describe how this language works:

@@@go
go SomeStuffToDo()

You should already know what this line does. Go runs a super-lightweight goroutine, which under the hood is multiplexed onto native OS threads. You can run thousands, hundreds of thousands, or even millions of them and maintain reliable communication between them. No locks, no mutexes, no heavy threads. Just goroutines communicating with each other over channels. Here’s the ping-pong app…

@@@go
package main

import (
        "fmt"
        "math/rand"
        "time"
)

func hit(msg string, in <-chan string, out chan<- string) {
        for recv := range in {
                time.Sleep(time.Duration(rand.Intn(500)+100) * time.Millisecond)
                fmt.Println(recv)
                out <- msg
        }
}

func main() {
        ping, pong := make(chan string), make(chan string)

        go hit("Ping!", pong, ping)
        go hit("Pong!", ping, pong)

        ping <- "Ping!"
        <-time.After(10 * time.Second)
}

Ok, enough. Let’s write some real code now, as I promised.

Real stuff

You can find the app I wrote up and running on egoistat.com. It’s open source of course, and the source is on GitHub. The Ruby version I wrote some time earlier is also open source and can be found on GitHub as well. Now, let’s compare the two.

Let’s take a look at Ruby first, here’s the code which interests us the most:

@@@ruby
def count(*counters)
  # *snip*

  res = {}
  sem = Mutex.new
  threads = []

  counters.each do |name|
    threads << Thread.new do
      c = send("#{name}_count") 
      sem.synchronize { res[name] = c }
    end
  end

  Timeout.timeout(10) do
    threads.each(&:join)
  end
rescue Timeout::Error
ensure
  return res.clone
end

Pretty straightforward. We take all the specified counters, iterate over them and get the results. Each request runs in its own thread. At the end we have to join all the threads, wrapped in a reasonable timeout. We also have to remember to return a clone of the results to avoid a race condition in case the timeout is reached.

The code looks pretty simple, though. But it’s also very unreliable, especially when there are a lot of requests to the #count method. In that case Ruby is gonna spawn plenty of heavy threads, which will eat your machine’s resources very fast.

Pics or it didn’t happen? Ok, here are some benchmarks… First we launch 20 requests, one at a time:

$ ab -n 20 -c 1 'http://localhost:9292/count.json?url=http://areyoufuckingcoding.me/&n=facebook,twitter,plusone'
Concurrency Level:      1
Time taken for tests:   20.265 seconds
Complete requests:      20
Failed requests:        0
Write errors:           0
Total transferred:      4820 bytes
HTML transferred:       760 bytes
Requests per second:    0.99 [#/sec] (mean)
Time per request:       1013.229 [ms] (mean)
Time per request:       1013.229 [ms] (mean, across all concurrent requests)
Transfer rate:          0.23 [Kbytes/sec] received

This run used 1.3% CPU and 1% of the memory (my machine has a 4-core CPU and 4 GB of memory). Now let’s run something bigger: 40 requests, 4 at a time.

$ ab -n 40 -c 4 'http://localhost:9292/count.json?url=http://areyoufuckingcoding.me/&n=facebook,twitter,plusone'
Concurrency Level:      4
Time taken for tests:   39.190 seconds
Complete requests:      40
Failed requests:        0
Write errors:           0
Total transferred:      9640 bytes
HTML transferred:       1520 bytes
Requests per second:    1.02 [#/sec] (mean)
Time per request:       3918.962 [ms] (mean)
Time per request:       979.740 [ms] (mean, across all concurrent requests)
Transfer rate:          0.24 [Kbytes/sec] received

That sucks a bit. 2x more requests at 4x the concurrency take almost exactly twice the time, eating 2.7% of CPU and 1.5% of memory.

Go and kick Ruby’s ass

Here’s the same stuff written in Go.

@@@go
func (r *Request) Stat(networks ...string) (results ResultsGroup) {
        // *snip*

        var (
                fanin = make(chan *Result, len(networks))
                timeout = time.After(10 * time.Second)
                jobs = 0
        )

        for _, network := range networks {
                if counter, ok := FindCounter(network); ok {
                        jobs++
                        go func(network string) {
                                 fanin <- counter(r).In(network)
                        }(network)
                }
        }

        for ; jobs > 0; jobs-- {
                select {
                case partial := <-fanin:
                        results.Add(partial)
                case <-timeout:
                        return
                }
        }

        return
}

Aww, I already can hear all the Rubyists screaming that it has more lines of code and syntax is not as nice as in Ruby :P

Ok, let’s leave trolling aside and analyze the code. If you haven’t seen the video of Rob Pike’s “Go Concurrency Patterns” talk from the Google I/O conference, now is definitely the time to watch it. What we’ve got here in our code is a simple fan-in pattern. For each specified network we run a goroutine which sends its result to the fanin channel. Later, we just loop and collect all the results. If some of them don’t arrive before the timeout, they’re simply dropped.

There’s one important thing here in the code (thanks to zeebo and skelterjohn for helping me figure this out): what happens to the goroutines which didn’t send their stuff before the timeout? The loop will be over, so with an unbuffered channel they would block forever on sending to fanin. They would also keep references to that fanin channel, so it would never be garbage collected and would stay in memory… that sucks. One way to deal with this problem is to add a buffer to the channel, like we did:

@@@go
fanin = make(chan *Result, len(networks))

We must specify the buffer size as the second parameter of the make builtin. In this case it’s the number of specified networks, which is the maximum number of goroutines we will fire. This approach is the neatest and works fine, but only for small buffers. If we have a lot of networks to deal with, then we have to remove the buffer from the channel and manually handle all the timed-out messages. We can do it in a post-collect loop, in a new goroutine:

@@@go
for ; jobs > 0; jobs-- {
        select {
        case partial := <-fanin:
                results.Add(partial)
        case <-timeout:
                go func() {
                        for ; jobs > 0; jobs-- {
                                <-fanin
                        }
                }()
                return
        }
}

Before we return after the timeout, we must start a new goroutine which reads from the channel until all the jobs are done. Once all the jobs are handled, there are no more references to fanin, so it can be garbage collected.

Now that everything is clear, the most interesting part… benchmarks! Same as with the Ruby app, we’re starting with 20 requests, one at a time.

$ ab -n 20 -c 1 'http://localhost:8080/api/v1/stat.json?url=http://areyoufuckingcoding.me/&n=facebook,twitter,plusone'
Concurrency Level:      1
Time taken for tests:   10.757 seconds
Complete requests:      20
Failed requests:        0
Write errors:           0
Total transferred:      2540 bytes
HTML transferred:       780 bytes
Requests per second:    1.86 [#/sec] (mean)
Time per request:       537.854 [ms] (mean)
Time per request:       537.854 [ms] (mean, across all concurrent requests)
Transfer rate:          0.23 [Kbytes/sec] received

Nothing fancy yet. It’s just about 2x faster than the Ruby script. Obviously that’s because Go is a compiled language and faster in general, and blah blah blah. Egoistat used at most 4.1% of CPU and 0.2% of memory to deal with these requests. Now let’s try 40 requests, 4 at a time, hang on tight…

$ ab -n 40 -c 4 'http://localhost:8080/api/v1/stat.json?url=http://areyoufuckingcoding.me/&n=facebook,twitter,plusone'
Concurrency Level:      4
Time taken for tests:   5.406 seconds
Complete requests:      40
Failed requests:        0
Write errors:           0
Total transferred:      5080 bytes
HTML transferred:       1560 bytes
Requests per second:    7.40 [#/sec] (mean)
Time per request:       540.596 [ms] (mean)
Time per request:       135.149 [ms] (mean, across all concurrent requests)
Transfer rate:          0.92 [Kbytes/sec] received

Anything to say, my friends? This ran roughly 7x faster than the Ruby code (5.4 s versus 39.2 s)! It ate approx. 15% of CPU and 1% of memory. Now you’d probably ask why the Go code ate so much CPU. The answer is simple: it was running way more tasks in parallel. Ruby didn’t manage to run more than a few tasks in its artificial parallelism, which is why it couldn’t get more out of the CPU. One more thing we must keep in mind… is it a bad or a good thing that Go ate more CPU? I think it’s good. It was a kind of fair exchange: it took the available CPU and turned it into request processing speed. Sounds fair to me.

Summary

Ok folks, that’s all for today. I hope you will enjoy Egoistat, and don’t forget to spread the word about it. Thanks to this app I found a few really interesting things we can do in Go, and a few very weird quirks as well. Stay tuned, I’ll describe all of them in upcoming posts.
