Scalability of Fork Join Pool

Akka 2.0 message passing throughput scales much better on multi-core hardware than in previous versions, thanks to the new fork join executor developed by Doug Lea. One micro benchmark shows an 1100% increase in throughput!

The new 48-core server had arrived and we were excited to run the benchmarks on the new hardware, but the initial results were disappointing. It didn’t scale! What was wrong?

The purpose of the micro benchmark is to see how the throughput of message send and receive is affected by an increasing number of concurrent, active actors sharing the same dispatcher. Pairs of actors send messages to each other, classic ping-pong. Load is increased by adding more pairs of actors that process messages in parallel with the other actors.
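The shape of the workload can be sketched with plain threads and blocking queues (a schematic of the message pattern only; the actual benchmark uses Akka actors on a shared dispatcher, and `runPair` is a made-up name):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Schematic of one ping-pong pair: two parties bounce a token back and
// forth, counting two messages (one ping, one pong) per round trip.
def runPair(rounds: Int): Long = {
  val toPong = new LinkedBlockingQueue[Int]()
  val toPing = new LinkedBlockingQueue[Int]()

  val pong = new Thread(new Runnable {
    def run(): Unit =
      for (_ <- 1 to rounds) toPing.put(toPong.take()) // echo each ping back
  })
  pong.start()

  var messages = 0L
  for (i <- 1 to rounds) {
    toPong.put(i)  // send ping
    toPing.take()  // wait for pong
    messages += 2
  }
  pong.join()
  messages
}

println(runPair(1000)) // 2000 messages for 1000 round trips
```

The benchmark scales the load by running many such pairs in parallel on the same dispatcher.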

Full source code of the benchmark: TellThroughputPerformanceSpec.scala

Hardware and configuration:

  • Processor: 48 core AMD Opteron (4 sockets, each with a dual-die six-core AMD® Opteron™ 6172 2.1 GHz processor)
  • Memory: 128 GB ECC DDR3 1333 MHz memory, 16 DIMMs
  • OS: Ubuntu 11.10
  • JVM: OpenJDK 7, version “1.7.0_147-icedtea”, (IcedTea7 2.0) (7~b147-2.0-0ubuntu0.11.10.1)
  • JVM settings: -server -XX:+UseNUMA -XX:+UseCondCardMark -Xms1024M -Xmx2048M -Xss1M -XX:MaxPermSize=128m -XX:+UseParallelGC
  • Akka version: 2.0-M4
  • Dispatcher configuration other than default:
    core-pool-size: 48 for thread-pool-executor
    parallelism: 48 for fork-join-executor
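In configuration-file terms this corresponds roughly to a dispatcher block like the following (a sketch; the exact key names should be checked against the Akka 2.0 reference configuration, and `benchmark-dispatcher` is a made-up name):

```
benchmark-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    # pin parallelism to the 48 hardware cores
    parallelism-min = 48
    parallelism-max = 48
  }
}
```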

When using the thread pool executor (java.util.concurrent.ThreadPoolExecutor) the benchmark didn’t scale beyond 12 parallel actors. Throughput was stuck at 1.4 million messages per second and didn’t increase with added load, even though the 48 core box at first glance was not heavily loaded: less than 10% of total CPU capacity was used.
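The setup under suspicion looks roughly like this (illustrative sizes; the dispatcher builds the pool internally rather than by hand like this):

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Every one of the 48 worker threads takes its work from this single
// shared queue, so each dequeue goes through the same lock.
val sharedQueue = new LinkedBlockingQueue[Runnable]()
val pool = new ThreadPoolExecutor(48, 48, 60L, TimeUnit.SECONDS, sharedQueue)

val executed = new AtomicInteger()
for (_ <- 1 to 1000)
  pool.execute(new Runnable { def run(): Unit = executed.incrementAndGet() })

pool.shutdown()
pool.awaitTermination(10, TimeUnit.SECONDS)
println(executed.get()) // 1000
```

With many workers blocking and waking on that one queue, lock handoffs translate directly into context switches.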

First we thought the BIOS configuration was wrong, since the machine was new, but after going through all settings and turning off all power savings options the result was still as bad. We also bought more memory, to use 4 DIMMs per CPU for maximum memory bandwidth.

We noticed that the number of context switches was abnormally high, above 70,000 per second.

That must be the problem, but what was causing it? Viktor came up with the qualified guess that it must be the task queue of the thread pool executor: it is shared by all worker threads, and the locks in the LinkedBlockingQueue could generate the context switches when there is contention.

Discussions with Doug Lea resulted in an improved implementation of the fork join pool, which we have embedded in akka-actor and which is used by default. The task queue is striped, using randomized queuing and stealing. Read more about it here.
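The same submission pattern against a fork join pool illustrates the structural difference (this sketch uses the JDK’s java.util.concurrent.ForkJoinPool; Akka 2.0 embeds its own jsr166-derived copy):

```scala
import java.util.concurrent.{ForkJoinPool, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Each worker thread in a fork join pool owns its own work deque; idle
// workers steal from the others, so tasks do not all funnel through one
// locked queue as they do with ThreadPoolExecutor + LinkedBlockingQueue.
val pool = new ForkJoinPool(48) // parallelism, analogous to the dispatcher setting

val executed = new AtomicInteger()
for (_ <- 1 to 1000)
  pool.execute(new Runnable { def run(): Unit = executed.incrementAndGet() })

pool.shutdown()
pool.awaitTermination(10, TimeUnit.SECONDS)
println(executed.get()) // 1000
```

Because contended handoffs on a single lock are replaced by mostly lock-free per-worker deques, the pool avoids the constant blocking and waking that drove the context-switch count up.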

When running the same benchmark with the fork-join-executor the context switches were normal, around 1300 per second.

The results of the benchmark show that throughput scales with the number of actors up to the number of cores (48) and saturates at around 20 million messages per second.

There are several things that can be tuned to achieve even higher throughput, which we will describe in follow-up blog posts.
