One of the biggest questions I encounter among users of Akka is how to use dispatchers to create failure zones and prevent failure in one part of the application from affecting another. This is sometimes called the Bulkhead Pattern. The follow-up question is how to size the thread pools, once the failure zones exist, to get the best performance from the fewest system resources. The truth is, every application is different, and you must build your system and measure its performance under load to tune it effectively. But here are some tips and tricks.
Most people building an Akka application start out with a single ActorSystem, using the default dispatcher settings with a minimum of 8 threads and a maximum of 64. As the application grows, they notice that futures time out more frequently, since futures in Akka actors often use the actor’s dispatcher as their ExecutionContext implicitly.
Because one thread pool with limited resources serves all actors and futures in the application, resource starvation occurs. When that happens, I recommend identifying the actors that use futures and considering where you can use a separate dispatcher or ExecutionContext for those futures, so that their thread usage does not impact the actors. The goal is to limit the impact of the work done by those futures on the actors handling messages in their mailboxes.
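To make the idea concrete, here is a minimal sketch using only the JDK rather than Akka itself. It assumes a hypothetical blocking call, blockingLookup, and runs it on a small dedicated pool instead of whatever pool the caller is using. The class name, pool size and thread name are illustrative assumptions, not Akka APIs; in an Akka application the equivalent would be a separately configured dispatcher or ExecutionContext.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class FutureBulkhead {
    // Hypothetical stand-in for a slow, blocking call such as a database query.
    static String blockingLookup(String key) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return key + " handled on " + Thread.currentThread().getName();
    }

    // Runs the blocking work on the supplied pool, so the caller's own
    // threads stay free to keep processing other work.
    static CompletableFuture<String> lookupAsync(String key, ExecutorService blockingPool) {
        return CompletableFuture.supplyAsync(() -> blockingLookup(key), blockingPool);
    }

    public static void main(String[] args) throws Exception {
        // Dedicated, bounded pool for blocking work (size 4 is an arbitrary example).
        ExecutorService blockingPool = Executors.newFixedThreadPool(4, runnable -> {
            Thread thread = new Thread(runnable, "blocking-io");
            thread.setDaemon(true);
            return thread;
        });
        try {
            String result = lookupAsync("user-42", blockingPool).get(2, TimeUnit.SECONDS);
            System.out.println(result);
        } finally {
            blockingPool.shutdown();
        }
    }
}
```

If the blocking calls stall, only the four threads of this pool are tied up; everything running elsewhere keeps its threads.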
As a temporary workaround, I have noticed some people try to use a PinnedDispatcher for each actor so that starvation is less likely. An actor created with PinnedDispatcher receives its own dedicated thread, which stays alive until it has been idle for longer than the keep-alive-time configuration parameter of the ThreadPoolExecutor (default of 60 seconds). However, this is really not a viable solution for production except for very specific use cases, such as service-oriented actors handling a lot of load. For most other tasks, you WANT to share resources among actors with similar roles and risk profiles so that you aren’t dedicating large amounts of resources to each actor. In addition, starting and restarting threads takes time, and each thread also consumes memory for its stack (512KB by default). You will use up your memory very quickly in a system that relies primarily on actors created with PinnedDispatcher.
The key to separating actors into failure zones is to identify their risk profile. Is a task particularly dangerous, such as network IO? Is it a task that requires blocking, such as database access? In those cases, you want to isolate those actors and their threads from those doing work that is less dangerous. If something happens to a thread that results in it completely dying and not being available from the pool, isolation is your only protection so that unrelated actors aren’t affected by the diminishment of resources.
You also may want to identify areas of heavy computation through profiling, and break those tasks out using tools such as Routers (no shared mailboxes and thus no work-stealing) and BalancingDispatcher (one mailbox for all “routees”, and therefore work-stealing in nature). For those tasks that you assign to Routers, you might also want them to operate on their own dispatcher so that the intense computation tasks do not starve other actors waiting for a thread to perform their work.
As a side note, I tell people in my Effective Actors presentation never to parallelize with Routers before they have measured the hot spots in their application. Too often, developers (including myself) make assumptions about where we expect performance to be slow and try to mitigate those problems up front. If you suspect you will need to use a Router somewhere, Akka Team Lead Roland Kuhn recommends that you pass around actor references from supervisors to compute-intensive children so that the ‘context.parent’ of those children is free to become a router later on with minimal impact. But do not create Routers before you have proven at runtime under load that you absolutely need to.
Now the question becomes how to size your dispatchers, and this is where the Typesafe Console can be very handy. In systems where you have several or many dispatchers, keep in mind that the number of threads that can run at any one time on a box is a function of how many cores it has available. In the case of Intel boxes where hyperthreading is available, you could think in terms of double the number of cores if you know that your application is less CPU-bound. I recommend sizing your thread pools close to the number of cores on the box where you plan to deploy your system, then running your system under a reasonable load and profiling it with the Typesafe Console. You can then externally configure the thread pool sizes and check the impact at runtime.
When using the Typesafe Console, watch the Dispatchers view to see if the latency of message handling is within acceptable tolerances of your non-functional requirements, and if not, try adjusting the number of threads upward. Remember, you’re setting the minimum number of threads, the maximum number of threads and the “parallelism-factor”. The thread pool size is calculated as the ceiling of the number of cores on your box multiplied by that factor, bounded by the min and max settings you give.
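The sizing rule can be written out as a small calculation. The sketch below is plain Java, not Akka code, and the method name poolSize is made up for illustration. The bounds of 8 and 64 are the defaults mentioned earlier; the factor values are example inputs only, so check the configuration of your Akka version for the real default.

```java
class PoolSizing {
    // ceil(cores * factor), then clamped into the range [min, max].
    static int poolSize(int cores, double factor, int min, int max) {
        int scaled = (int) Math.ceil(cores * factor);
        return Math.min(Math.max(scaled, min), max);
    }

    public static void main(String[] args) {
        // Example inputs only: a factor of 3.0 with bounds of 8 and 64.
        System.out.println(poolSize(4, 3.0, 8, 64));   // 12: ceil(12.0) is within bounds
        System.out.println(poolSize(2, 3.0, 8, 64));   // 8: ceil(6.0) is raised to the minimum
        System.out.println(poolSize(32, 3.0, 8, 64));  // 64: ceil(96.0) is capped at the maximum

        // The same rule applied to the machine running this code.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(cores + " cores -> " + poolSize(cores, 3.0, 8, 64) + " threads");
    }
}
```

Note that on a small box the minimum dominates and on a large box the maximum does, so changing the factor alone may have no visible effect until the bounds are adjusted as well.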
The Typesafe Console also shows you something else that is very important to watch - the size of each actor’s mailbox. If you see an actor whose mailbox is perpetually increasing in size, you need to retune the threads for its dispatcher or parallelize its task by making it a Router so that it has the resources it needs to keep up with the demands placed on it by the system. The receipt of messages into an actor’s mailbox can be bursty in nature, but over time every actor should drain its mailbox at least as fast as messages arrive.
Once you have an idea of the number of threads you need to handle burstiness in your application (if any), sit down with your team and figure out the minimum and maximum bounds of each thread pool. Don’t be afraid to add a few extra threads to the max to account for possible thread death in production, but don’t go overboard.
Also, pay close attention to the “throughput” setting on your dispatcher. This defines thread distribution “fairness” in your dispatcher - telling the actors how many messages to handle in their mailboxes before relinquishing the thread so that other actors do not starve. However, each time an actor is assigned a thread, a context switch is likely and the CPU caches may have to be warmed again, and warmed caches are one of your biggest friends for high performance. It may behoove you to be less fair so that an actor can handle quite a few messages consecutively before releasing the thread.
There are a few edge cases. If the number of threads is equal to the number of actors using the dispatcher, set the throughput extremely high, like 1000. If your actors perform tasks that will take some time to complete and you need fairness to avoid starvation of other actors sharing the pool, set the throughput to 1. For general usage, start with the default value of 5 and tune this value for each dispatcher so that you get reasonable performance characteristics without the risk of making actors wait too long to handle messages in their mailboxes.
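The trade-off between fairness and consecutive processing can be seen in a toy model. The sketch below is not how Akka schedules actors; it assumes a single thread serving two mailboxes in round-robin order, with each mailbox allowed up to "throughput" messages per turn. All names in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class ThroughputSketch {
    // Builds a mailbox holding messages such as A1, A2, ... for one actor.
    static Deque<String> mailbox(String actor, int messages) {
        Deque<String> box = new ArrayDeque<>();
        for (int i = 1; i <= messages; i++) {
            box.add(actor + i);
        }
        return box;
    }

    // One thread visits the mailboxes in turn; each mailbox handles at most
    // `throughput` messages before giving the thread up. Returns the handling order.
    static List<String> run(List<Deque<String>> mailboxes, int throughput) {
        if (throughput < 1) {
            throw new IllegalArgumentException("throughput must be at least 1");
        }
        List<String> handled = new ArrayList<>();
        boolean workLeft = true;
        while (workLeft) {
            workLeft = false;
            for (Deque<String> box : mailboxes) {
                for (int i = 0; i < throughput && !box.isEmpty(); i++) {
                    handled.add(box.poll());
                }
                if (!box.isEmpty()) {
                    workLeft = true;
                }
            }
        }
        return handled;
    }

    // Counts how often the thread moves from one actor to another.
    static int handoffs(List<String> handled) {
        int count = 0;
        for (int i = 1; i < handled.size(); i++) {
            if (handled.get(i).charAt(0) != handled.get(i - 1).charAt(0)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int throughput : new int[] {1, 2, 5}) {
            List<String> order = run(Arrays.asList(mailbox("A", 4), mailbox("B", 4)), throughput);
            System.out.println("throughput=" + throughput + " order=" + order
                    + " handoffs=" + handoffs(order));
        }
    }
}
```

With a throughput of 1 the two actors alternate on every message, which is the fairest ordering but hands the thread over seven times. With a throughput of 5 each actor drains its four messages in one turn, so there is a single handoff, at the cost of actor B waiting until A has finished.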
For more information, see this page of the documentation on finding bottlenecks in Akka. Hope that helps you with tuning your Akka actors, and happy hakking!