Status Report of Akka Cluster
With the remoting in Akka 2.0 you can build great distributed actor systems, but you need to manage deployment locations, host names and ports yourself. When looking up an actor on a remote node with actorFor you need to know the host and port.
One of the primary goals with Akka Cluster is to let the cluster manage the remote locations and you can interact with one location transparent actor tree representing all actors in the cluster, instead of one tree per node.
Another important thing that Akka Cluster will solve is that actors will be able to move. Once you obtained a reference to the actor you can interact with it even though the cluster decides to move it from one host to another.
This isn’t a trivial problem to solve. We have started from the ground up, with two core features of the cluster. We have developed cluster membership and failure detection and it’s in a good shape, so that we can start building other things on top of it.
Cluster membership is all about keeping track of what nodes that participate in the cluster and what state they are in. We use a decentralized peer-to-peer cluster and membership is communicated using a Gossip Protocol.
When all nodes have seen the same cluster state one of the members, the leader, can take actions, such as shifting members in and out of the cluster. The leader is not predefined or special in any way. There is no leader election process, the leader can always be recognised deterministically by any node whenever there is gossip convergence. The tasks the leader performs are fairly low intensive, basically just coordinates changes in the cluster membership.
Members in the cluster monitors heartbeat messages from other members to detect if a node is unreachable from the rest of the cluster. For this we have based the implementation on ideas described in the paper The Phi Accrual Failure Detector by Hayashibara et al.
You can find more detailed description of the cluster membership, gossip, and failure detection in the Akka Cluster Specification.
It would not be possible to deliver these things with good quality without proper testing infrastructure. Therefore we have put a lot of effort into developing tools to facilitate testing of distributed Akka actor systems. This is something we will make available for you to test your own remote or cluster Akka systems.
It’s an evolution of the multi-jvm tool to be able to automatically deploy and run the test nodes on different physical machines. Logs and test results are aggregated to one location.
We have developed a testkit for writing those multi-node tests. It solves orchestration of test steps across the nodes, using barriers coordinated by one of the nodes. Various ways of simulating network problems is supported using a special Netty transport.
The same test code can be developed and run locally, and then run automatically on several nodes by a CI server. We use 11 machines to continously run the Akka cluster tests. It has been invaluable for finding bugs.
It’s most illustrative to explain how we write the tests with some code:
In the sample test above you see how we programmatically interact with the Cluster. For real operations we provide a shell script for sending requests and commands to the nodes in the cluster.
We look very much forward to developing next steps, which will be things like:
- Partitioning of actors to the nodes in the cluster
- Naming service that keeps tracks of the location of the actor partitions
- Handoff when an actor is moved from one node to another
- Hook up death watch and supervision with the failure detector
- Cluster aware routers
- Subscription to cluster domain events via event bus
The reason for basing the clustering on a clean-room implementation based on actors and gossip is a well thought through decision and not just a sign of NIH. In the process we have implemented and thrown away prototypes using both ZooKeeper, JGroups and Hazelcast. None of them solved the problem optimally and each one imposed its own set of unnecessary constraints and drawbacks. We strongly believe that a loosely coupled, eventually consistent, fully decentralized and self-coordinating P2P solution is the right way to scale an actor system and to make it fully resilient to failure.
Our vision is that Akka cluster will be elastic and scale from a few nodes up to 100-1000 nodes. We need to test thoroughly before we promise anything. Current implementation has some known limitations but we know how to eliminate each one of them and we will address them one by one as bottlenecks and constraints arise.
If you are interested in contributing the best way to start is to write additional multi-node tests. Look at the existing tests to get started.
Track progress of the Coltrane milestone in Assembla and the Roadmap.
Read more in the Akka Cluster Specification.