The New Cluster Metrics-Aware Adaptive Load-Balancing Routers and The Cluster Metrics API

Finally with the release of Akka 2.2-M1, one of the most notable features in Akka Cluster becomes available - metrics-based adaptive cluster-aware routers. While cluster-aware routers have been available since Akka 2.1, the internal cluster metrics API and it’s initial consumer of this valuable data now become usable.

Building on my previous post, Cluster Health, State and Fault Tolerance, now the fun can begin.

Proactive Fault Tolerance Versus Reactive

Reactive Fault Tolerance

Anyone using Akka is familiar with the Supervisor Strategy. For some, it is the actor’s fault tolerance that led them to initially try Akka, and a primary reason for final adoption. But what does the Supervisor Strategy give us? Reactive fault tolerance within the actor hierarchy.

Preemptive Fault Tolerance

The release of Akka 2.1 saw the addition of the cluster’s AccrualFailureDetector, based on ‘The Phi Accrual Failure Detector' by Hayashibara et al. This was aptly renamed PhiAccrualFailureDetector and refactored to Akka Remote in 2.2.

This FailureDetector strategy employs “a suspicion level on a continuous scale” to express current network conditions and node availability across a cluster. A configurable threshold value, -log10(1 - F(timeSinceLastHeartbeat), and historical heartbeat inter-arrival times all factor into the decision of whether a failure may have occurred. Not to be overlooked is that the failure detection threshold can be tuned to affect suspicion level, the potential for false positives, and time to detect actual crashes, should they occur. This is fault tolerance over distributed members of node rings with preemptive suspicion-level awareness.

Predictive Fault Tolerance

The AdaptiveLoadBalancingRouter and the Cluster Metrics API

Member nodes of the cluster collect system and JVM health metrics which are both gossipped around the node ring and made available to local subscribers. The primary consumers of this data are the load-balancing routers.

Cluster metrics are at present primarily for load-balancing of nodes. The Cluster Metrics API manages metrics sampling at a regular frequency, prepares highly variable data for further analysis by other entities, and gossips the latest cluster metrics data around the node ring. Each node in the cluster also publishes the current snapshot of this data to the event stream for the AdaptiveLoadBalancingRouter to assist in determining the need to redirect outbound traffic to least-loaded nodes.

The AdaptiveLoadBalancingRouter consumes the latest cluster metrics and pipes the data through particular MetricsSelectors to create a pool of weighted routees based on remaining capacity for message routing. Nodes exhibiting lower load have a higher probability of being routed to than nodes under higher load.

The abstract CapacityMetricsSelector calculates weights from capacity, and is the basis for the following metrics selector implementations:

  • HeapMetricsSelector calculates capacity of remaining JVM heap memory: (max - used) / max
  • CpuMetricsSelector calculates remaining CPU capacity based on the sum of User + Sys + Nice + Wait: 1 - utilization
  • SystemLoadAverageMetricsSelector calculates remaining system load capacity. If the average is nearing the number of cpus/cores it is indicative of possible bottlenecks: 1 - (load / processors)
  • MixMetricsSelector aggregates capacity of heap, cpu and load to calculate combined capacity
  • A custom akka.cluster.routing.MetricsSelector can also be created

The Cluster Metrics API

The default metrics sampling is delegated to a MetricsCollector using JMX MBeans. However you can optionally employ the Hyperic System Information Gatherer (SIGAR) MetricsCollector for a broader range of metrics with more precise values. The SIGAR collector is only utilized if both the SIGAR lib and its appropriate libsigar-{processor-type}-{operating-system}.dylib companion are detected on the classpath. The SIGAR collector is based on the JMX collector to still provide data points for JVM heap memory used, committed, and max (if available for the specific OS). SIGAR also returns the combined CPU metric, i.e. the sum of User + Sys + Nice + Wait. This metric describes the amount of time the CPU spent executing code during n-interval and how much more it could theoretically. Note that 99% CPU utilization can be optimal or indicative of failure.

To add SIGAR, download and add the SIGAR jar: java -classpath sigar-1.6.5.132.jar. Then add the appropriate native library using -Djava.libarary.path.

Smoothing of the data for each monitored process is delegated to an exponentially weighted moving average (EWMA) function. Such an approach captures short-term movements in volatility for a conditional volatility forecasting model. The decay factor provides a statistical streaming data model that is exponentially biased towards newer entries. Unlike a standard moving average model, EWMA only needs the most recent forecast value to be kept, which allows this to be both lighter weight and more efficient.

How To Leverage The Functionality

Subscribing to Metrics Events

Subscribing to cluster metrics events directly allows easy access to the data in order to implement custom event-based logic.

Custom Metrics Collector

Should you need to roll your own metrics collector versus using the default akka.cluster.JmxMetricsCollector or akka.cluster.SigarMetricsCollector, start by looking at these two. Then simply configure your class FQCN.

Parting Thoughts

For application engineers the AdaptiveLoadBalancingRouter and the Cluster Metrics API cover two primary concerns: predictive, proactive fault tolerance and increased performance.

For cloud engineers, stop and consider the potential for capacity planning that could be derived from this statistical data. A machine learning kernel for cloud. Now consider linking this real-time data to EC2, VMware cloud, and other APIs.

Wouldn’t it be nice to automate spinning up new nodes when bottlenecks arise in your Data Centers when VMs near their capacity? Watching it all unfold on the Typesafe Console.

Helena Edelson
Twitter: @helenaedelson

Recent comments

Blog comments powered by Disqus