Discovering message flows in actor systems with the Spider Pattern

In this post I’m going to show a pattern that can be used to discover facts about an actor system while it is running. It can be used to understand how messages flow through the actors in the system. The main reason why I built this pattern is to understand what is going on in a running actor system that is distributed across many machines. If I can’t picture it, I can’t understand it (and I’m in good company with that quote :)

Building actor systems is fun but debugging them can be difficult, you mostly end up browsing through many log files on several machines to find out what’s going on. I’m sure you have browsed through logs and thought, “Hey, where did that message go?”, “Why did this message cause that effect” or “Why did this actor never get a message?”

This is where the Spider pattern comes in.

The below figure shows a (hypothetical) actor system that processes images from highway camera’s. The system tries to recognize different types of objects in the images. The figure shows the possible flows through the system.

Example flow

When debugging this system, you will want to get information about the performance of the actors. If an image did not result in some recognized data you want to find out what went wrong and where. While the system is in production it is also good to know that everything still runs according to performance requirements. In many cases some external event creates a message and sends it through actor systems across several machines in some shape or form. The actors in the system do something based on this message. The actors receive messages that are related to the event, do some processing, send messages to each others about the event and create new messages as part of the flow.

The easiest way to get an understanding of how the messages flow through the total system is to give every event a unique ID, and put that ID in every message. Simple tracing of message flows. Whenever an actor receives a message, it logs the ID, time, name of the actor and some contextual information to a file with the ActorLogging trait. Based on the information in the log files you can recreate a picture of how messages flow through the system, how long every actor took and so on. That gives you some information, but it is still quite hard to trace a flow of messages in log files across many machines. There is always the option of processing the log files, but you don’t want to download all the logs just to get the information about a couple of messages.

What you would really want is some way to ask the system where messages went, which actors were involved, and what was the cause of the behavior of the system. Maybe you would like to build a picture of the active flows in the system, which  actors where involved, which servers where used and how long everything took. The below figure is an example of a picture that you might like to be able to build:

Message flow across Actor Systems
Ideally you should be able to take some kind of action as well based on the information you get back. In the above example maybe you want to restart actor ‘c’ because the network connection to actor ‘z’ is very slow and you would rather switch sending to another server. Let’s see what we can do to build a pattern that provides a bases for this type of functionality.

Requirements

The requirements for this pattern are the following:
  • I have an ID of a message flow, and now I want to know which actors were involved. I know at least one ActorPath to an actor in the system that was part of the flow. The pattern must provide enough information to build up a picture of how the actors are connected and how the messages flow through them.
  • I want to be able to run some kind of diagnostics on all the actors that took part in the flow and collect results. Information like the time an actor takes to process the message.
  • I’m sure I’ll have other requirements once I’ve got that down, so a generic solution would be great. It should be possible to build other use cases on top of the pattern. Generically, I want to discover meaningful pathways in my actor systems.

Train of thought

How are we going to discover the dynamic networks that are formed between actors? There could be all kinds of routers, balancers, broadcasters between the actors that forward messages. Well, actors have four core operations:
  1. Create
  2. Send
  3. Become
  4. Supervise
An actor system is basically a network of nodes with these operations. Every node processes messages sequentially. If you look at an actor system in terms of cause and effect, the cause is a message that is sent to the actor, the effect is that the actor sends out messages to other actors. From the perspective of discovering messaging networks, only creating and sending is relevant. (the other two might be interesting for a later post)
Cause and effect
If we want to discover how messages are sent in the network of actors, we need to know a couple of things:
  1. From which actors is an actor receiving messages?
  2. To which actors is an actor sending messages?
  3. Which actors where created by an actor? (Since the actor might send messages to these later)

We need to attach to every Actor in the system and find out these three facts.

Once we know the causing actor and effected actors of every actor we can build up a network across which the messages must have travelled. We can (temporarily) keep information at every node in the network, which we can retrieve later.

Pattern overview

The pattern uses the concept of spiders on a web. Web nodes are attached to every actor, and intercept incoming and outgoing messages. The web is built by the actors communicating with each other, since the web nodes are eavesdropping what every actor receives and sends. The web nodes keep track of who communicated with who, and builds up a web. Whenever a message is sent, received, or an actor created, pathways in the web are expanded.

spider pattern

Spider pattern second part

The web nodes should stick to the actors, and communicate through the actor references of the actors without the actors noticing it. When a web node gets a request, it can collect the data that it built up while the actor was sending messages and forwards the request to all paths in and out of the actor, reaching every web node attached to every actor. These web nodes in turn collect their data and broadcast the message, until all web nodes are reached. Spiders should crawl this web, collect data and send this data back to some location (lets call it a spider home).

Implementation

Lets look at the code for this pattern. Because there is no way to automatically swap out ActorRefs on the actors in the network, We will need to take some considerations when building the actors. (the ActorRef class has a private self type, so you can’t extend it outside of Akka. The ActorContext has some protected methods you cannot delegate for outside of Akka as well) We can’t use the Actor methods directly because we want the web nodes to intercept messages. So the first thing to build is a little trait that sort of isolates the actor from the ‘direct’ Akka API and provides methods for sending messages and creating actors:

This trait provides a couple of methods that we will use from out Actors instead of the direct Akka API. The above implementation just forwards to the API without any changes.

Now for the WebNode. The WebNode is built as a Stackable Trait which can be added to an Actor as we will see a little later. Basically it replaces the receive partial function of an Actor wit a receive that wraps the original receive of the actor. This is how we are going to plug into the actor, detect senders and receivers of messages and keep track of the pathways through the systems.

The WebNode overrides the receive PartialFunction of the Actor that it is stacked on, by first handling any requests that the webnode should handle and otherwise wrapping the original receive of the Actor with the wrappedReceive PartialFunction. The WebNode extends Node so that it can plug into the methods for sending and creating actors when these are used from the actor.

Lets first look at wrappedReceive. What is basically does is that it records who was the sender of the message (the cause of the message), allows for before and after hooks, and calls the super.receive method so that the Actor can do it’s work as usual. The cool thing about Akka is that if messages are forwarded through routers the original sender is passed along so we can even look through Akka routers or anything that keeps the original sender. Both the input ActorRefs and the output ActorRefs of the actor are recorded for later. This works because ActorRefs are immutable and always point to the correct Actor.


As you can see, the wrappedReceive only allows for normal processing when the request is not of a special type, (a Tuple of Request and Spider), which we will get into later. Most importantly, the WebNode records inputs and the outputs of the actor. The handleRequest makes it possible for the WebNode to handle special requests through the ActorRef of the Actor it is attached to. The WebNode[Data, Request] is generic and allows for you to plugin any type of Request for which some type of Data needs to be returned.

Spiders have to be able to travel the WebNodes to collect data in the web. The WebNode handle (Request, Spider) messages for this purpose. handleRequest calls a collect method which you can implement to provide data back to the ‘spider home’. The sendSpiders method is called if some data could be collected for the request. The default implementation of sendSpiders just sends new spiders further up the actors in the network, using the pathways that where collected. (the check on the unique ID is there to prevent the same request to be handled more than once per WebNode.)

The above gist shows the Spider and the WebNodeRef. The Spider basically has a home ActorRef to which it can report the data it collects on the web. It also keeps track of the WebTrail it has followed on the web so it does not go to the same actor twice. The WebNodeRef is a collection of references that gives all the information about the visited node in the actor network. The node itself, all the actors that provided input to the actor (in), and all the actors that the actor sent messages to (out).

Diagnostics

So lets see how we can build some diagnostics on top of this generic WebNode:

The Diagnostics trait extends from the WebNode, and overrides the sendSpiders method. Here it sends DiagnosticsData back to the spider home. The DiagnosticData is a generic case class that can be used for many kinds of diagnostics. Lets implement a timing diagnostics:

The TimingDiagnostics makes some concrete choices in how the request for diagnostics should provide data back to the spider home. It uses the hooks diagnoseBefore and diagnoseAfter from the Diagnostics trait to keep some state about the time it took for every message that has an Id, marked by the HasId trait. Since the processing is sequential in an Actor it is completely safe to expect the before and after to happen in pairs. The collect method is implemented so that it returns a tuple, respectively of the ID of the message, and the time that the handling of the message took (in nano seconds, a detail that can be found in the Diagnostics trait, where now is implemented.)

Testing

Lets write a test to see how everything works. In the test we will use another handy trait to help with testing, a WireTap:

The WireTap is also a Stackable Trait, which basically sends the message that the actor received to a listener after the actor is finished. This is handy when you have an actor that does not send anything out but you do want to know that it is finished. For the test we will build some very simple actors:

We’ve got two actors, a Transformer, which sends messages to a next actor, and a Printer, which prints to the screen. The transformer does not use the Akka API directly but uses the Node trait. If the transformer is used as-is, the Node trait sends messages using the Akka API as normal. Lets see how we now can build an actor system and add some diagnostics to it.

The Transformer is stacked with a TimingDiagnostics. I’ve schematically shown the structure of the ActorSystem that we build in this test in the comments. As you can see I’m using a broadcast router to check if the forwarding through routers also works. All the Actors are fitted with a TimingDiagostics trait. From there we send the first actor in the system a message. We can expect a message with the TestKit, since we have fitted the Printer with a WireTap. We give the WireTap the testActor as a listener, which is how the TestKit receives the messages that the Printer processes. As you can see you can stack the traits on top of each other (hence the name) and stack both the TimingDiagnostics and the WireTap on the Printer actor, combining both behaviors of the traits.

After a message is sent to the system a Promise is created which is completed in an actor on the returnAddress (what I’ve been referring to as the spider home). Now that we have set up the home ActorRef that all diagnostics data can be sent to, We send a diagnostics request to one of the actors in the system that was fitted with a TimingDiagnostics WebNode. We wait for the future to complete in the spider home actor and check the results of the diagnostics data. (I’m cheating a little bit in the receive of the return address / ‘spider home actor’, since I know that I should get 6 responses. In a real system you would just handle this as a stream of messages, and work with what you have at a certain moment in time)

When you run the test the above would be some typical output. A SomeMessage with ID 1 was sent and timing diagnostics where requested for this ‘event ID’. You can see the id and the time in nanoseconds it took in the tuple, the timestamp when the diagnosis was taken, and a WebNodeRef for every actor in the message flow with event ID 1.

I must say I have really enjoyed all the posts in this Summer of Blog series. To follow the tradition set by Derek Wyatt, it’s time to kill some Actors!! (I’m sure this will get me bonus points with some of the team members ;) Lets build some venomous spiders that kill of any actors that are just too d*mn lazy:

Once again we build a trait that we can stack onto an Actor. In this case we provide some Venom. As you can see, the Venom WebNode responds to KillSlowActors requests. While the actors are running, it is (once again) registering the time the actors take. The collect method now checks the time and if any actor takes longer than a seconds, he’s a goner. Polite as these venomous spiders are they first send a message back to the spider home containing the actors they killed. Then they send the actor that they are mixed in with a PoisonPill. Sayonara!

Conclusion

The Spider pattern provides a generic way to discover pathways and collect information about the messages and actors across actor systems. Since the ActorRefs are used throughout Akka, this also works for remote systems (which was not shown in this test, something for another day). There are many more use cases than I’ve shown here. You could collect data to find out which pathways are the most successful in a load-balanced system across many servers. You could create graphical representations of how messages flow through your system. You could trace the cause and the effects of actors. You could initiate a supervision strategy by forcing particular exceptions in the Actors that are connected to the WebNodes. All of the code is up on github github.com/RayRoestenburg. Check out the ‘com.ray.spider’ package for the examples shown here.

About the Author

Raymond Roestenburg works at Xebia as a software architect, polyglot programmer and software craftsman, specializing in Scala and Akka. He has enjoyed working with Akka since the early days of version 0.7, and has recently contributed to the rewrite of the Akka-Camel module that integrates Apache Camel and Akka Actors, which will be released in the Akka 2.1 release. Raymond is also the author of an upcoming book on Akka (working title ‘Akka in Action’) to be published by Manning Publications.

Recent comments

Blog comments powered by Disqus

4 Notes

  1. john-langley reblogged this from hakkers
  2. Raymond Roestenburg submitted this to hakkers