In this post I’m going to show a pattern that can be used to discover facts about an actor system while it is running, and to understand how messages flow through the actors in the system. The main reason I built this pattern is to understand what is going on in a running actor system that is distributed across many machines. If I can’t picture it, I can’t understand it (and I’m in good company with that quote :)
Building actor systems is fun, but debugging them can be difficult: you mostly end up browsing through many log files on several machines to find out what’s going on. I’m sure you have browsed through logs and thought, “Hey, where did that message go?”, “Why did this message cause that effect?” or “Why did this actor never get a message?”
This is where the Spider pattern comes in.
The below figure shows a (hypothetical) actor system that processes images from highway cameras. The system tries to recognize different types of objects in the images. The figure shows the possible flows through the system.
When debugging this system, you will want to get information about the performance of the actors. If an image did not result in some recognized data, you want to find out what went wrong and where. While the system is in production it is also good to know that everything still runs according to performance requirements. In many cases some external event creates a message and sends it through actor systems across several machines in some shape or form. The actors in the system do something based on this message. The actors receive messages that are related to the event, do some processing, send messages to each other about the event, and create new messages as part of the flow.
The easiest way to get an understanding of how the messages flow through the total system is to give every event a unique ID and put that ID in every message: simple tracing of message flows. Whenever an actor receives a message, it logs the ID, time, name of the actor and some contextual information to a file with the ActorLogging trait. Based on the information in the log files you can recreate a picture of how messages flow through the system, how long every actor took, and so on. That gives you some information, but it is still quite hard to trace a flow of messages in log files across many machines. There is always the option of processing the log files, but you don’t want to download all the logs just to get the information about a couple of messages.
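As a rough sketch of this idea (the HasId trait and the traceLine helper are names I made up for illustration, not Akka API):

```scala
// Hypothetical sketch: every message carries the ID of the event that
// caused it, so log lines from different machines can be correlated.
trait HasId { def id: Long }
case class PhotoMessage(id: Long, payload: String) extends HasId

// The kind of line an actor could log on receive; actorName and
// timestamp are illustrative parameters.
def traceLine(msg: HasId, actorName: String, timestamp: Long): String =
  s"event=${msg.id} actor=$actorName at=$timestamp"
```

Grepping the logs of all machines for `event=42` would then reconstruct the flow of one event.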
What you would really want is some way to ask the system where messages went, which actors were involved, and what caused the behavior of the system. Maybe you would like to build a picture of the active flows in the system: which actors were involved, which servers were used, and how long everything took. The below figure is an example of a picture that you might like to be able to build:
- I have an ID of a message flow, and now I want to know which actors were involved. I know at least one ActorPath to an actor in the system that was part of the flow. The pattern must provide enough information to build up a picture of how the actors are connected and how the messages flow through them.
- I want to be able to run some kind of diagnostics on all the actors that took part in the flow and collect results. Information like the time an actor takes to process the message.
- I’m sure I’ll have other requirements once I’ve got that down, so a generic solution would be great. It should be possible to build other use cases on top of the pattern. Generically, I want to discover meaningful pathways in my actor systems.
Train of thought
- From which actors is an actor receiving messages?
- To which actors is an actor sending messages?
- Which actors were created by an actor? (Since the actor might send messages to these later.)
We need to attach to every Actor in the system and find out these three facts.
Once we know the causing actor and the affected actors of every actor, we can build up a network across which the messages must have travelled. We can (temporarily) keep information at every node in the network, which we can retrieve later.
The pattern uses the concept of spiders on a web. Web nodes are attached to every actor and intercept incoming and outgoing messages. The web is built by the actors communicating with each other, since the web nodes are eavesdropping on what every actor receives and sends. The web nodes keep track of who communicated with whom, and build up a web. Whenever a message is sent or received, or an actor is created, pathways in the web are expanded.
The web nodes should stick to the actors and communicate through the actor references of the actors, without the actors noticing it. When a web node gets a request, it can collect the data that it built up while the actor was sending messages, and forward the request to all paths in and out of the actor, reaching every web node attached to every actor. These web nodes in turn collect their data and broadcast the message, until all web nodes are reached. Spiders should crawl this web, collect data and send this data back to some location (let’s call it a spider home).
Let’s look at the code for this pattern. Because there is no way to automatically swap out ActorRefs on the actors in the network, we will need to take some precautions when building the actors. (The ActorRef class has a private self type, so you can’t extend it outside of Akka. The ActorContext has some protected methods you cannot delegate to from outside of Akka either.) We can’t use the Actor methods directly because we want the web nodes to intercept messages. So the first thing to build is a little trait that isolates the actor from the ‘direct’ Akka API and provides methods for sending messages and creating actors:
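The real gist is not reproduced here, but the idea can be sketched in plain Scala with a stand-in Ref type instead of the real Akka ActorRef (the method names send and create are assumptions):

```scala
// Stand-in for an actor reference; in the real pattern these methods
// would delegate to the Akka API (`!` and context.actorOf).
class Ref(val name: String) {
  var received = List.empty[Any]
  def tell(msg: Any): Unit = received = msg :: received
}

// The actor's only door to the outside world: actors call these methods
// instead of the raw API, so a mixed-in web node can override them.
trait Node {
  def send(ref: Ref, msg: Any): Unit = ref.tell(msg) // plain pass-through
  def create(name: String): Ref = new Ref(name)      // plain pass-through
}
```

An actor that only ever talks through `send` and `create` behaves exactly as before, until a web node is stacked on top of these methods.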
This trait provides a couple of methods that we will use from our Actors instead of the direct Akka API. The above implementation just forwards to the API without any changes.
Now for the WebNode. It is built as a stackable trait which can be added to an Actor, as we will see a little later. Basically it replaces the receive partial function of an Actor with a receive that wraps the original receive of the actor. This is how we are going to plug into the actor, detect senders and receivers of messages, and keep track of the pathways through the system.
The WebNode overrides the receive PartialFunction of the Actor that it is stacked on, by first handling any requests that the web node should handle, and otherwise wrapping the original receive of the Actor with the wrappedReceive PartialFunction. The WebNode also extends the Node trait so that it can plug into the methods for sending and creating actors when these are used from the actor.
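The wrapping mechanism can be sketched without Akka (SimpleActor, Echo and the "spider-request" message are stand-ins I made up): a stackable trait uses abstract override to put its own behavior in front of the actor’s receive.

```scala
// Stand-in for an Akka Actor: just something with a receive.
trait SimpleActor { def receive: PartialFunction[Any, Unit] }

// Stackable trait: handle web-node requests itself, record every other
// message as input, then delegate to the original receive.
trait WebNode extends SimpleActor {
  val in = scala.collection.mutable.ListBuffer.empty[Any]
  def handleRequest: PartialFunction[Any, Unit] = { case ("spider-request", _) => () }
  abstract override def receive: PartialFunction[Any, Unit] = {
    val original = super.receive // the receive of the actor we are stacked on
    val wrapped: PartialFunction[Any, Unit] = { case msg => in += msg; original(msg) }
    handleRequest orElse wrapped
  }
}

class Echo extends SimpleActor {
  var last: Any = null
  def receive: PartialFunction[Any, Unit] = { case msg => last = msg }
}
```

Mixing in with `new Echo with WebNode` lets normal messages reach Echo while being recorded; spider requests never reach the actor at all.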
Let’s first look at wrappedReceive. What it basically does is record who the sender of the message was (the cause of the message), allow for before and after hooks, and call the super.receive method so that the Actor can do its work as usual. The cool thing about Akka is that if messages are forwarded through routers, the original sender is passed along, so we can even look through Akka routers or anything else that keeps the original sender. Both the input ActorRefs and the output ActorRefs of the actor are recorded for later. This works because ActorRefs are immutable and always point to the correct actor.
As you can see, the wrappedReceive only allows for normal processing when the request is not of a special type (a Tuple of Request and Spider), which we will get into later. Most importantly, the WebNode records the inputs and the outputs of the actor. The handleRequest makes it possible for the WebNode to handle special requests through the ActorRef of the Actor it is attached to. The WebNode[Data, Request] is generic and allows you to plug in any type of Request for which some type of Data needs to be returned.
Spiders have to be able to travel the WebNodes to collect data in the web. The WebNode handles (Request, Spider) messages for this purpose.
The handleRequest calls a collect method which you can implement to provide data back to the ‘spider home’. The sendSpiders method is called if some data could be collected for the request. The default implementation of sendSpiders just sends new spiders further along the actors in the network, using the pathways that were collected. (The check on the unique ID is there to prevent the same request from being handled more than once per node.)
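The fan-out plus the duplicate check can be sketched as a plain traversal (the web is modeled as a simple Map here; all names are illustrative):

```scala
// The web: for every node, the neighbours reachable over recorded
// pathways (in and out combined).
val web = Map(
  "a" -> Set("b", "c"),
  "b" -> Set("a", "d"),
  "c" -> Set("a"),
  "d" -> Set("b")
)

// A spider carries the set of nodes already visited; visiting a node
// "collects" its data to the home buffer and forwards the spider to
// every unvisited neighbour, so each node is handled exactly once.
def crawl(node: String, visited: Set[String],
          home: scala.collection.mutable.ListBuffer[String]): Set[String] =
  if (visited(node)) visited
  else {
    home += node
    web(node).foldLeft(visited + node)((seen, next) => crawl(next, seen, home))
  }
```

Starting at any single node reaches the whole web, which is exactly why one known ActorPath is enough to diagnose a whole flow.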
Next up are the Spider and the WebTrail. A Spider basically has a home ActorRef to which it can report the data it collects on the web. It also keeps track of the WebTrail it has followed on the web, so it does not go to the same actor twice. The WebNodeRef is a collection of references that gives all the information about a visited node in the actor network: the node itself, all the actors that provided input to the actor (in), and all the actors that the actor sent messages to (out).
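A sketch of this data model (the field names follow the description above; the actual gist may differ):

```scala
// What a spider reports about one visited node: the node itself, the
// actors it received messages from, and the actors it sent messages to.
case class WebNodeRef(node: String, in: List[String], out: List[String])

// The trail of nodes a spider has already visited, so it never crawls
// the same actor twice.
case class WebTrail(collected: Set[String] = Set.empty) {
  def visited(node: String): Boolean = collected(node)
  def visit(node: String): WebTrail = WebTrail(collected + node)
}
```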
So let’s see how we can build some diagnostics on top of this generic WebNode. The Diagnostics trait extends from the WebNode and overrides the sendSpiders method. Here it sends DiagnosticsData back to the spider home. The DiagnosticsData is a generic case class that can be used for many kinds of diagnostics. Let’s implement a timing diagnostic:
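The actual gist is not shown here, but the timing bookkeeping can be sketched in plain Scala (with an explicit clock parameter instead of System.nanoTime so the sketch is deterministic; MessageTimer and SomeMessage are names I chose):

```scala
import scala.collection.mutable

trait HasId { def id: Long }
case class SomeMessage(id: Long) extends HasId

// Records the time between the before and after hooks per message ID.
// No locking needed: an actor handles one message at a time, so before
// and after always come in pairs.
class MessageTimer {
  private val started = mutable.Map.empty[Long, Long]
  private val timings = mutable.Map.empty[Long, Long]
  def before(msg: HasId, now: Long): Unit = started(msg.id) = now
  def after(msg: HasId, now: Long): Unit =
    started.remove(msg.id).foreach(t0 => timings(msg.id) = now - t0)
  // What collect would hand back to the spider: (message ID, nanos taken).
  def collect(id: Long): Option[(Long, Long)] = timings.get(id).map(t => (id, t))
}
```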
The TimingDiagnostics makes some concrete choices in how the request for diagnostics should provide data back to the spider home. It uses the before and after hooks (such as diagnoseAfter) from the Diagnostics trait to keep some state about the time it took to process every message that has an ID, marked by the HasId trait. Since processing is sequential in an Actor, it is completely safe to expect the before and after hooks to happen in pairs. The collect method is implemented so that it returns a tuple of the ID of the message and the time that the handling of the message took (in nanoseconds, a detail that can be found in the Diagnostics trait, where now is implemented).
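The DiagnosticsData envelope that carries such a tuple home could look like this sketch (the field names are my guess, based on the output described later: the data, a timestamp, and the node reference):

```scala
case class WebNodeRef(node: String, in: List[String], out: List[String])

// Generic diagnostics envelope sent back to the spider home: whatever
// data was collected, when the diagnosis was taken, and which node in
// the web it came from.
case class DiagnosticsData[Data](data: Data, timestamp: Long, nodeRef: WebNodeRef)
```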
Let’s write a test to see how everything works. In the test we will use another handy trait to help with testing, a WireTap:
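The gist is not reproduced here; the shape of a WireTap can be sketched with a stand-in actor type (all names illustrative):

```scala
// Stand-in for an Akka Actor: just something with a receive.
trait SimpleActor { def receive: PartialFunction[Any, Unit] }

// Stackable WireTap: let the actor process the message first, then copy
// it to a listener, so a test can observe that processing finished.
trait WireTap extends SimpleActor {
  def listener: Any => Unit
  abstract override def receive: PartialFunction[Any, Unit] = {
    val original = super.receive
    val tap: PartialFunction[Any, Unit] = { case msg => original(msg); listener(msg) }
    tap
  }
}

class Printer extends SimpleActor {
  var printed = List.empty[Any]
  def receive: PartialFunction[Any, Unit] = { case msg => printed = msg :: printed }
}
```

A test can then mix in the tap with `new Printer with WireTap { val listener = ... }` and assert on what the listener saw.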
The WireTap is also a stackable trait, which basically sends the message that the actor received to a listener after the actor has finished with it. This is handy when you have an actor that does not send anything out, but you do want to know that it is finished. For the test we will build some very simple actors:
We’ve got two actors: a Transformer, which sends messages to a next actor, and a Printer, which prints to the screen. The Transformer does not use the Akka API directly but uses the Node trait. If the Transformer is used as-is, the Node trait sends messages using the Akka API as normal. Let’s see how we can now build an actor system and add some diagnostics to it.
The Transformer is stacked with a TimingDiagnostics. I’ve schematically shown the structure of the ActorSystem that we build in this test in the comments. As you can see, I’m using a broadcast router to check that the forwarding through routers also works. All the actors are fitted with the TimingDiagnostics trait. From there we send the first actor in the system a message. We can expect a message with the TestKit, since we have fitted the Printer with a WireTap. We give the testActor as a listener, which is how the TestKit receives the messages that the Printer processes. As you can see, you can stack the traits on top of each other (hence the name): we stack both the TimingDiagnostics and the WireTap on the Printer actor, combining the behaviors of both traits.
After a message is sent to the system, a Promise is created which is completed in an actor on the returnAddress (what I’ve been referring to as the spider home). Now that we have set up the home ActorRef that all diagnostics data can be sent to, we send a diagnostics request to one of the actors in the system that was fitted with a TimingDiagnostics WebNode. We wait for the future to complete in the spider home actor and check the results of the diagnostics data. (I’m cheating a little bit in the receive of the return address / ‘spider home’ actor, since I know that I should get 6 responses. In a real system you would just handle this as a stream of messages and work with what you have at a certain moment in time.)
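The spider-home cheat can be sketched with a plain Promise (the count of 6 comes from the test described above; everything else here is illustrative):

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// The spider home collects diagnostics responses and completes a
// Promise once the expected number of responses has arrived.
val expected = 6
val done = Promise[List[(Long, Long)]]()
var collected = List.empty[(Long, Long)]

def onDiagnostics(response: (Long, Long)): Unit = {
  collected = response :: collected
  if (collected.size == expected && !done.isCompleted) done.success(collected)
}

// Simulate six (id, nanos) responses arriving at the spider home.
(1 to 6).foreach(i => onDiagnostics((1L, i * 100L)))
val result = Await.result(done.future, 1.second)
```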
When you run the test, you would see some typical output like this: a SomeMessage with ID 1 was sent, and timing diagnostics were requested for this ‘event ID’. You can see the ID and the time in nanoseconds it took in the tuple, the timestamp when the diagnosis was taken, and a WebNodeRef for every actor in the message flow with event ID 1.
I must say I have really enjoyed all the posts in this Summer of Blog series. To follow the tradition set by Derek Wyatt, it’s time to kill some Actors! (I’m sure this will get me bonus points with some of the team members ;) Let’s build some venomous spiders that kill off any actors that are just too d*mn lazy:
Once again we build a trait that we can stack onto an Actor; in this case it provides some Venom. As you can see, the Venom WebNode responds to KillSlowActors requests. While the actors are running, it is (once again) registering the time the actors take. The collect method now checks the time, and if any actor takes longer than a second, he’s a goner. Polite as these venomous spiders are, they first send a message back to the spider home containing the actors they killed. Then they send the actor that they are mixed in with a PoisonPill. Sayonara!
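The kill decision in collect boils down to a threshold check like this sketch (one second expressed in nanoseconds, per the post; shouldKill is a name I made up):

```scala
// An actor whose recorded handling time exceeds one second (in nanos)
// gets a PoisonPill; anything faster, or never measured, survives.
val OneSecondInNanos = 1000000000L

def shouldKill(lastTimingNanos: Option[Long]): Boolean =
  lastTimingNanos.exists(_ > OneSecondInNanos)
```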
The Spider pattern provides a generic way to discover pathways and collect information about the messages and actors across actor systems. Since the ActorRefs are used throughout Akka, this also works for remote systems (which was not shown in this test; something for another day). There are many more use cases than I’ve shown here. You could collect data to find out which pathways are the most successful in a load-balanced system across many servers. You could create graphical representations of how messages flow through your system. You could trace the causes and the effects of actors. You could initiate a supervision strategy by forcing particular exceptions in the Actors that are connected to the WebNodes. All of the code is up on GitHub at github.com/RayRoestenburg. Check out the ‘com.ray.spider’ package for the examples shown here.
About the Author
Raymond Roestenburg works at Xebia as a software architect, polyglot programmer and software craftsman, specializing in Scala and Akka. He has enjoyed working with Akka since the early days of version 0.7, and has recently contributed to the rewrite of the Akka-Camel module that integrates Apache Camel and Akka Actors, which will be released in the Akka 2.1 release. Raymond is also the author of an upcoming book on Akka (working title ‘Akka in Action’) to be published by Manning Publications.