Long Running Processes with Event Sourcing and CQRS
In this post I want to go over what I learned about long running processes when building a project using Event Sourcing and Command Query Responsibility Segregation (CQRS) pattern.
Update: In an attempt to remove confusion, I have reverted to using the proper term “Process Manager” instead of “Saga”. The latter, should be used for something else.
Introduction
In this post we will look at how to implement a long running process in the context of Event Sourcing and CQRS. We will make use of a Process Manager and explain how it solves some common problems with long running processes.
The Problem
How do you handle a long running process when using Event Sourcing?
Example: You would like to run a complex data mining algorithm that can potentially take several minutes, if not hours.
The problem is more complex than at first seems. All of the following concerns have to be addressed:
- How does a long running process fit into the model of CQRS and Event Sourcing?
- How do you handle failure of a long running process?
- How do you handle recovery from system failure? (e.g. power failure)
The Tools
Before we go on to the solution, let’s first define what CQRS and Process Manager are.
CQRS
Command Query Responsibility Segregation is a principle that states that a system should be split in two: one responsible for accepting commands and another for accepting queries.
Typical CQRS architecture is shown below.
The important things to note are:
- The command side is responsible only for validating commands and executing them on an aggregate.
- Aggregate is responsible for keeping a consistent system state.
- All changes to an aggregate are expressed as events.
- Command side can only reply with ok or error, as it doesn’t know anything about the actual events that will be generated by aggregates.
- All events that are generated by the aggregates eventually propagate, via the bus, to the query side.
- Query side is responsible only for accepting user queries and executing them against some storage.
- Query side storage is completely determined by the events it receives from the command side.
Process Manager
The definition of the Process Manager from the system’s point of view is simple:
Process Manager is a machine that takes events as input and produces commands as output.
We must understand that how it obtains those events and produces the commands is irrelevant here. Some of the ways a process manager can obtain events might be: from the central event bus, from another system, from the newscast on TV or from the relational database. What is relevant is that the Process Manager cannot reject those events, as they represent a fact, something that happened. Some of the ways in which a Process Manager can produce commands are: issue a command to an aggregate, actuate some physical device (like an alarm or a pump), activate a loudspeaker and play an audible command.
Now that the concept of a Process Manager is clear we can proceed to examine where it fits in a typical CQRS architecture.
Process Manager and the CQRS
The following diagram depicts a CQRS system with a Process Manager.
Since Process Manager needs to be able to issue commands, it has to have access to the Command Side.
- Because it is responsible for reacting to certain events, it must have access to the Bus.
- Finally, it needs access to the Query Side to restore its own state.
The last requirement— access to the Query Side — will become clearer as we go through the solutions to the problems we outlined in the beginning of this post.
Process Manager serves a different role than an aggregate or a projection. I find the best way to think of it is as a projection with private state that can issue commands.
The Solution
Employ a Process Manager that recovers its state from the query side and issues commands to the command side.
In more detail, let us examine how Process Manager solves each problem that we are faced with when encountering long running processes.
1. How does a long running process fit into the model of CQRS and Event Sourcing?
Long running process can be implemented using a component that reacts to events to issue one or more commands, which in turn may produce more events. This component is the Process Manager.
Let us consider an example of a fictitious nuclear reactor:
- News broadcast is received stating that an earthquake is imminent.
- An event is stored in the system of type earthquake_warning_received.
- Process Manager receives the earthquake_warning_received event and issues a command to turn off the nuclear reactor.
- Something goes wrong with a Reactor #4, and an event of type reactor_shutdown_malfunction(4) is logged.
- Process Manager receives the reactor_shutdown_malfunction(4) event and issues a command to send an email message to the system engineer.
- System engineer manually overrides Reactor #4 and manages to shut it down.
- An event of type reactor_down(4) is produced in the system.
- Process Manager receives the reactor_down(4) event and does not issue any further commands.
In this case we can see that Process Manager does not really care about anything else but the events that it receives. It does not try to be smart about issuing commands, but rather does something and waits to observe its effect.
In the model of CQRS, our long running process is represented by the Process Manager receiving multiple events and issuing multiple commands, until it settles at some steady state. Event Sourcing serves to record all the effects of commands in the system.
2. How do you handle failure of a long running process?
Failure of a long running process is to be treated as part of the process itself.
Process Manager does not know what failure is, as failure is just another event in the long series of other events, and results in zero or more commands.
As we saw in the example above, the failure to shut down a certain reactor resulted in a new event in the system, which was later itself handled by the Process Manager. Process Manager handled that failure by issuing yet another command.
3. How do you handle recovery from system failure? (e.g. power failure)
Long running process should treat system recovery and startup as equivalent processes: system should be restored to the last known state and continue to operate normally.
When a system experiences unexpected failure, the startup procedure should be no different than from “expected” failure (if such a thing exists). Process Manager should read the last known state from the Query side and start issuing commands if necessary.
Taking our nuclear reactor example above, imagine that the computer responsible for nuclear reactor experiences power failure and reboots right after Step #3:
Process Manager receives the earthquake_warning_received event and issues a command to turn off the nuclear reactor.
While the computer is rebooting, we miss the following event:
Something goes wrong with a Reactor #4, and an event of type reactor_shutdown_malfunction(4) is logged.
The recovery procedure should be as follows:
- Inspect the state of the reactor for failures.
- See that the Reactor #4 has malfunctioned during a shutdown… issue a command to send an email message to the system engineer.
- System engineer manually overrides Reactor #4 and manages to shut it down.
- An event of type reactor_down(4) is produced in the system.
- Process Manager receives the reactor_down(4) event and does not issue any further commands.
We can see that steps 3–5 are the same as steps 6–8 in the original example. So it is that we brought about the same results despite experiencing system failure.
Bonus
You can also recover the state of the Process Manager from events in the Event Store.
Conclusion
Long running processes can be implemented using Process Managers when designing your systems with CQRS and Event Sourcing. Process Managers react to events and issue commands. The state of the Process Manager can be recovered from the projections or from event store itself. Once recovered, Process Manager can continue its normal functionality. System recovery and startup should be treated the same.
Good luck with your system building!
References
- Jonathan Oliver on Sagas: http://blog.jonathanoliver.com/cqrs-sagas-with-event-sourcing-part-i-of-ii/
- George Candea et. al., Crash-Only Software, 2003, Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems. https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf
- Andreas Knöpfel et. al., Fundamental Modeling Concepts: Effective Communication of IT Systems, 2006.