Understanding Saga Pattern

Yesterday evening, while casually discussing some technologies, I came across a pattern that was new to me, called the 'Saga Pattern'. I read up on it that night and decided to write down my understanding for my own reference.
This pattern is mainly used in Microservice implementations. If you are about to implement a transaction that spans multiple Microservices, the Saga pattern helps you solve the problems you will face.

Suppose you are building a travel booking app using Microservices:


In the above diagram, a high-level business action (‘Booking a trip’) involves making several low-level actions to distinct microservices.

To handle client requests, we create a single-purpose edge service (a Backend for Frontend) that contains the logic for composing calls to all the downstream services. At its core, this travel orchestrator, the Travel Agent service, exposes APIs by composing core functionality provided by different Microservices.

We all know that transactions are an essential part of software applications. Without them, it would be impossible to maintain data consistency. However, Microservices don't have a single source of truth. State is spread across distinct services, each with its own data store.

If all of the service calls complete successfully, great! But in the real world, failures occur on a regular basis. How do we handle partial executions - when a subset of requests in the high-level action fails?

With the Microservices approach, you can’t just book a flight, a car, and a hotel in a single ACID transaction. To do it consistently, you would be required to create a distributed transaction.

Let's think about a scenario: if the flight reservation failed, would you like to keep the hotel and car? At least I would not.

In this situation, we would need to implement some ad-hoc concurrency control logic in the edge Travel Agent service to handle any potential failures and try to recover from them in some way. (Do we cancel the other bookings? What if the flight booking is retried and succeeds later on? How do we know what state we are in?)

Without some kind of concurrency control mechanism, we risk having inconsistent data in our application - which is especially bad in distributed systems. Eventually, this control logic can get very complicated. Dealing with partial failures and asynchrony is hard… And that’s where distributed sagas come in.

In distributed systems, business transactions spanning multiple services require a mechanism to ensure data consistency across services. The Distributed Saga pattern is a pattern for managing failures, where each action has a compensating action for rollback. Distributed Sagas help ensure consistency and correctness across microservices.

A Request has a corresponding Compensating Request that semantically undoes the Request. CancelHotel undoes BookHotel, CancelFlight undoes BookFlight and so on.

Note that certain actions are not undo-able in the conventional sense. An email that was sent to the wrong recipient cannot be un-sent. However, we can semantically undo the action by sending another email that says ‘Sorry, please ignore the previous email.’

A Compensating Request semantically undoes a Request by restoring the application’s state to the state of equilibrium it was in before the Request was made.
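
The pairing of a Request with its Compensating Request can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the SagaStep class, the book and cancel helpers, and the dictionary used as application state are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical in-memory application state: which resources a trip has booked.
State = Dict[str, bool]


@dataclass(frozen=True)
class SagaStep:
    """Pairs a Request with the Compensating Request that semantically undoes it."""
    name: str
    request: Callable[[State], None]
    compensate: Callable[[State], None]


def book(resource: str) -> Callable[[State], None]:
    def _book(state: State) -> None:
        state[resource] = True  # e.g. BookHotel
    return _book


def cancel(resource: str) -> Callable[[State], None]:
    def _cancel(state: State) -> None:
        state.pop(resource, None)  # e.g. CancelHotel; harmless if nothing was booked
    return _cancel


TRIP_STEPS = [
    SagaStep("hotel", book("hotel"), cancel("hotel")),
    SagaStep("car", book("car"), cancel("car")),
    SagaStep("flight", book("flight"), cancel("flight")),
]

state: State = {}
TRIP_STEPS[0].request(state)     # BookHotel
TRIP_STEPS[0].compensate(state)  # CancelHotel restores the original state
```

After the last two calls the state dictionary is empty again, which is what "restoring the state of equilibrium" means in this toy model.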

The section below is borrowed from one of the articles I read somewhere...

Distributed Saga Guarantee
Amazingly, a distributed saga guarantees one of the following two outcomes:

Either all Requests in the Saga are successfully completed, or
A subset of Requests and their Compensating Requests are executed.

The catch is that, for distributed sagas to work, both Requests and Compensating Requests need to have certain characteristics:

Requests and Compensating Requests must be idempotent, because the same message may be delivered more than once. However many times the same idempotent request is sent, the resulting outcome must be the same. An example of an idempotent operation is an UPDATE that sets a field to a fixed value. An example of an operation that is NOT idempotent is a CREATE operation that generates a new id every time.
Compensating Requests must be commutative, because messages can arrive out of order. In the context of a distributed saga, it’s possible that a Compensating Request arrives before its corresponding Request. If a BookHotel completes after CancelHotel, we should still arrive at a cancelled hotel booking (not re-create the booking!).
Requests can abort, which triggers a Compensating Request. Compensating Requests CANNOT abort; they have to execute to completion no matter what.
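
To make the first two characteristics concrete for myself, here is a minimal sketch of a hotel service whose BookHotel and CancelHotel handlers are idempotent and commutative. Everything in it is an assumption for illustration: the HotelService class, the caller-supplied saga id used as the key, and the in-memory dictionary standing in for a real data store.

```python
from typing import Dict, Set


class HotelService:
    """Hypothetical in-memory hotel service keyed by a caller-supplied saga id."""

    def __init__(self) -> None:
        self.bookings: Dict[str, str] = {}  # saga_id -> room
        self.cancelled: Set[str] = set()    # saga ids that have been cancelled

    def book_hotel(self, saga_id: str, room: str) -> str:
        # Commutative with cancel: if CancelHotel already arrived for this
        # saga, a late BookHotel must not re-create the booking.
        if saga_id in self.cancelled:
            return "cancelled"
        # Idempotent: the saga id is the key, so a redelivered message
        # overwrites the same entry instead of creating a second booking.
        self.bookings[saga_id] = room
        return "booked"

    def cancel_hotel(self, saga_id: str) -> str:
        # Idempotent and never aborts: cancelling an unknown or already
        # cancelled booking still counts as success.
        self.cancelled.add(saga_id)
        self.bookings.pop(saga_id, None)
        return "cancelled"


service = HotelService()

# The same BookHotel message delivered twice leaves exactly one booking.
service.book_hotel("trip-1", "room-12")
service.book_hotel("trip-1", "room-12")

# CancelHotel arrives before its BookHotel: the booking stays cancelled.
service.cancel_hotel("trip-2")
late_result = service.book_hotel("trip-2", "room-7")
```

The design choice that matters here is remembering cancellations (a kind of tombstone), so the late BookHotel for "trip-2" is ignored rather than applied.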

Distributed Saga Implementation Approaches
There are a couple of different ways to implement a Saga transaction, but the two most popular are:

Event-driven choreography: There is no central coordination; each service produces events, listens to other services’ events, and decides whether an action should be taken or not.
Command/Orchestration: A coordinator service is responsible for centralizing the saga’s decision making and sequencing business logic.
In this guide, we’ll look at the latter. With the orchestration approach, we define a new Saga Execution Coordinator service whose sole responsibility is to manage a workflow and invoke downstream services when it needs to.

Saga Execution Coordinator
The Saga Execution Coordinator is an orchestration service that:

Stores & interprets a Saga’s state machine
Executes the Requests of a Saga by talking to other services
Handles failure recovery by executing Compensating Requests
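
The three responsibilities above can be sketched as a tiny in-process coordinator. This is a simplification under my own assumptions, not a production design: the SagaExecutionCoordinator class and its methods are hypothetical, a plain list stands in for a durable saga log, and the Compensating Requests are assumed to succeed (a real coordinator would retry them until they do).

```python
from typing import Callable, List, Tuple

# (name, Request, Compensating Request); all three are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class SagaExecutionCoordinator:
    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps
        self.log: List[str] = []  # stand-in for a durable saga log

    def run(self) -> bool:
        """Run every Request in order; on failure, compensate and return False."""
        started: List[Step] = []
        for step in self.steps:
            name, request, _ = step
            self.log.append(f"start {name}")
            started.append(step)
            try:
                request()
                self.log.append(f"end {name}")
            except Exception:
                self.log.append(f"abort {name}")
                self._compensate(started)
                return False
        return True

    def _compensate(self, started: List[Step]) -> None:
        # Compensate every started step in reverse order, including the one
        # that failed, because it may have been partially applied.
        for name, _, compensate in reversed(started):
            compensate()
            self.log.append(f"compensated {name}")


bookings = set()


def book_flight() -> None:
    raise RuntimeError("no seats left")  # simulated failure


sec = SagaExecutionCoordinator([
    ("hotel", lambda: bookings.add("hotel"), lambda: bookings.discard("hotel")),
    ("car", lambda: bookings.add("car"), lambda: bookings.discard("car")),
    ("flight", book_flight, lambda: bookings.discard("flight")),
])
ok = sec.run()
```

In this run the flight Request fails, so the coordinator cancels the flight, car and hotel in that order and no bookings remain. Compensating the failed step as well only works because Compensating Requests are idempotent and commutative.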

In Closing
I learned what Distributed Sagas are (a pattern for handling failure in Microservices) and how they address the problem of distributed transactions. They help ensure correctness and consistency across Microservices.

I also learned about Saga Execution Coordinators. One way to get started with defining their state machines is the Amazon States Language with AWS Step Functions.

Thank you for reading!
