Skip to content

Add saga pattern support in Dapr workflow #47

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

skyao
Copy link
Member

@skyao skyao commented Dec 4, 2023

Related issue is #48

@berndverst
Copy link
Member

While this proposal sounds useful I would like to see the core implementation / common interface, API methods outlined in pseudocode.

It's very hard to interpret Java code and think through how this would work in Python for example.

Could you update the proposal to identify the generic structure that all SDKs must implement?

In this proposal we can then discuss whether this will make sense. After well we want to have a unified implementation across all SDKs, so we will want to agree on the same implementation strategy in this proposal.

artursouza
artursouza previously approved these changes Dec 5, 2023
Copy link
Contributor

@artursouza artursouza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please, get approval from other SDK maintainers. It is not a blocker for Java SDK implementation, but it might change based on how this proposal evolves - which is OK since the workflow SDK is version 0.x.

@artursouza
Copy link
Contributor

While this proposal sounds useful I would like to see the core implementation / common interface, API methods outlined in pseudocode.

It's very hard to interpret Java code and think through how this would work in Python for example.

Could you update the proposal to identify the generic structure that all SDKs must implement?

In this proposal we can then discuss whether this will make sense. After well we want to have a unified implementation across all SDKs, so we will want to agree on the same implementation strategy in this proposal.

I agree. I think the implementation detail will evolve as the Java implementation serves as a reference. I am not blocking the Java implementation waiting for this proposal to be merged but knowing the implementation can change (ok for version 0.x of the workflow SDK).

@skyao
Copy link
Member Author

skyao commented Dec 5, 2023

While this proposal sounds useful I would like to see the core implementation / common interface, API methods outlined in pseudocode.

It's very hard to interpret Java code and think through how this would work in Python for example.

Could you update the proposal to identify the generic structure that all SDKs must implement?

In this proposal we can then discuss whether this will make sense. After well we want to have a unified implementation across all SDKs, so we will want to agree on the same implementation strategy in this proposal.

Good suggestion, I will update it soon.

Copy link

@halspang halspang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main comment is about the registering of the activity and its compensation separately, I'd prefer to not do that if we can avoid it.

I am a bit worried about having to reimplement this across different SDKs though, specifically around how we do error handling/compensation triggering. If we can register the activity together with its compensation, is it possible to have the runtime trigger it? I think that makes the code cleaning and leaves less implementation details in the SDK.

Chatted with Chris a little bit and it seems like the above isn't really something the runtime is aware of atm in terms of which activity is executing. I'm just curious if we've thought about it/how much work it may be and if it's worth it. In an ideal world I'd love to see the runtime be able to walk back the entire workflow itself but I may be off in terms of what it's capable of right now.

@kaibocai
Copy link

kaibocai commented Dec 8, 2023

Thanks @skyao for this great proposal. I agree with @halspang's comments. Besides that, LGTM!
Just one more comment, in the proposed implementation, if the user enables Saga, we handle the compensation logic for them in the catch block for the SDK, what if the user has some specific logic when an exception happens, in other words, they need to do some work (mostly logging) before trigger compensation. It seems in the current proposal we don't provide a chance for them to do this. They can do it in their catch block but that is just opposed to the proposed solution.

@mukundansundar
Copy link
Contributor

mukundansundar commented Dec 9, 2023

Overall the proposal LGTM.

I agree with both @halspang's and @kaibocai's comments.
Providing an interface which can be implemented by an activity for adding the compensation feature, would be great.
Also moving the logic of calling compensation function into the workflow runtime will make the business logic code more clear.

Should we add another function like compensateOn(Predicate<T Result> result), where if an exception is not thrown, and the compensation must be triggered due to a null Result or some value in result, the Predicate function can be used?

Additionally, should a hook be provided so that users can look into the error/exception if thrown before the compensation is triggered i.e. onError() is called before compensate() is called?

@olitomlinson
Copy link

I would love to see first-class support for Sagas in Workflows across all languages.

FWIW I had the same idea a good few months ago (which was based on the same idea I had of using Sagas for Azure Durable Functions 2 years ago) so it's nice to see that you've arrived in a place along the same line as where I was thinking!

@shubham1172
Copy link
Member

The proposal LGTM overall, thanks @skyao. Agree with the comments here, specifically (1) not having to register a compensation separately, and (2) if we can offload certain things to runtime, to avoid having to duplicate this to all SDKs and maintain them.

@skyao
Copy link
Member Author

skyao commented Dec 14, 2023

I would love to see first-class support for Sagas in Workflows across all languages.

We'll start with java, and hopefully this proposal will be accepted soon so that the implementation of saga pattern in the dapr java sdk can be released in the dapr v1.13. Then I plan to add python and .net support in next dapr release v1.14.

Chris has agreed to do some optimization work in workflow for saga support, as you see, "first-class support".

This should all happen soon.

@cgillum
Copy link

cgillum commented Dec 15, 2023

Sharing my thoughts on the feedback received so far:

Main comment is about the registering of the activity and its compensation separately, I'd prefer to not do that if we can avoid it.

@halspang (and others) if I understand the concern correctly, this is about reducing boilerplate. While I sympathize with that, I worry about how this can make the saga pattern overly opinionated and less useful. For example, I don't think there will always be a 1:1 relationship between an activity and its compensation. Furthermore, the exact compensation strategy (which activity to call) and parameters may also need to be different for the same activity executed at different points in the workflow. I would prefer that we start by erring on the side of a loosely coupled design.

In an ideal world I'd love to see the runtime be able to walk back the entire workflow itself

I understand the appeal of this idea but have two major concerns:

  1. This would require non-trivial changes in the workflow runtime and I'm not convinced that the benefit justifies the cost at this point, especially given that this is a new feature that hasn't been developer tested. Though redundant, an SDK implementation will be cheaper to build, maintain, and change in the future based on user feedback.

  2. This could necessarily make the compensation model even more inflexible because the runtime will have to make specific assumptions about when and how to invoke compensation logic.

I'll also point out that the activity registration verbosity problem can be solved in other ways, such as what @mukundansundar has proposed in the Python SDK. In other words, we may be able to treat these concerns separately.

My overarching hope is that the saga implementation can be loosely coupled to the overall SDK logic, especially given that it's a brand-new thing that we want to get more real-world feedback on.

@cgillum
Copy link

cgillum commented Dec 15, 2023

if the user enables Saga, we handle the compensation logic for them in the catch block for the SDK, what if the user has some specific logic when an exception happens, in other words, they need to do some work (mostly logging) before trigger compensation. It seems in the current proposal we don't provide a chance for them to do this. They can do it in their catch block but that is just opposed to the proposed solution.

@kaibocai I agree that the ability to have some custom logic is important, which is partly why I strongly hesitate against tightly coupling an activity's execution to its compensation. In the proposed model, which is loosely coupled, developers can execute custom logic (e.g., logging) by catching exceptions, logging in the catch block, and then rethrowing the exception (or use manual compensation) to trigger the compensation. I don't think this approach is opposed to the proposed solution. I think the auto-compensation done by the SDK should be seen as a convenience feature.

@cgillum
Copy link

cgillum commented Dec 15, 2023

Additionally, should a hook be provided so that users can look into the error/exception if thrown before the compensation is triggered i.e. onError() is called before compensate() is called?

@mukundansundar similar to my previous responses, this isn't necessary with the current proposal because you can control if/when the compensation gets triggered. These hooks that you're suggesting are only required if we tightly couple an activity invocation with its compensation logic, which I'm arguing we shouldn't do because it creates inflexibility (and requires us to complicate the design with behavior customizing hooks, etc.).

@olitomlinson
Copy link

Having thought this through, I agree with @cgillum comments.

Tightly coupling the compensation action to the Activity is very opinionated.

Let's assume that given a customers use-case, the opinionated model worked fine for a while, but then the user wished to change the Workflow to do something different that doesn't follow the opinionated model, how would they 'break-out' and write a custom compensation just for that one Activity?

@dapr dapr deleted a comment from halspang Dec 18, 2023
@skyao
Copy link
Member Author

skyao commented Dec 18, 2023

@halspang I'm so sorry to delete a comment by mistaken.

This deleted note and the previous one said the same thing, and I replied to them together.

@DeepanshuA
Copy link

DeepanshuA commented Dec 21, 2023

Went through the proposal today. @skyao I think it's a great proposal, that would immensely benefit users.

I too agree with comments regarding keeping registration of compensation separate than activity registration.

@skyao I also wanted to understand a bit more on lines of how compensation will work in different scenarios i.e. An Activity A's compensation may need to be called, if workflow fails at an activity X's level, but Activity A's compensation may NOT be needed to be called if workflow fails at Activity Y's level.
Will there be some rules/ filters that need to be designed accordingly, as a part of infra here OR will user need to explicitly mention in code, depending on language specific error/exception semantics?

@skyao
Copy link
Member Author

skyao commented Jan 2, 2024

@skyao I also wanted to understand a bit more on lines of how compensation will work in different scenarios i.e. An Activity A's compensation may need to be called, if workflow fails at an activity X's level, but Activity A's compensation may NOT be needed to be called if workflow fails at Activity Y's level. Will there be some rules/ filters that need to be designed accordingly, as a part of infra here OR will user need to explicitly mention in code, depending on language specific error/exception semantics?

Currently we have not considered such a complex compensation logic; this judgment of whether to compensate is made across activities, and it requires some global data across activities. Of course, if the activiy x/y failed with output, it can be such a simple judgment:

Object output-x = ctx.callActivity("activity-x");
Object output-y = ctx.callActivity("activity-y");

......
Object output-a = ctx.callActivity("activity-a");
if (!output-x.isOK() && output-y.isOK) {
   ctx.registerCompensation("compensation-b")
}

But if the activiy x/y failed with exceptions, users have to do try/catch to let the workflow continue to execute activity-a when activity-x and activity-y are failed:

boolean isXFailed = true;
boolean isYFailed = true;
try {
    Object output-x = ctx.callActivity("activity-x");
    isXFailed = false;
} catch {...}
try{
    Object output-y = ctx.callActivity("activity-y");
    isYFailed = false;
} catch {...}

......
Object output-a = ctx.callActivity("activity-a");
if (isXFailed && !isYFailed) {
   ctx.registerCompensation("compensation-b")
}

I have to say that the flexibility of current proposal is very high, and users can always combine compensation strategies that meet their requirements.

From this point of view, I agree with Chris' suggestion: flexibility is more important.

@skyao
Copy link
Member Author

skyao commented Jan 2, 2024

While this proposal sounds useful I would like to see the core implementation / common interface, API methods outlined in pseudocode.

It's very hard to interpret Java code and think through how this would work in Python for example.

Could you update the proposal to identify the generic structure that all SDKs must implement?

In this proposal we can then discuss whether this will make sense. After well we want to have a unified implementation across all SDKs, so we will want to agree on the same implementation strategy in this proposal.

I think it can be done without another set of pseudo-code. After the java-sdk implementations are merged, I will contact the maintainers of the python sdk and .net sdk to come together and implement saga mode in the python sdk and .net sdk. I will then help the maintainers define the python and .net APIs and implementations directly and update them to this proposal.

I expect saga support for python-sdk and .net sdk to be available in dapr v1.14, and I promise I'll support it.

@artursouza
Copy link
Contributor

I agree with @cgillum to keep the activity and compensation loosely coupled to begin with. We can always add opinionated Facades on top if needed.

+1 binding

@skyao
Copy link
Member Author

skyao commented Mar 17, 2024

@artursouza Can we merge this PR now to add this proposal into main branch.

@joebowbeer
Copy link

joebowbeer commented Jun 26, 2024

Would it be helpful to add references to Saga support in other workflow tools?

In Temporal, for example, there is a helper class in Java but in other languages there is only sample code:

https://temporal.io/blog/saga-pattern-made-easy

In Java, the Saga class keeps track of compensations for you ... In other language SDKs you can easily write the addCompensation and compensate functions yourself.

(Followed by links to examples for Go, PHP, Python, and TypeScript, in addition to Java.)

In Golang there is no try/catch hell and defer is built into the language.

Even in Java, consider whether this should be part of the SDK or whether it should be part of a contrib library or even a DSL built on top of (one of) the Dapr SDKs.

Questions:

In a workflow-as-code solution such as Dapr, why can't different custom solutions and patterns be accomplished "in code" using the features of the programming language and its libraries?

Is Saga such a singularly useful pattern that it should to be built into every Dapr SDK?

@olitomlinson
Copy link

In a workflow-as-code solution such as Dapr, why can't different custom solutions and patterns be accomplished "in code" using the features of the programming language and its libraries? Is Saga such a singularly useful pattern that it should to be built into every Dapr SDK?

I agree. IMO Saga should be kept out of the main language SDK at this point in time.

@fkromer
Copy link

fkromer commented Mar 20, 2025

Just some comments from a potential Dapr user... I'm pretty new to and interested in Dapr. So far I've invested a handful hours.

Some issue specific comments:

  • @skyao Thanks a lot for this proposal to intiate caring about the important topic of distributed transactions in Dapr workflows using a saga pattern variant.
  • From the references in this issue it is not obvious to me what saga pattern variant Dapr is planning to support for workflows and how they relate to saga pattern variants most people will know about: Choreography-based or orchestration-based Saga?
  • I'd recommend a model based design first approach using (e.g. sequence) diagrams and pseudo SDK code. This has several advantages (problem to solve fully understood, bigger chance that initial implementation will not block future use cases, easier to collaborate/communicate, higher chance that Dapr will have cross-language kind of SDKs, diagrams can be used for docs, ...).

Some generic comments about distributed transaction patterns in Dapr:

  • Making distributed transactions as easy as possible would be on of the biggest additional values Dapr could provide to users.
  • Have you considered the 2 Phase-Commit (2PC) pattern in addition to a saga pattern variant? E.g. in some workflows people could rate consistency, simplicity and atomicity higher than scalability and availability.
  • According to non-Dapr/non-Diagrid web resources it seems like it's possible to implement saga pattern variants choreography-based and orchestration-based Saga in Dapr with actors. Those options do not abstract away the challenges w.r.t. those patterns however. They are just about Dapr framework specific implementations.
  • It would be helpful if the Dapr docs could relate to microservices patterns in a way similiar to how
    the Diagrid blog post Dapr as the Ultimate Microservices Patterns Framework tries to do (the blog post is misleading cause Dapr does not support the saga pattern in worfklows yet). Centralized docs would avoid misinformation and would make it way easier for potential users to evaluate Dapr and convince them to use Dapr.

@olitomlinson
Copy link

@fkromer

the blog post is misleading cause Dapr does not support the saga pattern in worfklows yet

The blog post states :

Dapr offers a concrete solution to implement the saga pattern through it's workflow API

Dapr doesn't ship with it's own opinionated Saga Implementation, but you absolutely can use Workflows to implement your own Saga model. The blog post is accurate IMO.

@joebowbeer
Copy link

@fkromer this is a proposal to add a saga helper to workflows, which in practice means making the compensating actions explicit, however, I have questions about the utility of this.

I think the question re. choreography vs orchestration is lost on this audience 😄

https://www.diagrid.io/blog/choreography-or-orchestration-that-is-the-wrong-question

@fkromer
Copy link

fkromer commented Mar 20, 2025

@olitomlinson

Dapr doesn't ship with it's own opinionated Saga Implementation, but you absolutely can use Workflows to implement your own Saga model. The blog post is accurate IMO.

Absolutely. My comment was inaccurate. What I actually mean is that Dapr does not significantly help in implementing a saga pattern from a framework/abstraction point of view yet. Actually that's what this issue is all about, right?

@fkromer
Copy link

fkromer commented Mar 20, 2025

@joebowbeer

Probably worth to point out in the first place: I don't want to fuel a framework x is better than framework y discussion. Both frameworks are great.

https://temporal.io/blog/saga-pattern-made-easy

The blog post summarizes pretty clearly what a developer is looking for: Enabling implementing a saga pattern in code comparatively trivial by freeing the developer from everything than to write compensations for workflow steps. Usually one needs explicit compensations for every single workflow step. Freeing includes to care about keeping track of state w.r.t. workflow state machine (saving state to know where to recover from in the face of failure) and idempotency. Temporal uses "resilient" orchestration based sagas. This opinionated choice is not ideal I think (see below). Is it better than nothing?

In Golang there is no try/catch hell and defer is built into the language.

Even in Java, consider whether this should be part of the SDK or whether it should be part of a contrib library or even a DSL built on top of (one of) the Dapr SDKs.

Questions:

In a workflow-as-code solution such as Dapr, why can't different custom solutions and patterns be accomplished "in code" using the features of the programming language and its libraries?

Is Saga such a singularly useful pattern that it should to be built into every Dapr SDK?

I think the question re. choreography vs orchestration is lost on this audience 😄

https://www.diagrid.io/blog/choreography-or-orchestration-that-is-the-wrong-question

"Choreography, operating without centralized control, is particularly effective for communication between bounded contexts." ... "Conversely, orchestration provides a structured approach within a bounded context, ensuring a controlled sequence of service interactions focused on state maintenance and error handling."

I agree with the article that the pattern variant to use should depend on the use case.

"Workflow: This API facilitates the definition and management of workflows in a distributed environment, ensuring harmonious coordination of multiple services through workflow patterns available such as task chaining, branching fan-out/fan-in and async calls to other services." ... "Dapr delivers unified APIs for service communication and infrastructure consumption."

If Dapr does not want to support distributed transactions with saga patterns in the scope of workflows this is totally fine. If Dapr aims at helping in implementing the saga pattern in the context of workflows it makes probably sense to

  • support both saga pattern variants (or even other distributed transaction patterns like 2PC in case it's most suitable for some use cases to be not opinionated) and
  • at an as high level as possible.

What's needed from a DSL point of view?

  • create a "Dapr workflow" (saga pattern "business transaction") + configuration of distributed transaction pattern (choreography based saga/orchestration based saga),
  • create "Dapr activities" (saga pattern "local transactions"),
  • assign "Dapr activities" to "Dapr workflow",
  • create "Dapr compensating activities" (saga pattern "compensations"),
  • assign "Dapr compensating activities" to corresponding "Dapr activities".
  • StartToCloseTimeout, RetryPolicy, idempotency key (uniquely identify a particular transaction and ensure a transaction occurs effectively once), ... ?

It could make sense to have a look at other frameworks (Temporal, MassTransit Saga State Machines, Eventuate Tram Sagas) how they domain model saga patterns. In an ideal world it would be possible to abstract away pattern implementation details behind the Workflow API (representing the Workflow building block). But in detail it could be pretty challenging I guess (you know way better than I do). As long as such saga pattern domain modeling questions are not answered on the DSL level it's probably best to provide examples and docs about how one can implement those patterns with Dapr instead of providing potentially incomplete, orchestration based saga helpers for Java only.

@joebowbeer
Copy link

joebowbeer commented Mar 20, 2025

@fkromer I think Saga can make sense in more structured (constrained) frameworks, e.g., step functions, but not in a code-first framework like Dapr's Workflows. Saga is a simple idea whereas real-world workflows get complicated. The strength of code-first workflows is their ability to handle complexity.

The dichotomy of choreography vs orchestration and the Saga pattern predate the new emerging landscape of Durable Computing (aka workflows)

https://www.golem.cloud/post/the-emerging-landscape-of-durable-computing

I think the only reason to even mention choreography and orchestration and Saga now are because they might be familiar to people approaching workflows from other places.

I think Bernd Rücker covers this well in his articles and book:

https://www.berndruecker.io/books/

@fkromer
Copy link

fkromer commented Mar 25, 2025

@joebowbeer

I think Saga can make sense in more structured (constrained) frameworks, e.g., step functions, but not in a code-first framework like Dapr's Workflows. Saga is a simple idea whereas real-world workflows get complicated. The strength of code-first workflows is their ability to handle complexity.

Does this really relate to the question of code-first vs. structured frameworks? Saga pattern variants are used to implement rather linear workflows with full transactionality, achieving at least once semantics with rollback on error in code. That's what people use in traditional microservice architectures if applying Domain Driven Design (DDD) for example. It's probably more a question about how a framework interprets and wants to interpret "workflows" in the future. According to workflow patterns Dapr interprets workflows pretty generic. For example w.r.t. fan-in/fan-out (data pipeline use case) there is no need for full transactionality in most use cases I guess. For many task chaining use cases there is need for. The example includes a compensation in the workflow scope. For many users it would be sufficient to add a task chaining code example with compensations on the activity level (instead of on the workflow level). People with a DDD background could be irritated cause workflows are app-scoped (no support for workflows spanning several bounded contexts). This limitation of usability of workflows in the DDD context can be compensated in a flexible manner in code with whatever pattern for distributed transactions (saga, 2pc, ...) fits an users use case best. Other frameworks add examples to their docs in such cases.

The dichotomy of choreography vs orchestration and the Saga pattern predate the new emerging landscape of Durable Computing (aka workflows)

https://www.golem.cloud/post/the-emerging-landscape-of-durable-computing

I think the only reason to even mention choreography and orchestration and Saga now are because they might be familiar to people approaching workflows from other places.

Absolutely agree. From a DSL point of view the pattern used to enable fully transactional workflow rollbacks is rather an implementation detail (implementation variants still having different pros and cons). Thanks for the hint. It's pretty interesting to learn about durable computing.

Would be interesting how Dapr project maintainers would fit Dapr into the comparison table

https://www.golem.cloud/post/the-emerging-landscape-of-durable-computing

I think Bernd Rücker covers this well in his articles and book:

https://www.berndruecker.io/books/

He (Camunda platform) interprets workflows from a BPMN perspective. Means a business process oriented workflow interpretation similar to task chaining but potentially way more complex. W.r.t. to error handling in the BPMN context there are compensation events and error events. Beyond the scope of this issue. Still interesting probably (examples in the docs how Dapr can be used to model BPMN diagrams/workflows). The books looks interesting. I hope I'll find some time to have a deeper look into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.