-
Notifications
You must be signed in to change notification settings - Fork 39
Add saga pattern support in Dapr workflow #47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Sky Ao <[email protected]>
While this proposal sounds useful I would like to see the core implementation / common interface, API methods outlined in pseudocode. It's very hard to interpret Java code and think through how this would work in Python for example. Could you update the proposal to identify the generic structure that all SDKs must implement? In this proposal we can then discuss whether this will make sense. After well we want to have a unified implementation across all SDKs, so we will want to agree on the same implementation strategy in this proposal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please, get approval from other SDK maintainers. It is not a blocker for Java SDK implementation, but it might change based on how this proposal evolves - which is OK since the workflow SDK is version 0.x.
I agree. I think the implementation detail will evolve as the Java implementation serves as a reference. I am not blocking the Java implementation waiting for this proposal to be merged but knowing the implementation can change (ok for version 0.x of the workflow SDK). |
Good suggestion, I will update it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main comment is about the registering of the activity and its compensation separately, I'd prefer to not do that if we can avoid it.
I am a bit worried about having to reimplement this across different SDKs though, specifically around how we do error handling/compensation triggering. If we can register the activity together with its compensation, is it possible to have the runtime trigger it? I think that makes the code cleaning and leaves less implementation details in the SDK.
Chatted with Chris a little bit and it seems like the above isn't really something the runtime is aware of atm in terms of which activity is executing. I'm just curious if we've thought about it/how much work it may be and if it's worth it. In an ideal world I'd love to see the runtime be able to walk back the entire workflow itself but I may be off in terms of what it's capable of right now.
Thanks @skyao for this great proposal. I agree with @halspang's comments. Besides that, LGTM! |
Overall the proposal LGTM. I agree with both @halspang's and @kaibocai's comments. Should we add another function like Additionally, should a hook be provided so that users can look into the error/exception if thrown before the compensation is triggered i.e. |
I would love to see first-class support for Sagas in Workflows across all languages. FWIW I had the same idea a good few months ago (which was based on the same idea I had of using Sagas for Azure Durable Functions 2 years ago) so it's nice to see that you've arrived in a place along the same line as where I was thinking! |
The proposal LGTM overall, thanks @skyao. Agree with the comments here, specifically (1) not having to register a compensation separately, and (2) if we can offload certain things to runtime, to avoid having to duplicate this to all SDKs and maintain them. |
We'll start with java, and hopefully this proposal will be accepted soon so that the implementation of saga pattern in the dapr java sdk can be released in the dapr v1.13. Then I plan to add python and .net support in next dapr release v1.14. Chris has agreed to do some optimization work in workflow for saga support, as you see, "first-class support". This should all happen soon. |
Sharing my thoughts on the feedback received so far:
@halspang (and others) if I understand the concern correctly, this is about reducing boilerplate. While I sympathize with that, I worry about how this can make the saga pattern overly opinionated and less useful. For example, I don't think there will always be a 1:1 relationship between an activity and its compensation. Furthermore, the exact compensation strategy (which activity to call) and parameters may also need to be different for the same activity executed at different points in the workflow. I would prefer that we start by erring on the side of a loosely coupled design.
I understand the appeal of this idea but have two major concerns:
I'll also point out that the activity registration verbosity problem can be solved in other ways, such as what @mukundansundar has proposed in the Python SDK. In other words, we may be able to treat these concerns separately. My overarching hope is that the saga implementation can be loosely coupled to the overall SDK logic, especially given that it's a brand-new thing that we want to get more real-world feedback on. |
@kaibocai I agree that the ability to have some custom logic is important, which is partly why I strongly hesitate against tightly coupling an activity's execution to its compensation. In the proposed model, which is loosely coupled, developers can execute custom logic (e.g., logging) by catching exceptions, logging in the catch block, and then rethrowing the exception (or use manual compensation) to trigger the compensation. I don't think this approach is opposed to the proposed solution. I think the auto-compensation done by the SDK should be seen as a convenience feature. |
@mukundansundar similar to my previous responses, this isn't necessary with the current proposal because you can control if/when the compensation gets triggered. These hooks that you're suggesting are only required if we tightly couple an activity invocation with its compensation logic, which I'm arguing we shouldn't do because it creates inflexibility (and requires us to complicate the design with behavior customizing hooks, etc.). |
Having thought this through, I agree with @cgillum comments. Tightly coupling the compensation action to the Activity is very opinionated. Let's assume that given a customers use-case, the opinionated model worked fine for a while, but then the user wished to change the Workflow to do something different that doesn't follow the opinionated model, how would they 'break-out' and write a custom compensation just for that one Activity? |
@halspang I'm so sorry to delete a comment by mistaken. This deleted note and the previous one said the same thing, and I replied to them together. |
Signed-off-by: Sky Ao <[email protected]>
Went through the proposal today. @skyao I think it's a great proposal, that would immensely benefit users. I too agree with comments regarding keeping registration of compensation separate than activity registration. @skyao I also wanted to understand a bit more on lines of how compensation will work in different scenarios i.e. An Activity A's compensation may need to be called, if workflow fails at an activity X's level, but Activity A's compensation may NOT be needed to be called if workflow fails at Activity Y's level. |
Currently we have not considered such a complex compensation logic; this judgment of whether to compensate is made across activities, and it requires some global data across activities. Of course, if the activiy x/y failed with output, it can be such a simple judgment: Object output-x = ctx.callActivity("activity-x");
Object output-y = ctx.callActivity("activity-y");
......
Object output-a = ctx.callActivity("activity-a");
if (!output-x.isOK() && output-y.isOK) {
ctx.registerCompensation("compensation-b")
} But if the activiy x/y failed with exceptions, users have to do try/catch to let the workflow continue to execute activity-a when activity-x and activity-y are failed: boolean isXFailed = true;
boolean isYFailed = true;
try {
Object output-x = ctx.callActivity("activity-x");
isXFailed = false;
} catch {...}
try{
Object output-y = ctx.callActivity("activity-y");
isYFailed = false;
} catch {...}
......
Object output-a = ctx.callActivity("activity-a");
if (isXFailed && !isYFailed) {
ctx.registerCompensation("compensation-b")
} I have to say that the flexibility of current proposal is very high, and users can always combine compensation strategies that meet their requirements. From this point of view, I agree with Chris' suggestion: flexibility is more important. |
I think it can be done without another set of pseudo-code. After the java-sdk implementations are merged, I will contact the maintainers of the python sdk and .net sdk to come together and implement saga mode in the python sdk and .net sdk. I will then help the maintainers define the python and .net APIs and implementations directly and update them to this proposal. I expect saga support for python-sdk and .net sdk to be available in dapr v1.14, and I promise I'll support it. |
I agree with @cgillum to keep the activity and compensation loosely coupled to begin with. We can always add opinionated Facades on top if needed. +1 binding |
@artursouza Can we merge this PR now to add this proposal into main branch. |
Would it be helpful to add references to Saga support in other workflow tools? In Temporal, for example, there is a helper class in Java but in other languages there is only sample code: https://temporal.io/blog/saga-pattern-made-easy
(Followed by links to examples for Go, PHP, Python, and TypeScript, in addition to Java.) In Golang there is no try/catch hell and Even in Java, consider whether this should be part of the SDK or whether it should be part of a contrib library or even a DSL built on top of (one of) the Dapr SDKs. Questions: In a workflow-as-code solution such as Dapr, why can't different custom solutions and patterns be accomplished "in code" using the features of the programming language and its libraries? Is Saga such a singularly useful pattern that it should to be built into every Dapr SDK? |
I agree. IMO Saga should be kept out of the main language SDK at this point in time. |
Just some comments from a potential Dapr user... I'm pretty new to and interested in Dapr. So far I've invested a handful hours. Some issue specific comments:
Some generic comments about distributed transaction patterns in Dapr:
|
The blog post states :
Dapr doesn't ship with it's own opinionated Saga Implementation, but you absolutely can use Workflows to implement your own Saga model. The blog post is accurate IMO. |
@fkromer this is a proposal to add a saga helper to workflows, which in practice means making the compensating actions explicit, however, I have questions about the utility of this. I think the question re. choreography vs orchestration is lost on this audience 😄 https://www.diagrid.io/blog/choreography-or-orchestration-that-is-the-wrong-question |
Absolutely. My comment was inaccurate. What I actually mean is that Dapr does not significantly help in implementing a saga pattern from a framework/abstraction point of view yet. Actually that's what this issue is all about, right? |
Probably worth to point out in the first place: I don't want to fuel a framework x is better than framework y discussion. Both frameworks are great. The blog post summarizes pretty clearly what a developer is looking for: Enabling implementing a saga pattern in code comparatively trivial by freeing the developer from everything than to write compensations for workflow steps. Usually one needs explicit compensations for every single workflow step. Freeing includes to care about keeping track of state w.r.t. workflow state machine (saving state to know where to recover from in the face of failure) and idempotency. Temporal uses "resilient" orchestration based sagas. This opinionated choice is not ideal I think (see below). Is it better than nothing?
"Choreography, operating without centralized control, is particularly effective for communication between bounded contexts." ... "Conversely, orchestration provides a structured approach within a bounded context, ensuring a controlled sequence of service interactions focused on state maintenance and error handling." I agree with the article that the pattern variant to use should depend on the use case. "Workflow: This API facilitates the definition and management of workflows in a distributed environment, ensuring harmonious coordination of multiple services through workflow patterns available such as task chaining, branching fan-out/fan-in and async calls to other services." ... "Dapr delivers unified APIs for service communication and infrastructure consumption." If Dapr does not want to support distributed transactions with saga patterns in the scope of workflows this is totally fine. If Dapr aims at helping in implementing the saga pattern in the context of workflows it makes probably sense to
What's needed from a DSL point of view?
It could make sense to have a look at other frameworks (Temporal, MassTransit Saga State Machines, Eventuate Tram Sagas) how they domain model saga patterns. In an ideal world it would be possible to abstract away pattern implementation details behind the Workflow API (representing the Workflow building block). But in detail it could be pretty challenging I guess (you know way better than I do). As long as such saga pattern domain modeling questions are not answered on the DSL level it's probably best to provide examples and docs about how one can implement those patterns with Dapr instead of providing potentially incomplete, orchestration based saga helpers for Java only. |
@fkromer I think Saga can make sense in more structured (constrained) frameworks, e.g., step functions, but not in a code-first framework like Dapr's Workflows. Saga is a simple idea whereas real-world workflows get complicated. The strength of code-first workflows is their ability to handle complexity. The dichotomy of choreography vs orchestration and the Saga pattern predate the new emerging landscape of Durable Computing (aka workflows) https://www.golem.cloud/post/the-emerging-landscape-of-durable-computing I think the only reason to even mention choreography and orchestration and Saga now are because they might be familiar to people approaching workflows from other places. I think Bernd Rücker covers this well in his articles and book: |
Does this really relate to the question of code-first vs. structured frameworks? Saga pattern variants are used to implement rather linear workflows with full transactionality, achieving at least once semantics with rollback on error in code. That's what people use in traditional microservice architectures if applying Domain Driven Design (DDD) for example. It's probably more a question about how a framework interprets and wants to interpret "workflows" in the future. According to workflow patterns Dapr interprets workflows pretty generic. For example w.r.t. fan-in/fan-out (data pipeline use case) there is no need for full transactionality in most use cases I guess. For many task chaining use cases there is need for. The example includes a compensation in the workflow scope. For many users it would be sufficient to add a task chaining code example with compensations on the activity level (instead of on the workflow level). People with a DDD background could be irritated cause workflows are app-scoped (no support for workflows spanning several bounded contexts). This limitation of usability of workflows in the DDD context can be compensated in a flexible manner in code with whatever pattern for distributed transactions (saga, 2pc, ...) fits an users use case best. Other frameworks add examples to their docs in such cases.
Absolutely agree. From a DSL point of view the pattern used to enable fully transactional workflow rollbacks is rather an implementation detail (implementation variants still having different pros and cons). Thanks for the hint. It's pretty interesting to learn about durable computing. Would be interesting how Dapr project maintainers would fit Dapr into the comparison table
He (Camunda platform) interprets workflows from a BPMN perspective. Means a business process oriented workflow interpretation similar to task chaining but potentially way more complex. W.r.t. to error handling in the BPMN context there are compensation events and error events. Beyond the scope of this issue. Still interesting probably (examples in the docs how Dapr can be used to model BPMN diagrams/workflows). The books looks interesting. I hope I'll find some time to have a deeper look into it. |
Related issue is #48