-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry platform-level errors in the isolated process for .NET isolated #2922
Conversation
bbbb91f
to
724ebea
Compare
724ebea
to
d333530
Compare
…e-extension into dajusto/add-failed-orch-tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left minor suggestion, otherwise, looks good
// From experience, this code runs in `<sourceCodePath>/bin/output/`, so we store the file two directories above. | ||
// We do this because the /bin/output/ directory gets overridden during the build process, which happens automatically | ||
// when `func host start` is re-invoked. | ||
string evidenceFile = System.IO.Path.Combine(System.IO.Directory.GetCurrentDirectory(), "..", "..", "replayEvidence"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would suggest giving these replayEvidence files test-specific names to avoid conflicts should multiple retry tests run simultaneously. Even if today they are run sequentially, this may not always be true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right there's a pending improvement here. I chose not to do this for now because, based on how this is written today, the sequential checks protect us here, and we have some pressure to release. But I'm in favor about making this more robust long term.
"maxQueuePollingInterval": "00:00:01", | ||
"controlQueueVisibilityTimeout": "00:01:00" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this just helps make sure we retry orchestrator replays fast enough after a platform-level error.
} | ||
} | ||
}, | ||
"functionTimeout": "00:00:30" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't time out sooner because our OOM-based test needs time to run out of memory.
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false)) | ||
{ | ||
// TODO: the `WorkerProcessExitException` type is not exposed in our dependencies, it's part of WebJobs.Host.Script. | ||
// Should we add that dependency or should it be exposed in WebJobs.Host? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the string-based comparison, but it'll be string-based unless we add WebJobs.Host.Script
to our dependencies. Open to a discussion here, but please note we have a customer waiting so let's look to be practical. I have no strong feelings, just trying to deploy this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebJobs.Host.Script
is not available for public consumption. We will need to work with the host team / WebJobs SDK to expose this information to extensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great PR! I have some questions and comments.
@@ -71,15 +71,15 @@ internal bool TryGetOrchestrationErrorDetails(out Exception? failure) | |||
return hasError; | |||
} | |||
|
|||
internal void SetResult(IEnumerable<OrchestratorAction> actions, string customStatus) | |||
internal void TrySetResult(IEnumerable<OrchestratorAction> actions, string customStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking nit: Any method named TryXXX
is usually expected to return a bool
value indicating whether it succeeded or failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you're right. II chose Try
to signal somehow that the method may throw, but I know the convention makes this confusing. I'll revert this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed here: 090f8e7
{ | ||
if (failureDetails.InnerFailure?.IsCausedBy<OutOfMemoryException>() ?? false) | ||
{ | ||
throw new OutOfMemoryException(failureDetails.ErrorMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually throwing OutOfMemoryException
is a bit of a red flag. It's really only supposed to be thrown by the runtime. It's partly problematic because there are various pieces of code in .NET and elsewhere that treat OOM and similar "fatal" exceptions differently. In some cases, I think it's supposed to crash the process.
I strongly recommend we throw something else, like SessionAbortedException
, which is designed for this exact scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. I'm now throwing SessionAbortedException
, see: 090f8e7
@@ -475,6 +475,24 @@ | |||
</summary> | |||
<param name="result">The result.</param> | |||
</member> | |||
<member name="M:Microsoft.Azure.WebJobs.Extensions.DurableTask.RemoteOrchestratorContext.ThrowIfPlatformLevelException(DurableTask.Core.FailureDetails)"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea why this is showing up in our public documentation? I thought these APIs were internal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that since I added method-level comments to the private method, they're made it to the xml. I think this is probably just to power the intellisense? Not sure, just a guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just intellisense, but also the public API docs that we publish, like this: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.webjobs.extensions.durabletask?view=azure-dotnet. If it's not meant to be published, then we should remove it from this xml file. Might be worth doing a bit of research as when when/why non-public APIs are getting included in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that adding the method-level comments was the culprit.
I removed them (moved them to the method body), and the xml change was undone: c927268
} | ||
} | ||
|
||
// assuming the orchestrator survived the OOM, delete the evidence file and return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this comment, and it's making me question whether I understand what this test is supposed to be doing. I'll ask simply: why would we expect an orchestrator to survive an OOM? It sounds like you're saying there was an OOM but it was handled and we kept running in spite of the OOM. Are you actually saying it recovered from a worker process crash (caused by the OOM)? Maybe OOM implies crash, but it's not 100% clear that's what you meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right the wording here was not great.
I do not expect an orchestration invocation to continue running beyond an OOM.
What I mean by survive was:
- orchestrator invocation A runs
- invocation A OOMs
- DF extension abandons the invocation
- DF extension retries the orchestrator invocation (now invocation B, every invocation is unique)
- invocation does not OOM now, and completes
So it "survived" because it retried.
I'm hoping the new comments here make this clearer: a70d014
Environment.FailFast("Simulating crash!"); | ||
} | ||
|
||
// assuming the orchestrator survived the OOM, delete the evidence file and return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment need to be updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this is the result of copy-pasting.
Updated: a70d014
|
||
// Function input comes from the request content. | ||
string instanceId = await client.ScheduleNewOrchestrationInstanceAsync( | ||
nameof(TimeoutOrchestrator)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than replicating the same HTTP function once for each scenario, it might make more sense to have a single HTTP trigger that takes an orchestrator function name as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly here, but my personal preference is to have specific HTTP endpoints for each orchestrator. When testing locally, it saves me time to just copy-paste the entire URL to trigger the orchestrator.
But it's no big deal. If you feel strongly, I'll generalize it. Just noting this was a deliberate choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly so I'll leave it up to you, but it would enable all the usual benefits of code sharing (easier to add new tests, better consistency b/c only one place to make updates, etc.)
Summarizing some offline discussion:
|
Thanks @jviau. I'll delegate this bit to you then:
As for the rest, I'll see what we can do. OOM is the key case we're trying to unblock via this PR (the rest are related "freebies"). @andystaples will start an internal thread about this. |
I think all of the points I brought up can also be addressed after this PR, but it is something we should commit to doing ASAP.
|
// - a worker process exit | ||
if (functionResult.Exception is Host.FunctionTimeoutException | ||
|| functionResult.Exception?.InnerException is SessionAbortedException // see RemoteOrchestrationContext.TrySetResultInternal for details on OOM-handling | ||
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify that WorkerProcessExitException
is actually available here? Looking at the function host code, I don't see the exception actually thrown: https://github.com/Azure/azure-functions-host/blob/1088f24c3ae3a6275f18cbf091fa525c2477be91/src/WebJobs.Script/Workers/ProcessManagement/WorkerProcess.cs#L179-L184
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's how we stumbled upon this error type in the first place. We found this error by forcing a process exit (Environment.FailFast
) and placing a breakpoint at this position.
But yeah I admit that I also don't see the explicit throw here. We can investigate further, or I can show you the runtime behavior through the debugger.
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false)) | ||
{ | ||
// TODO: the `WorkerProcessExitException` type is not exposed in our dependencies, it's part of WebJobs.Host.Script. | ||
// Should we add that dependency or should it be exposed in WebJobs.Host? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WebJobs.Host.Script
is not available for public consumption. We will need to work with the host team / WebJobs SDK to expose this information to extensions.
Supersedes: #2902
Joint work with @andystaples.
Problem: In .NET isolated today, platform-level errors in the isolated process like OOMs, Timeouts, and sudden process exits are not automatically retried by DF. Instead, those cause the orchestrator to be marked as failed. This is an unintended behavior change for .NET isolated. This PR aims to bring back robustness against those errors.
Most of the changes related to this exist in the
OutOfProcMiddleware
andRemoteOrchestrationContext
files."Extra scope": In trying to add automatic E2E tests to validate these improvements, I had to go down a rabbithole of .NET isolated testing issues in our smoke testing infrastructure today.
Long story short, our docker-based test runner for .NET isolated wasn't testing runtime correctness (it was passed the
-NoValidation
flag) because of intermittent race conditions when publishing .NET isolated apps to Docker. I didn't get far enough into debugging this to root cause the issue (@jviau tells me that upgrading the SDK may automatically solve them) but I did spend enough hours on it to conclude that, for the sake of unblocking this item, it would be practical to abandon the Docker-based testing for .NET isolated and instead use the GitHub actions CI environment directly.Therefore, this PR changes the definition of the .NET isolated smoke testing GitHub action to instead run the tests directly on the "ubuntu-latest" image. It also introduces a new test runner powershell script that's well-suited for simulating platform-level errors. I hope the comments in the code itself clarify the intent, but please let me know if not.
Ask for reviewers: the key shippable changes here are in
OutOfProcMiddleware
andRemoteOrchestrationContext
. Please look at those carefully. I've added comments explaining some of the quirky bits.