use case: ability to recover from illegal behavior in safe build modes #3516

Open
@mogud

Description

@mogud
Contributor

In my situation, most game servers I have designed so far use a service as the abstraction for everything, and millions of services can live in a single process. For robustness, the service manager catches errors/exceptions from all running services and chooses how to handle each one (kill the service or just ignore the error). It seems Zig will panic at runtime when something like division by zero occurs, and that is not recoverable.
So is it possible to add an option for this situation? As far as I know, Nim has many compiler check switches that turn these edge errors into runtime exceptions, Rust can call catch_unwind after a panic, and Go has a recover() builtin function.

Activity

Rocknest

Rocknest commented on Oct 23, 2019

@Rocknest
Contributor

There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

JesseRMeyer

JesseRMeyer commented on Oct 23, 2019

@JesseRMeyer

And millions of services can live in a single process.

This is probably not what you want. If a single fatal error occurs, then the entire process is destroyed. Instead, you want each service to run in its own process that communicates to other processes using some standard format. That way, if, say, the login service fails, players who are already logged in and playing are not booted from their session. Also, it makes it trivial to distribute across machines in a network. While that is a notable increase in complexity, the alternative of solving who catches which exception thrown by what when is probably just a rat's nest waiting to happen.

mogud

mogud commented on Oct 24, 2019

@mogud
Contributor, Author

Instead, you want each service to run in its own process that communicates to other processes using some standard format.

It's not possible to have millions of processes.

Also, it makes it trivial to distribute across machines in a network.

In fact, user-space code always uses RPC for communication and does not need to care whether a call crosses machines or not.
A gateway may keep players' connections, but typically the game logic for more than a thousand players must be handled within a single process. It's not acceptable for all of those players to be kicked out just because of a division-by-zero error. I think the proper way is to log the error and report it to the maintainers, who then decide whether it is necessary to shut down the game server and fix it.
So, if we can ensure that defers/errdefers work well and have a compiler option that provides a way to stop the unwind when a division by zero happens, we have more choices.

DaseinPhaos

DaseinPhaos commented on Oct 24, 2019

@DaseinPhaos

Instead, you want each service to run in its own process that communicates to other processes using some standard format.

It's not possible to have millions of processes.

Besides that, the question remains on who gets to decide how fatal an error is.

JesseRMeyer

JesseRMeyer commented on Oct 24, 2019

@JesseRMeyer

It's not possible to have millions of processes.

Yes, it is possible, especially on a distributed network of servers. But its feasibility depends on your definition of a service, kernel and related architectural choices. Whether we process a single user or tens of thousands of them on a single process is an important decision, and error propagation does seem to have a say here, regardless of my architecture comments.

Zig maintains the mantra of no hidden control flow, and software exceptions violate that principle outright. But I agree that users should wield full control over error handling. If the runtime already catches these errors, it should first propagate them to the user process to see if it cares and wants to handle it directly, and if not, return it back for the default behavior.

emekoi

emekoi commented on Oct 24, 2019

@emekoi
Contributor

what's wrong with

fn safe_div(a: var, b: @typeOf(a)) !@typeOf(a) {
    @setRuntimeSafety(false);
    if (b == 0) return error.DivisionByZero;
    return a / b;
}

and with #489, some of the performance hit from the check can be optimized away.
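
For comparison, the standard library has an error-returning division along the same lines, std.math.divTrunc. A minimal sketch, assuming a reasonably recent Zig toolchain; `divOr` is a hypothetical helper name used only for illustration, not something from std or from this thread:

```zig
const std = @import("std");

// Hypothetical helper: divide, but fall back to a caller-chosen value
// instead of panicking when the divisor is zero (or the result overflows).
fn divOr(comptime T: type, a: T, b: T, fallback: T) T {
    return std.math.divTrunc(T, a, b) catch fallback;
}

test "division by zero is an ordinary error value" {
    // std.math.divTrunc reports a zero divisor as error.DivisionByZero.
    try std.testing.expectError(error.DivisionByZero, std.math.divTrunc(i32, 7, 0));
    // The helper converts that error into the fallback value.
    try std.testing.expectEqual(@as(i32, -1), divOr(i32, 7, 0, -1));
}
```

The caller decides at each call site whether to propagate the error with `try` or substitute a value with `catch`, which is the explicit control flow this approach trades for the verbosity discussed in the replies.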

JesseRMeyer

JesseRMeyer commented on Oct 24, 2019

@JesseRMeyer

@emekoi Are you suggesting that as a user or Standard Library function?

Here's why -- I do not want to pollute every div() callsite I make with error handling, especially when I know that the divisor is not 0, as inputs are often sanitized long before computations on them are performed. This is the crux of the problem: if we address this at too fine a granularity, then the whole structure around it pays the cost in support. I suppose in those cases, the binary / would suffice, so maybe renaming this to safe_div() would indicate its purpose.

mogud

mogud commented on Oct 24, 2019

@mogud
Contributor, Author

@emekoi

  1. It is verbose, because everywhere I must use a div function call instead of a simple binary operator.
  2. It is awful to review other people's code to make sure they follow the right way, or else I have to create a static analyzer.
  3. It is hard to reuse third-party libraries, because obviously, they use /.
  4. Overflow is also an unrecoverable error; does this mean none of the builtin arithmetic operators can be used? I think that's really, really inconvenient.

Rocknest

Rocknest commented on Oct 24, 2019

@Rocknest
Contributor

@mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is Zig, C, or assembly. If you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.

emekoi

emekoi commented on Oct 24, 2019

@emekoi
Contributor

@JesseRMeyer if you know that the divisor is not zero, then just use /. there shouldn't be an issue if your input is already sanitized. as for overflow, we have compiler intrinsics that handle overflow. you can also just catch the exceptions from the OS, and go from there.

furthermore, if you're running a game server i think it is in your and your users' best interests if you carefully review all the libraries that you use...
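
For the overflow case mentioned above, one option is std.math.add, which wraps the overflow-detecting intrinsic and reports overflow as an error rather than a panic. A small sketch, assuming a recent Zig toolchain; `addScore` and its saturating policy are hypothetical choices made for the example:

```zig
const std = @import("std");

// Hypothetical helper: add a delta to a score, saturating at the
// maximum u32 value instead of panicking when the sum overflows.
fn addScore(score: u32, delta: u32) u32 {
    return std.math.add(u32, score, delta) catch std.math.maxInt(u32);
}

test "overflow is reported as error.Overflow" {
    // 200 + 100 does not fit in a u8, so std.math.add returns an error.
    try std.testing.expectError(error.Overflow, std.math.add(u8, 200, 100));
    // The helper turns that error into a saturated result.
    try std.testing.expectEqual(@as(u32, std.math.maxInt(u32)), addScore(std.math.maxInt(u32), 1));
}
```

This only covers code that opts in to the checked functions; third-party code using the plain operators keeps the panicking behavior raised earlier in the thread.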

JesseRMeyer

JesseRMeyer commented on Oct 24, 2019

@JesseRMeyer

you can also just catch the exceptions from the OS, and go from there

How do we accomplish this in Zig?

emekoi

emekoi commented on Oct 24, 2019

@emekoi
Contributor

it depends on the OS, but for windows you can use Structured Exception Handling like you would in C and for unix systems you can use signal handlers. we already use these to catch segmentation faults in debug mode on supported systems. the relevant code is from this line down.

JesseRMeyer

JesseRMeyer commented on Oct 24, 2019

@JesseRMeyer

Thanks.

If user code can explicitly override Zig's safety features with their own, then that makes me glad.

rohlem

rohlem commented on Oct 24, 2019

@rohlem
Contributor

Other related issues that haven't been mentioned yet: #1740 , #426 (note: rejected), #1356 (note: only tangentially related, discussion seemed to disfavour recover-like mechanisms).

In debug builds, a Zig panic calls the root source file's panic handler (doesn't seem documented yet - mentioned in documentation of @panic ). You are free to provide an implementation with e.g. a longjmp: anything that honors the noreturn return type, i.e. that never returns directly to the panicked stack.

The main issue is that in completely-optimized builds (ReleaseFast, ReleaseSmall), the LLVM IR that is emitted results in undefined behaviour. If you want uncompromised speed, you need to compromise recoverability (as far as I understand it). How recoverable that currently resulting undefined behaviour ends up being is left for the backend, currently LLVM, to decide.

If your main concern is correctness/stability, then allocating separate stack memory for each service invocation and having a longjmp-or-equivalent return plan from the panic handler might be an acceptable solution.

I also thought I remembered (but now can't find) another more in-depth discussion about turning each instance of detectable illegal behaviour into returning a standard error code - again, this prevents full-fledged optimizations.
Note that whatever judgement mainline Zig ends up passing, with Zig's parser being part of the standard library, it might be feasible for you to add a compilation step that replaces certain unsafe expressions (like panicking operators) with safer function calls (like the error-returning alternatives from std.math, or a non-error fallback return value).
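
To illustrate the std.math route: once a service body uses the error-returning functions, a failure is an ordinary error value that a manager can log and act on. A rough sketch only; `averageService`, `Outcome`, and `runOnce` are hypothetical names, and this does nothing for panics raised by code that still uses the plain operators:

```zig
const std = @import("std");

// Hypothetical service body: uses std.math.divTrunc instead of `/`,
// so a zero count is reported as error.DivisionByZero, not a panic.
fn averageService(total: i64, count: i64) !i64 {
    return std.math.divTrunc(i64, total, count);
}

// Hypothetical result a manager could act on (log, kill, or ignore).
const Outcome = union(enum) {
    ok: i64,
    failed: anyerror,
};

// Run one invocation and capture any error as data for the manager.
fn runOnce(total: i64, count: i64) Outcome {
    const value = averageService(total, count) catch |err| {
        return Outcome{ .failed = err };
    };
    return Outcome{ .ok = value };
}

test "a failing service is reported, not fatal" {
    switch (runOnce(10, 0)) {
        .failed => |err| try std.testing.expect(err == error.DivisionByZero),
        .ok => return error.TestUnexpectedResult,
    }
}
```

The manager stays in control because the failure travels through the return value, so defers in the service body run as usual.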

15 remaining items

rohlem

rohlem commented on Oct 26, 2019

@rohlem
Contributor

So a proposal to make detected illegal behavior recoverable would have to solve the problem that jumping straight to the panic function from an async function would leave the awaiter hanging. If an async function does not make it to the return statement, its awaiter will hang forever, likely leaking resources, or worse, breaking invariants of data structures.

Currently @panic only receives a message, and the implementation of std.debug.panic retrieves the stack frame information via other means.
Assuming we can (note: limited to safe build modes) query whether the current stack is async, we could expose builtins @currentAwaiter() ?*anyawaiter and @returnToAwaiter(*anyawaiter) noreturn. Then the panic implementation could do:

fn panic(...) noreturn {
    any_panic_impl(...); //print stack trace etc.
    if(@currentAwaiter()) |awaiter| {
        //Potentially fill/initialize the return value the awaiter is awaiting; trickier, see below.
        @returnToAwaiter(awaiter); //note: of type noreturn
    }
    os.abort(); //or whatever else you do if you panic on the main stack (or on a stack currently without an awaiter)
}

This way the recoverability is a completely optional feature (maybe even opt-in compile-time toggle-able, akin to --single-threaded). Since we already have safety features for resuming non-suspended functions, I'm 90% sure that this would already be implementable.

Filling the awaiter's return value seems a little tricky: We could have a builtin to provide *@OpaqueType() that can be cast if the type is consistent across all async functions in the codebase.
Maybe error unions could be generalized in their layout to the point where the builtin can provide a *anyerror for any anyerror!T ; that would make it quite elegant to use, actually.

Otherwise switching on the type would require some runtime representation of the type (maybe via an auto-collected builtin enum similar to how anyerror is populated), but these ideas sound overcomplicating to me; for this particular use case a userland protocol would be sufficient:

//scheduler
var succeeded: bool = false;
const success_result = async failable_afunc(&succeeded, ...);
if(succeeded){
    //use success_result ...
}else{
    //handle failure... | success_result is undefined, do not access!
}

fn failable_afunc(succeeded: *bool) T {
    //we need to somehow prohibit the optimizer from executing the assignment any earlier, which might not appear observable locally.
    defer succeeded.* = true;
    //application logic implementation
}

(As an alternative to @currentAwaiter() we could introduce a separate panicAsync(?*awaiter) T, and @panic decides which one to use depending on if it's called on an async stack. Then the call of @returnToAwaiter and maybe also setting the awaiter's awaited return value could be hidden after the return of panicAsync (maybe of return type anyerror ?). This would reduce both complexity and flexibility/control of the feature in my eyes.)

suirad

suirad commented on Nov 2, 2019

@suirad
Contributor

Since @panic is somewhat of an exception to the Zig rule of no hidden control flow, out of necessity, perhaps it could be a tool in the modes in which it is available (debug/release-safe). It seems to me that something like a temporary panic handler for a single scope could be feasible. Perhaps purity of the scope could determine the eligibility of code/functions used within it, since side effects affect the recoverability of state.

shawnl

shawnl commented on Nov 5, 2019

@shawnl
Contributor

There are no runtime exceptions in zig. Also it is undefined behavior if runtime safety is turned off (release-fast etc.)

LLVM does not make all of these undefined behavior, but downgrades what it can to merely producing undefined values. Zig should know the difference, and be able to recover from undefined values. The big exception to this is divide by zero, which raises SIGFPE.

The general fix for this is to add setjmp()/longjmp() support to zig, which is #1656.

modified the milestones: 0.6.0, 0.7.0 on Feb 21, 2020
ityonemo

ityonemo commented on May 15, 2020

@ityonemo
Sponsor, Contributor

Just wanted to add that in my use case (FFI with the Erlang VM) I'd like to turn on a panic-trapping feature when I drive external unit test suites, so that the calling VM can record a Zig panic (unreachable/undefined behavior inside Zig) in release-safe/debug builds without disrupting test counts, test tracking, or CI. An opt-in ability to somehow trap a panic would be very useful. Conceptually this could easily take the form of a setjmp/longjmp that I can drop in at the Zig/Erlang boundary and recover from in the event of a panic. If Zig doesn't want to support this, that's fine; a panic during unit tests is also a valid way of alerting that there's something wrong with the code.

modified the milestones: 0.7.0, 0.8.0 on Oct 17, 2020
modified the milestones: 0.8.0, 0.9.0 on Nov 6, 2020
added the "use case" label (Describes a real use case that is difficult or impossible, but does not propose a solution.) on Mar 21, 2021
modified the milestones: 0.9.0, 0.10.0 on May 19, 2021
iacore

iacore commented on Nov 23, 2022

@iacore
Contributor

@panic seems to send SIGABRT. You can catch this inside the process, or in its parent process.

Use shared memory & exit code to send error message back to parent.

matu3ba

matu3ba commented on Apr 20, 2023

@matu3ba
Contributor

recover from illegal behavior

Recovering from errors (so as not to run into failures) requires specifying what the safe and well-defined states are. How should Zig know this? And once you can specify them, why can you not code them yourself?

To ask more specifically: which classes of recoverable state and of execution context should Zig support?

It seems zig will panic at runtime when something like division by zero occurs, and it's not recoverable.

The purpose of optimizing compilers, and of the languages that define them, is to provide optimal machine code for the supported performance use cases, which requires the possible code semantics to be written out explicitly.
Zig has performance defaults for math stuff, which includes a / b trapping the CPU and crashing your program in safe modes.

As I understand you, you are asking to let the caller change source code semantics, the way macros or operator overloading are used in C/C++ etc., typically to work around bugs in called code (i.e. code not intended for the use case). Is that correct?

          use case: ability to recover from illegal behavior in safe build modes · Issue #3516 · ziglang/zig