Open
Description
In my situation, most game servers I have designed so far use service as an abstraction of everything, and millions of services can live in a single process. For robustness, the service manager catches errors/exceptions from all running services and chooses a proper action for each (kill the service or just ignore the error). It seems Zig will panic at runtime when something like division by zero occurs, and that is not recoverable.
So is it possible to add an option for this situation? As far as I know, nim has many compiler check switches that turn these edge errors into runtime exceptions, rust can do `catch_unwind` after a panic, and go has a `recover()` builtin function.
Activity
Rocknest commented on Oct 23, 2019
There are no runtime exceptions in Zig. Also, such an operation is undefined behavior if runtime safety is turned off (release-fast etc.).
JesseRMeyer commented on Oct 23, 2019
This is probably not what you want. If a single fatal error occurs, then the entire process is destroyed. Instead, you want each service to run in its own process that communicates with other processes using some standard format. That way, if, say, the login service fails, players who are already logged in and playing are not booted from their session. It also makes it trivial to distribute services across machines in a network. While that is a notable increase in complexity, the alternative of working out who catches which exception, thrown by what and when, is probably just a rat's nest waiting to happen.
mogud commented on Oct 24, 2019
It's not possible to have millions of processes.
In fact, user-space code always uses RPC for communication and does not need to care whether it crosses machines or not.
A gateway may keep players' connections, but typically the game logic for more than a thousand players must be handled within a single process. It's not acceptable for all of those players to be kicked out because of a single division-by-zero error. I think a proper way is to log the error and report it to the maintainers, who will then decide whether it is necessary to shut down the game server and fix it.
So, if we can ensure that defers/errdefers work well and have a compiler option to stop the unwind when a division by zero happens, we have more choices.
DaseinPhaos commented on Oct 24, 2019
Besides that, the question remains on who gets to decide how fatal an error is.
DaseinPhaos commented on Oct 24, 2019
Probably relevant: #395, @thejoshwolfe's comment on error handling
JesseRMeyer commented on Oct 24, 2019
Yes, it is possible, especially on a distributed network of servers. But its feasibility depends on your definition of a service, kernel and related architectural choices. Whether we process a single user or tens of thousands of them on a single process is an important decision, and error propagation does seem to have a say here, regardless of my architecture comments.
Zig maintains the mantra of no hidden control flow, and software exceptions violate that principle outright. But I agree that users should wield full control over error handling. If the runtime already catches these errors, it should first propagate them to the user process to see if it cares and wants to handle it directly, and if not, return it back for the default behavior.
emekoi commented on Oct 24, 2019
what's wrong with
and with #489, some of the performance hit from the check can be optimized away.
JesseRMeyer commented on Oct 24, 2019
@emekoi Are you suggesting that as a user or Standard Library function?
Here's why -- I do not want to pollute every div() callsite I make with error handling, especially when I know that the divisor is not 0, as inputs are often sanitized long before computations on them are performed. This is the crux of the problem: if we address this at too fine a granularity, then the whole structure around it pays the cost in support. I suppose in those cases the binary `/` would suffice, so maybe renaming this to safe_div() would indicate its purpose.
mogud commented on Oct 24, 2019
@emekoi `/` .overflow is also an unrecoverable error. Does this mean that all builtin arithmetic operators cannot be used? I think that is really inconvenient.
Rocknest commented on Oct 24, 2019
@mogud It is a bad idea to run 'untrusted' code in a single monolithic process, regardless of whether it is Zig, C, or assembly. If you want to run it safely, use some kind of sandbox. For example, you can compile your 'services' to wasm.
emekoi commented on Oct 24, 2019
@JesseRMeyer if you know that the divisor is not zero, then just use `/`. there shouldn't be an issue if your input is already sanitized. as for overflow, we have compiler intrinsics that handle overflow. you can also just catch the exceptions from the OS, and go from there.
furthermore, if you're running a game server i think it is in your and your users' best interests to carefully review all the libraries that you use...
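The split described here (plain `/` where the input is already known to be good, an error-returning helper where it is not) can be sketched as follows. This is C rather than Zig, purely for illustration; `safe_div` and `safe_add` are hypothetical names, and `__builtin_add_overflow` is a GCC/Clang builtin, so the sketch assumes one of those compilers.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: reports failure instead of trapping.
 * Returns true and stores the quotient in *out on success. */
bool safe_div(int num, int den, int *out) {
    if (den == 0) {
        return false;                 /* division by zero */
    }
    if (num == INT_MIN && den == -1) {
        return false;                 /* quotient does not fit in int */
    }
    *out = num / den;
    return true;
}

/* Hypothetical helper built on a compiler intrinsic (GCC/Clang only).
 * Returns true and stores the sum in *out if no overflow occurred. */
bool safe_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```

Call sites that have already validated their inputs keep using the bare operators; only the boundaries that receive unchecked values pay for the extra branch and the error handling.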
JesseRMeyer commented on Oct 24, 2019
How do we accomplish this in Zig?
emekoi commented on Oct 24, 2019
it depends on the OS, but for windows you can use Structured Exception Handling like you would in C and for unix systems you can use signal handlers. we already use these to catch segmentation faults in debug mode on supported systems. the relevant code is from this line down.
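For the Unix side, a rough sketch of the signal-handler approach might look like this in C. It is not the Zig standard library's code: `run_guarded`, `on_fault`, and the two example services are hypothetical names, and the fault is simulated with `raise(SIGFPE)` because a real integer division by zero is undefined behavior in C and does not trap on every CPU.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t caught_signal;

/* Signal handler: remember which signal fired, then jump back
 * to the point saved in run_guarded(). */
static void on_fault(int signo) {
    caught_signal = signo;
    siglongjmp(recover_point, 1);
}

/* Runs `service` with SIGFPE trapped.
 * Returns 0 if it finished, the signal number if it faulted,
 * or -1 if the handler could not be installed. */
int run_guarded(void (*service)(void)) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) != 0) {
        return -1;
    }

    caught_signal = 0;
    /* Second argument 1: save the signal mask so that siglongjmp
     * restores it and SIGFPE is not left blocked. */
    if (sigsetjmp(recover_point, 1) == 0) {
        service();
    }

    sigaction(SIGFPE, &old, NULL);    /* restore the previous handler */
    return caught_signal;
}

/* Example services (hypothetical). */
void healthy_service(void) {}
void faulty_service(void) { raise(SIGFPE); }
```

Jumping out of the handler abandons the faulting frames without running any cleanup they had pending, so a manager using this pattern still has to decide what state of the failed service can be trusted.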
JesseRMeyer commented on Oct 24, 2019
Thanks.
If user code can explicitly override Zig's safety features with their own, then that makes me glad.
rohlem commented on Oct 24, 2019
Other related issues that haven't been mentioned yet: #1740 , #426 (note: rejected), #1356 (note: only tangentially related, discussion seemed to disfavour recover-like mechanisms).
In debug builds, a Zig panic calls the root source file's panic handler (doesn't seem documented yet - mentioned in documentation of @panic ). You are free to provide an implementation with e.g. a longjmp -- anything that satisfies the `noreturn` return type, i.e. that doesn't expect to return directly to the panicked stack.
The main issue is that in completely-optimized builds (ReleaseFast, ReleaseSmall), the LLVM IR that is emitted results in undefined behaviour. If you want uncompromised speed, you need to compromise recoverability (as far as I understand it). How recoverable that currently resulting undefined behaviour ends up being is left for the backend, currently LLVM, to decide.
If your main concern is correctness/stability, then allocating separate stack memory for each service invocation and having a longjmp-or-equivalent return plan from the panic handler might be an acceptable solution.
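The control flow of such a return plan can be sketched in C as below. This only illustrates the idea and is not Zig's panic handler interface: `panic_handler`, `div_or_panic`, `run_service`, and the example services are hypothetical, and the separate per-service stack is left out to keep the sketch short.

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf manager_frame;     /* where the service manager resumes */
const char *last_panic_msg;       /* message of the most recent panic */

/* Never returns to its caller: it jumps back to the manager instead. */
_Noreturn void panic_handler(const char *msg) {
    last_panic_msg = msg;
    longjmp(manager_frame, 1);
}

/* A checked operation that panics instead of trapping. */
int div_or_panic(int a, int b) {
    if (b == 0) {
        panic_handler("division by zero");
    }
    return a / b;
}

/* Returns 0 if the service ran to completion, 1 if it panicked. */
int run_service(void (*service)(void)) {
    last_panic_msg = NULL;
    if (setjmp(manager_frame) != 0) {
        return 1;                 /* arrived here via longjmp */
    }
    service();
    return 0;
}

/* Example services (hypothetical). */
void ok_service(void)  { (void)div_or_panic(10, 2); }
void bad_service(void) { (void)div_or_panic(10, 0); }
```

A longjmp does not unwind: cleanup registered in the abandoned frames is skipped, which is one reason to give each service invocation its own memory that the manager can discard wholesale after a panic.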
I also thought I remembered (but now can't find) another more in-depth discussion about turning each instance of detectable illegal behaviour into returning a standard error code - again, this prevents full-fledged optimizations.
Note that whatever judgement mainline Zig ends up passing, with Zig's parser being part of the standard library, it might be feasible for you to add a compilation step that replaces certain unsafe expressions (like panicking operators) with safer function calls (like the error-returning alternatives from `std.math`, or a non-error fallback return value).
rohlem commented on Oct 26, 2019
Currently `@panic` only receives a message, and the implementation of `std.debug.panic` retrieves the stack frame information via other means.
Assuming we can (note: limited to safe build modes) query whether the current stack is `async`, we could expose builtins `@currentAwaiter() ?*anyawaiter` and `@returnToAwaiter(*anyawaiter) noreturn`. Then the panic implementation could do:
This way the recoverability is a completely optional feature (maybe even opt-in compile-time toggle-able, akin to --single-threaded). Since we already have safety features for resuming non-suspended functions, I'm 90% sure that this would already be implementable.
Filling the awaiter's return value seems a little tricky: We could have a builtin to provide `*@OpaqueType()` that can be cast if the type is consistent across all async functions in the codebase.
Maybe error unions could be generalized in their layout to the point where the builtin can provide a `*anyerror` for any `anyerror!T`; that would make it quite elegant to use, actually.
Otherwise switching on the type would require some runtime representation of the type (maybe via an auto-collected builtin enum similar to how `anyerror` is populated), but these ideas sound overcomplicating to me; for this particular use case a userland protocol would be sufficient:
(As an alternative to `@currentAwaiter()` we could introduce a separate `panicAsync(?*awaiter) T`, and `@panic` decides which one to use depending on if it's called on an `async` stack. Then the call of `@returnToAwaiter` and maybe also setting the awaiter's awaited return value could be hidden after the return of `panicAsync` (maybe of return type `anyerror`?). This would reduce both complexity and flexibility/control of the feature in my eyes.)
suirad commented on Nov 2, 2019
Since `@panic` is, out of necessity, somewhat of an exception to the Zig rule of no hidden control flow, perhaps it could be a tool in the modes in which it is available (debug/release-safe). It seems to me that something like a temporary panic handler for a single scope could be feasible. Perhaps purity of the scope could determine the eligibility of code/functions used within it, since side effects affect the recoverability of state.
shawnl commented on Nov 5, 2019
LLVM does not make all of these undefined behavior, but downgrades what it can to only produce undefined values. Zig should know the difference, and be able to recover from undefined values. The big exception to this is divide by zero, which raises SIGFPE.
The general fix for this is to add `setjmp()`/`longjmp()` support to zig, which is #1656.
ityonemo commented on May 15, 2020
Just wanted to add that in my use case (FFI with the erlang VM) I'd like to turn on a panic trapping feature when I drive external unit test suites, so that a zig panic can record unreachable/undefined behavior inside zig from the calling VM in release-safe/release-debug and not disrupt test counts/test tracking/CI. An opt-in ability to somehow trap a panic would be very useful. Conceptually this could easily take the form of a setjmp/longjmp that I can drop in at the zig/erlang boundary and recover from in the event of a panic. If zig doesn't want to support this that's fine; a panic during unit tests is also a valid way of alerting that there's something wrong with the code.
iacore commented on Nov 23, 2022
`@panic` seems to send SIGABRT. You can catch this inside the process, or in its parent process.
Use shared memory & exit code to send error message back to parent.
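The parent-process variant can be sketched in C with `fork` and `waitpid`. The names `run_isolated`, `fine_service`, and `crashing_service` are hypothetical, the child calls `abort()` to stand in for a panic that ends in SIGABRT, and passing an error message back through shared memory is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `service` in a child process.
 * Returns 0 if the child exited cleanly with status 0,
 * the signal number if a signal killed it, and -1 otherwise. */
int run_isolated(void (*service)(void)) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                    /* fork failed */
    }
    if (pid == 0) {
        service();                    /* child: run the service */
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);      /* e.g. SIGABRT after abort() */
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return -1;
}

/* Example services (hypothetical). */
void fine_service(void) {}
void crashing_service(void) { abort(); }
```

The parent survives the child's abort and can log the failure or restart the service, at the cost of one process per isolated unit of work.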
matu3ba commented on Apr 20, 2023
Recovering from errors (so as not to run into failures) requires specifying what the safe and well-defined states are. How should Zig know this? And once you can specify them, why can you not code them?
More specifically: what are the classes of recoverable state and of execution context that Zig should support?
The purpose of optimizing compilers, and of the languages that define them, is to provide optimal machine code for the supported performance use cases, which requires "explicitly writing" the possible code semantics.
Zig has performance defaults for math stuff, which includes `a / b` trapping the CPU and crashing your program in safe modes.
As I understand you, you are asking to let the caller change source code semantics, like what macros or operator overloading are used for in C/C++ etc. (typically to work around bugs in called code, i.e. code not intended for the use case). Is that correct?