Revisit error strategy #618

Mirko-von-Leipzig · 2025-01-14T08:06:43Z

We currently follow the thiserror for libs and anyhow for binaries rule-of-thumb. However, our crates in this repo are more a way of separating concerns and components -- and are not really intended as libraries (excepting the RPC/proto maybe).

This can be seen in our error handling where we almost never match on error variants, but instead wrap and bubble them up. One notable exception is in RPC methods where we need to differentiate between internal and user-facing errors.

We should re-evaluate our error handling strategy. At minimum we could probably merge a bunch of the various error enums e.g. store errors are per method, but these could probably just be a single enum.

A more extreme approach could be to default to anyhow::Result unless required otherwise. RPC errors would likely just be split into external/internal wrappers of anyhow, and if a component becomes interested in a specific error condition then that variant could be added.

The text was updated successfully, but these errors were encountered:

Mirko-von-Leipzig · 2025-01-15T09:26:58Z

I believe we can broadly categorise node errors into user errors and operation errors.

User facing errors

These are caused by external forces acting upon the node aka some form of user interaction, and require that the failure is communicated back to the user.

These are restricted to the RPC/web-interface. Here we need to be careful to not expose internal errors that the user can

do nothing about, and
could exploit to hurt node performance

Essentially, our RPC and web services errors are split into internal and user errors. User errors are likely limited to

request deserialization errors
bad/invalid request
rate-limiting

and maybe some others I haven't thought of. Almost everything else would be Internal(anyhow::Error) which gets logged but not exposed to the user.

Operation errors

Internal errors which can imo be split into two categories:

Recoverable aka log and continue operations
Irrecoverable aka panic the task/process/node (whichever is applicable)

To allow for (1) we may need some error variants, but I suspect we'll find most of our existing operational errors lead to a node shutdown.

In my experience it is often better to let components fail completely, and rather invest in a robust restart process. You require such a process in any case to cater for the (2) errors, and deciding on how to recover from (1) in a safe way can often be more trouble than its worth.

Proposed approach

Start at the top-level binary and work out which errors simply cause node death. An obvious example is our store initialization aka database setup and validation. If that fails there is literally nothing to be done. Replace such errors with anyhow and see what falls out.

Mirko-von-Leipzig · 2025-01-15T09:31:35Z

We should also decide whether to treat each node component as isolated. Or put differently: does the store consider the block-producer and rpc as users, or can these be seen as internal.

I'm pretty strongly in the the latter camp since we already do things like assume the rpc has validated the transaction proof. Or in other words, we already fully trust the other components.

This implies that certain RPC interfaces are considered fully internal, and don't need to separate internal/user errors. Though we probably should be more specific here since rpc acts as a proxy for the store in many cases.

bobbinth · 2025-01-15T09:34:47Z

I'm pretty strongly in the the latter camp since we already do things like assume the rpc has validated the transaction proof. Or in other words, we already fully trust the other components.

Yes, I think we can assume that the same entity would run these components (potentially on different machines) and so they would fully trust each other.

PhilippGackstatter · 2025-01-15T15:00:09Z

User facing errors

For those I would especially make sure that they contain helpful details and give the user some idea of what could be done to fix an RPC call, if anything. Not exactly sure how errors translate to RPC, but assuming they're stringified we would probably want to generate a report at this boundary (e.g. using anyhow).

Almost everything else would be Internal(anyhow::Error) which gets logged but not exposed to the user.

That's probably a good strategy. These errors might expose internal details of the node or the system the node is running on, which could be considered an information leak and potentially be exploited.

Mirko-von-Leipzig mentioned this issue Jan 15, 2025

refactor: standardize errors #611

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit error strategy #618

Revisit error strategy #618

Mirko-von-Leipzig commented Jan 14, 2025

Mirko-von-Leipzig commented Jan 15, 2025 •

edited

Loading

Mirko-von-Leipzig commented Jan 15, 2025

bobbinth commented Jan 15, 2025

PhilippGackstatter commented Jan 15, 2025

Revisit error strategy #618

Revisit error strategy #618

Comments

Mirko-von-Leipzig commented Jan 14, 2025

Mirko-von-Leipzig commented Jan 15, 2025 • edited Loading

User facing errors

Operation errors

Proposed approach

Mirko-von-Leipzig commented Jan 15, 2025

bobbinth commented Jan 15, 2025

PhilippGackstatter commented Jan 15, 2025

Mirko-von-Leipzig commented Jan 15, 2025 •

edited

Loading