Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit error strategy #618

Open
Mirko-von-Leipzig opened this issue Jan 14, 2025 · 4 comments
Open

Revisit error strategy #618

Mirko-von-Leipzig opened this issue Jan 14, 2025 · 4 comments

Comments

@Mirko-von-Leipzig
Copy link
Contributor

We currently follow the thiserror for libs and anyhow for binaries rule-of-thumb. However, our crates in this repo are more a way of separating concerns and components -- and are not really intended as libraries (excepting the RPC/proto maybe).

This can be seen in our error handling where we almost never match on error variants, but instead wrap and bubble them up. One notable exception is in RPC methods where we need to differentiate between internal and user-facing errors.

We should re-evaluate our error handling strategy. At minimum we could probably merge a bunch of the various error enums e.g. store errors are per method, but these could probably just be a single enum.

A more extreme approach could be to default to anyhow::Result unless required otherwise. RPC errors would likely just be split into external/internal wrappers of anyhow, and if a component becomes interested in a specific error condition then that variant could be added.

@Mirko-von-Leipzig
Copy link
Contributor Author

Mirko-von-Leipzig commented Jan 15, 2025

I believe we can broadly categorise node errors into user errors and operation errors.

User facing errors

These are caused by external forces acting upon the node aka some form of user interaction, and require that the failure is communicated back to the user.

These are restricted to the RPC/web-interface. Here we need to be careful to not expose internal errors that the user can

  1. do nothing about, and
  2. could exploit to hurt node performance

Essentially, our RPC and web services errors are split into internal and user errors. User errors are likely limited to

  1. request deserialization errors
  2. bad/invalid request
  3. rate-limiting

and maybe some others I haven't thought of. Almost everything else would be Internal(anyhow::Error) which gets logged but not exposed to the user.

Operation errors

Internal errors which can imo be split into two categories:

  1. Recoverable aka log and continue operations
  2. Irrecoverable aka panic the task/process/node (whichever is applicable)

To allow for (1) we may need some error variants, but I suspect we'll find most of our existing operational errors lead to a node shutdown.

In my experience it is often better to let components fail completely, and rather invest in a robust restart process. You require such a process in any case to cater for the (2) errors, and deciding on how to recover from (1) in a safe way can often be more trouble than its worth.

Proposed approach

Start at the top-level binary and work out which errors simply cause node death. An obvious example is our store initialization aka database setup and validation. If that fails there is literally nothing to be done. Replace such errors with anyhow and see what falls out.

@Mirko-von-Leipzig
Copy link
Contributor Author

We should also decide whether to treat each node component as isolated. Or put differently: does the store consider the block-producer and rpc as users, or can these be seen as internal.

I'm pretty strongly in the the latter camp since we already do things like assume the rpc has validated the transaction proof. Or in other words, we already fully trust the other components.

This implies that certain RPC interfaces are considered fully internal, and don't need to separate internal/user errors. Though we probably should be more specific here since rpc acts as a proxy for the store in many cases.

@bobbinth
Copy link
Contributor

I'm pretty strongly in the the latter camp since we already do things like assume the rpc has validated the transaction proof. Or in other words, we already fully trust the other components.

Yes, I think we can assume that the same entity would run these components (potentially on different machines) and so they would fully trust each other.

@PhilippGackstatter
Copy link

User facing errors

For those I would especially make sure that they contain helpful details and give the user some idea of what could be done to fix an RPC call, if anything. Not exactly sure how errors translate to RPC, but assuming they're stringified we would probably want to generate a report at this boundary (e.g. using anyhow).

Almost everything else would be Internal(anyhow::Error) which gets logged but not exposed to the user.

That's probably a good strategy. These errors might expose internal details of the node or the system the node is running on, which could be considered an information leak and potentially be exploited.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants