agent: update all APIs to use ApiErrorExt #2006

jgraettinger · 2025-03-13T04:53:19Z

Return considered status codes for most every error response.

Notably, return 404 if a valid task is attempting to authorize a collection which doesn't exist, as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection.
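
As a rough illustration of what "considered status codes" could look like, the sketch below maps a few error kinds to HTTP statuses. The `ErrorKind` names, `status_for`, and `is_recoverable_for_task` are hypothetical and are not the `ApiErrorExt` API itself; only the 404-for-a-missing-collection case comes from the description above, and the other mappings are assumptions.

```python
from enum import Enum
from http import HTTPStatus


class ErrorKind(Enum):
    """Hypothetical error kinds; not the agent's actual types."""
    INVALID_TOKEN = "invalid_token"
    NOT_AUTHORIZED = "not_authorized"
    COLLECTION_NOT_FOUND = "collection_not_found"
    INTERNAL = "internal"


# Assumed mapping for illustration. Only the 404 case is stated in this PR.
_STATUS = {
    ErrorKind.INVALID_TOKEN: HTTPStatus.UNAUTHORIZED,
    ErrorKind.NOT_AUTHORIZED: HTTPStatus.FORBIDDEN,
    ErrorKind.COLLECTION_NOT_FOUND: HTTPStatus.NOT_FOUND,
    ErrorKind.INTERNAL: HTTPStatus.INTERNAL_SERVER_ERROR,
}


def status_for(kind: ErrorKind) -> int:
    """Return the HTTP status code for an error kind."""
    return int(_STATUS[kind])


def is_recoverable_for_task(status: int) -> bool:
    """A task can read 404 as "the collection is gone" and move on,
    where an opaque error would leave it failing on startup."""
    return status == HTTPStatus.NOT_FOUND
```

The point of a distinct 404 is that the caller can branch on it; a generic error gives the task nothing to distinguish a deleted collection from a real failure.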

Also refactor authorize_task() and authorize_dekaf() to validate the request token against its declared data-plane before doing anything else. This ensures the requestor is in possession of a correct data-plane HMAC key before we could possibly leak any information about whether the subject task exists or not.
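
The ordering described above can be sketched roughly as follows. This is a simplified stand-in, not the agent's implementation: the in-memory tables, the bare HMAC-SHA256 signature over a payload, the `sign` helper, and the specific status codes are all assumptions made for illustration, and the real request tokens are presumably shaped differently.

```python
import hashlib
import hmac

# Hypothetical in-memory state standing in for control-plane lookups.
DATA_PLANE_KEYS = {"dp-1": b"secret-key-1"}
KNOWN_TASKS = {"acmeCo/my-capture"}


def sign(key: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the payload under the given key."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def authorize_task(data_plane: str, task: str, payload: bytes, signature: str) -> int:
    """Return an HTTP-like status code (assumed values)."""
    key = DATA_PLANE_KEYS.get(data_plane)
    # Step 1: verify the signature against the declared data-plane's key
    # before consulting any task state.
    if key is None or not hmac.compare_digest(sign(key, payload), signature):
        return 401
    # Step 2: only a verified requestor learns whether the task exists.
    if task not in KNOWN_TASKS:
        return 404
    return 200
```

With this ordering, a requestor holding the wrong key gets the same response whether or not the subject task exists, so the response reveals nothing about task existence.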

Tested all APIs except for authorize_dekaf on a local stack.

jgraettinger · 2025-03-13T04:55:22Z

Verified (on a local stack) that the following is fixed:

1. Create a capture with a collection.
2. Disable the capture, and then disable its collection binding.
3. Delete the collection.
4. Re-enable the capture (without enabling its binding).

Previously this would fail on startup due to an error obtaining an authorization to the deleted collection, in order to write recovered ACK intents.

jgraettinger · 2025-03-18T17:21:36Z

ping @jshearer

jshearer

Lots of potentially security-sensitive changes here. Bailing early if the request token is not valid against its declared data-plane is a great defense-in-depth measure. I wasn't able to find anything obviously wrong just reading over the diff.

I only 60% understand the consequences of the BlackHole stuff, but at a high level, returning signed request claims for a selector that will never select anything sounds like a clever solution, though I'm not totally clear what's wrong with returning a selector for a journal that doesn't exist, if the request is authorized. I would assume that you'll issue broker RPCs and get the normal "JOURNAL_NOT_FOUND" response 🤔 I guess it has to do with this, which I don't really understand:

> as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection

Cleaning up the status codes and responses all LGTM

Return considered status codes for most every error response.

Refactor authorize_task() and authorize_dekaf() to validate the request token against its declared data-plane before doing anything else. This ensures the requestor is in possession of a correct data-plane HMAC key before we could possibly leak any information about whether the subject task exists or not.

Also rework authorize_task() to return a "black hole" token if a valid subject task is referencing an object journal which can't be found, either because its collection was deleted or its generation ID changed. A "black hole" token is a placeholder which allows the task to operate without failure as it awaits a control-plane update, and also lets it handle recovery scenarios where checkpoint ACK intents reference a since-deleted journal.
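
One way to picture a "black hole" token is as claims whose label selector can never match a journal. The sketch below is an assumption-laden illustration and not the agent's code: the `example/black-hole` label, the `authorize_selector` and `list_journals` functions, and the dictionary-based catalog are all hypothetical.

```python
# Hypothetical label that no real journal carries, so the selector matches nothing.
BLACK_HOLE_SELECTOR = {"example/black-hole": "true"}


def authorize_selector(collection, generation, catalog):
    """Return a label selector for the journals of `collection`.

    `catalog` maps live collection names to their current generation ID.
    """
    if catalog.get(collection) != generation:
        # Collection was deleted, or the task references a stale generation.
        return dict(BLACK_HOLE_SELECTOR)
    return {"collection": collection, "generation": generation}


def list_journals(selector, journals):
    """Return names of journals whose labels satisfy every selector entry."""
    return [
        name
        for name, labels in journals.items()
        if all(labels.get(k) == v for k, v in selector.items())
    ]
```

Under these assumptions, a task holding a black hole selector can still "list" its journals and simply gets an empty result, so it keeps running while it awaits an updated spec.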

jgraettinger · 2025-03-19T23:04:30Z

Would you please sanity check authorize_dekaf on a local stack? I'm not set up to easily run it. I've verified all other APIs against a local stack.

> I only 60% understand the consequences of the BlackHole stuff, but at a high level, returning signed request claims for a selector that will never select anything sounds like a clever solution, though I'm not totally clear what's wrong with returning a selector for a journal that doesn't exist, if the request is authorized.

There's no authorization concern with black hole tokens -- the task is authorized -- but there are a few reasons for returning such a token:

- During recovery, when writing recovered ACK intents to the indicated journals, we have handling for a regular Append RPC JOURNAL_NOT_FOUND but not for an authorization error when getting a token to make that Append. Returning a token lets the existing logic, which discards ACK intents of deleted journals, kick in.
- We don't want to allow a stale task to Apply a new partition of a since-deleted collection (or a stale generation of a collection).
  - We do want it to be able to "list" that collection though (where listing is a no-op), so it doesn't fail while processing other collections as it awaits a restart with an updated task spec.
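
The first reason above can be sketched as follows: once authorization succeeds, the Append itself fails with a not-found error, and recovery can discard that intent and carry on. This is a simplified model under stated assumptions; `JournalNotFound`, `append`, and `write_ack_intents` are hypothetical names, and the real recovery logic lives in the task runtime.

```python
class JournalNotFound(Exception):
    """Stand-in for an Append RPC that returns JOURNAL_NOT_FOUND."""


def append(journals, journal, data):
    """Append `data` to `journal`, or raise if the journal doesn't exist."""
    if journal not in journals:
        raise JournalNotFound(journal)
    journals[journal].append(data)


def write_ack_intents(journals, intents):
    """Write recovered ACK intents, discarding those for deleted journals.

    Returns the names of journals whose intents were discarded.
    """
    discarded = []
    for journal, ack in intents.items():
        try:
            append(journals, journal, ack)
        except JournalNotFound:
            # The journal was deleted: drop the intent and keep recovering.
            discarded.append(journal)
    return discarded
```

If obtaining the token had failed instead, the loop would never reach the `except` branch, which is the failure on startup that this PR addresses.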

jgraettinger requested a review from jshearer March 13, 2025 04:53

jgraettinger force-pushed the johnny/api-status-codes branch from e315ee6 to bb3e5c7 Compare March 14, 2025 20:46

jshearer approved these changes Mar 18, 2025

jgraettinger force-pushed the johnny/api-status-codes branch from bb3e5c7 to 280a77c Compare March 19, 2025 22:58
