Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: update all APIs to use ApiErrorExt #2006

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Mar 13, 2025

Return considered status codes for most every error response.

Notably, return 404 if a valid task is attempting to authorize a collection which doesn't exist, as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection.

Also refactor authorize_task() and authorize_dekaf() to validate the request token against its declared data-plane before doing anything else. This ensures the requestor is in possession of a correct data-plane HMAC key before we could possibly leak any information about whether the subject task exists or not.

Tested all APIs except for authorize_dekaf on a local stack.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@jgraettinger jgraettinger requested a review from jshearer March 13, 2025 04:53
@jgraettinger
Copy link
Member Author

Verified (on a local stack) that the following is fixed:

  • Create a capture with a collection
  • Disable the capture, and then disable its collection binding.
  • Delete the collection.
  • Re-enable the capture (without enabling its binding).

Previously this would fail on startup due to an error obtaining an authorization to the deleted collection, in order to write recovered ACK intents.

@jgraettinger jgraettinger force-pushed the johnny/api-status-codes branch from e315ee6 to bb3e5c7 Compare March 14, 2025 20:46
@jgraettinger
Copy link
Member Author

ping @jshearer

Copy link
Contributor

@jshearer jshearer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of potentially security sensitive changes here. Bailing early if the request token is not valid against its declared data-plane is a great defense-in-depth measure. I wasn't able to find anything obviously wrong just reading over the diff.

I only 60% understand the consequences of the BlackHole stuff, but at a high level returning a signed request claims but for a selector that will never select anything sounds like a clever solution, though I'm not totally clear what's wrong with returning a selector for a journal that doesn't exist, if the request is authorized. I would assume that you'll issue broker RPCs and get the normal "JOURNAL_NOT_FOUND" response 🤔 I guess it has to do with this, which I don't really understand:

as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection

Cleaning up the status codes and responses all LGTM

Return considered status codes for most every error response.

Refactor authorize_task() and authorize_dekaf() to validate the
request token against its declared data-plane before doing anything
else. This ensures the requestor is in possession of a correct
data-plane HMAC key before we could possibly leak any information
about whether the subject task exists or not.

Also rework authorize_task() to return a "black hole" token if a
valid subject task is referencing an object journal which can't be
found, either because its collection was deleted or its generation ID
changed. A "black hole" token is a placeholder which allows the task to
operate without failure as it awaits a control-plane update, and also
lets it handle recovery scenarios where checkpoint ACK intents reference
a since-deleted journal.
@jgraettinger jgraettinger force-pushed the johnny/api-status-codes branch from bb3e5c7 to 280a77c Compare March 19, 2025 22:58
@jgraettinger
Copy link
Member Author

Would you please sanity check authorize_dekaf on a local stack? I'm not set up to easily run it. I've verified all other APIs against a local stack.

I only 60% understand the consequences of the BlackHole stuff, but at a high level returning a signed request claims but for a selector that will never select anything sounds like a clever solution, though I'm not totally clear what's wrong with returning a selector for a journal that doesn't exist, if the request is authorized.

There's no authorization concern with black hole tokens -- the task is authorized -- but there are few reasons for returning such a token:

  • During recovery, when writing recovered ACK intents to indicated journals, we have handling for a regular Append RPC JOURNAL_NOT_FOUND but not for an authorization error getting a token to make that Append. So, this allows existing logic to discard ACK intents of deleted journals to kick in.

  • We don't want to allow a stale task to Apply a new partition of a since-deleted collection (or stale generation of a collection).

    • We do want it to be able to "list" that collection though (where listing is a no-op), so it doesn't fail while processing other collections while awaiting a restart with an updated task spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants