-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
agent: update all APIs to use ApiErrorExt #2006
base: master
Are you sure you want to change the base?
Conversation
Verified (on a local stack) that the following is fixed:
Previously this would fail on startup due to an error obtaining an authorization to the deleted collection, in order to write recovered ACK intents. |
e315ee6
to
bb3e5c7
Compare
ping @jshearer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of potentially security sensitive changes here. Bailing early if the request token is not valid against its declared data-plane is a great defense-in-depth measure. I wasn't able to find anything obviously wrong just reading over the diff.
I only 60% understand the consequences of the BlackHole
stuff, but at a high level returning a signed request claims but for a selector that will never select anything sounds like a clever solution, though I'm not totally clear what's wrong with returning a selector for a journal that doesn't exist, if the request is authorized. I would assume that you'll issue broker RPCs and get the normal "JOURNAL_NOT_FOUND" response 🤔 I guess it has to do with this, which I don't really understand:
as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection
Cleaning up the status codes and responses all LGTM
Return considered status codes for most every error response. Refactor authorize_task() and authorize_dekaf() to validate the request token against its declared data-plane before doing anything else. This ensures the requestor is in possession of a correct data-plane HMAC key before we could possibly leak any information about whether the subject task exists or not. Also rework authorize_task() to return a "black hole" token if a valid subject task is referencing an object journal which can't be found, either because its collection was deleted or its generation ID changed. A "black hole" token is a placeholder which allows the task to operate without failure as it awaits a control-plane update, and also lets it handle recovery scenarios where checkpoint ACK intents reference a since-deleted journal.
bb3e5c7
to
280a77c
Compare
Would you please sanity check authorize_dekaf on a local stack? I'm not set up to easily run it. I've verified all other APIs against a local stack.
There's no authorization concern with black hole tokens -- the task is authorized -- but there are few reasons for returning such a token:
|
Return considered status codes for most every error response.
Notably, return 404 if a valid task is attempting to authorize a collection which doesn't exist, as this lets the task recover if it's attempting to write ACK intents to a since-deleted collection.
Also refactor authorize_task() and authorize_dekaf() to validate the request token against its declared data-plane before doing anything else. This ensures the requestor is in possession of a correct data-plane HMAC key before we could possibly leak any information about whether the subject task exists or not.
Tested all APIs except for authorize_dekaf on a local stack.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is