
Conversation

robobario
Member

No description provided.

Signed-off-by: Robert Young <[email protected]>
@robobario robobario requested a review from a team as a code owner August 27, 2025 23:12
Comment on lines 122 to 123
2. Support RSA and HSM-RSA key types, wrapping using `RSA-OAEP-256` but emit a warning that it is not quantum-resistant.
3. Support HSM-AES key type and AES-GCM wrapping
Member Author


On this: I think the EDEK should encode which key type we used, so that we can use the appropriate algorithm at key-unwrapping time (which will be 1:1 with whether the key type is RSA or AES).

Member


Curious, would querying the key's type also be an option?

Member Author


Yes, you can discover the key type using the key name via the Key Vault APIs.

In my prototype I'm using this to determine which algorithm to use (and fail fast if the key is an unsupported type) at resolveAlias time.

Yeah, maybe it's not such a big deal to make the extra request as the unwrapped DEK will be cached.
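For illustration, a minimal sketch of that resolveAlias-time selection, assuming the JsonWebKey `kty` values from the Key Vault REST API (the class name is hypothetical, and the AES-GCM algorithm identifier is my assumption):

    final class AzureWrapAlgorithms {
        // Map the key's type (JsonWebKey "kty") to the wrap/unwrap algorithm,
        // failing fast on key types we don't support.
        static String wrappingAlgorithmFor(String kty) {
            return switch (kty) {
                case "RSA", "RSA-HSM" -> "RSA-OAEP-256"; // software or HSM-backed RSA
                case "oct-HSM" -> "A256GCM";             // AES key, Managed HSM only
                default -> throw new IllegalArgumentException("Unsupported key type: " + kty);
            };
        }
    }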

Member


> fail fast if the key is an unsupported type

I'd do that. We ought to do that in the other KMS implementations too :)

Member Author


Another thing you can discover up front is whether unwrapKey and wrapKey are supported by the key; users are able to do things like create an RSA key that doesn't support unwrap/wrap.

And you can determine whether the key is enabled.

@k-wall
Member

k-wall commented Aug 28, 2025

Looks like it is heading in a good direction.

@robobario
Member Author

robobario commented Sep 4, 2025

Something new that cropped up while working on the integration tests:

The Azure key name restrictions are:

> The name must be a 1-127 character string, containing only 0-9, a-z, A-Z, and -.

Our tests so far have expected that a kek selector of `$(topicName)` will function if the topic has underscores in its name.

What should we do? Some ideas:

  1. Add an expression like `$(azureNormalizedTopicName)`, implying we will convert underscores to hyphens. We then make the Azure KMS fail fast if it receives an invalid key name.
  2. Add a new TopicNameBasedKekSelector implementation that can do character replacements like this, with configurable replacements and a delegate plugin (or extend the current implementation to also enable character replacements).
  3. Add opt-in configuration like `normalizeKeyNames: true` to the KMS config, so if the kek selector picks a key name with underscores in it, we convert them to hyphens. I've done this in the prototype, just without the opt-in feature, to get the ITs passing.
  4. Rely on docs/the user to know when a topic name is incompatible with key naming? Per-topic keying seems like an unlikely setup in production, with the cloud providers all charging roughly a dollar per key or per key rotation.

Options 1, 2 & 3 make it possible to keep the existing integration suite passing; for 3 we would have to make some changes to the IT setup.

With option 3, I don't really like that the thing picked by the kekSelector isn't a name in Azure; its job is to pick an alias corresponding to something upstream. With option 1, it's a bit annoying to surface Azure specifics in the RecordEncryption config.
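To make options 1 and 3 concrete, a minimal sketch of the normalization and fail-fast validation, assuming the naming rule quoted above (class and method names are hypothetical):

    import java.util.regex.Pattern;

    final class AzureKeyNames {
        // Azure Key Vault object names: 1-127 characters from [0-9a-zA-Z-].
        private static final Pattern VALID = Pattern.compile("[0-9a-zA-Z-]{1,127}");

        // Options 1/3 style normalization: convert underscores (legal in topic
        // names) to hyphens.
        static String normalize(String topicName) {
            return topicName.replace('_', '-');
        }

        // Fail fast in the KMS if the selected name can never be an Azure key name.
        static String requireValid(String keyName) {
            if (!VALID.matcher(keyName).matches()) {
                throw new IllegalArgumentException("Not a legal Azure key name: " + keyName);
            }
            return keyName;
        }
    }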

@k-wall
Member

k-wall commented Sep 11, 2025

> Something new that cropped up while working on the integration tests: […] What should we do? Some ideas: […]

I'm tempted to say, let's just document this wrinkle for now and create an issue for it. Adapt any tests to cope with the tighter rules. TopicNameBasedKekSelector is good enough that it allows a user to try out Azure KMS.

I think we know that TopicNameBasedKekSelector isn't really what we want. Tying the topic name to the key name is inflexible. The knowledge about which key should be used to encrypt which topic belongs somewhere else, but I don't think we have a clear idea exactly where. Aside: I did wonder if we could have a KekSelector that is driven from metadata (tags) on the keys themselves.

@robobario
Member Author

robobario commented Sep 12, 2025

> driven from metadata (tags) on the keys themselves.

Yeah, this dovetails with the Label API discussions: whether we should make it possible to label entities that are completely unrelated to Kafka, or that we haven't anticipated yet.

> I'm tempted to say, let's just document this wrinkle for now and create an issue for it. Adapt any tests to cope with the tighter rules. TopicNameBasedKekSelector is good enough that it allows a user to try out Azure KMS.

Sounds good to me. I wonder what a clean way to handle the test updates will be. We are depending on the topic names produced by the test extension. I guess we could move topic creation into the tests. Or maybe we could let the test influence it more with a naming strategy, something like `@TopicNamingStrategy(ALPHA_NUMERIC)`, where the default is the current behaviour.

edit: I've implemented something like that here kroxylicious/kroxylicious-junit5-extension#517
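Roughly this shape, purely as a sketch of the annotation idea (all names are hypothetical; the linked PR has the real implementation):

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Hypothetical extension annotation: tests opt in to Azure-safe topic names.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ ElementType.TYPE, ElementType.METHOD })
    @interface TopicNamingStrategy {
        Strategy value() default Strategy.DEFAULT;

        enum Strategy {
            DEFAULT,       // the extension's current naming behaviour
            ALPHA_NUMERIC  // restrict generated names to [0-9a-zA-Z-]
        }
    }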

@k-wall
Member

k-wall commented Sep 15, 2025

Have you looked at how we'll ensure that the filter operates with least privilege? In the case of the other KMSes, we've documented what privileges the filter needs and given guidance about how to achieve that. It would be good to include the approach we'll take to least privilege in the proposal.

@k-wall
Member

k-wall commented Sep 15, 2025

Thanks @robobario. Almost there, I think.

Can I ask you to restructure to use a Rejected Alternatives section and pull down the detail about things we are not doing to that section? This might seem pedantic, but I think there is value in driving towards a consistent approach for proposals. Also remember that a contributor wishing to implement support for KMS X is likely to clone this proposal as a starting point, so let's make sure it is a great example to follow.

@k-wall k-wall self-requested a review September 15, 2025 10:27
@k-wall
Member

k-wall commented Sep 15, 2025

I'm also wondering if it is wise to give the code the responsibility to emit the quantum warning. I think it might be better if we made it a responsibility of the documentation. PQC is an area of active research and the situation could change. The warnings (or lack of them) baked into code could mislead users. We can always update documentation to change PQC statements.

@robobario
Member Author

robobario commented Sep 17, 2025

Keen for you to have a look too, @tombentley. The bit that gives me most pause is supporting RSA: without it, users have no way to use the affordable tiers of Key Vault (which I suspect are commonly used). The product docs endorse it for this key-wrapping use case, but it doesn't have the quantum-resistant nod.

We will support AES keys with AES-GCM wrapping, but it's only available on the hellaciously expensive Managed HSM.

@k-wall
Member

k-wall commented Sep 18, 2025

> Have you looked at how we'll ensure that the filter operates with least privilege? In the case of the other KMSes, we've documented what privileges the filter needs and given guidance about how to achieve that. It would be good to include the approach we'll take to least privilege in the proposal.

@robobario did you have a chance to think about this?

@robobario
Member Author

> @robobario did you have a chance to think about this?

Will think about how to document it; in my exploration I used:

    accessPolicies: [
      {
        tenantId: tenantId
        objectId: appObjectId // object id of the filter's service principal
        permissions: {
          keys: [
            'get'      // read key metadata: type, latest version, enabled state
            'wrapKey'
            'unwrapKey'
          ]
        }
      }
    ]

as the policy to give minimum privileges, then, when creating the key, enabled only the minimum operations the key needs:

    keyOps: [
      'wrapKey'
      'unwrapKey'
    ]

@robobario robobario moved this to In Progress in 2025_Q3 Sep 19, 2025
Member

@tombentley tombentley left a comment


Thanks @robobario, I had a few questions, but overall this is a great effort and I think it's nearly there.


The API usually returns it as a `kid` (https://learn.microsoft.com/en-us/azure/key-vault/general/about-keys-secrets-certificates#object-identifiers), like:

> For Vaults: https://{vault-name}.vault.azure.net/{object-type}/{object-name}/{object-version}
Member


Let's be explicit about which parts of this can vary and how. I've not read the docs, but I'm imagining that:

  • {object-name} will obviously vary
  • {object-version} will obviously vary
  • {object-type} is/may always be keys
  • {vault-name} ??

Let's consider too the point you made above:

> Users can also install it on-premise by negotiation.

I guess that would imply that the host name cannot be assumed to end with .vault.azure.net. That means that if we are going to "compress" the start of the URL it needs to be uncompressed by prefixing with the configured keyVaultBaseUri. But doing this comes with a commitment: that the keyVaultBaseUri will never change between when any given record is encrypted and when it's decrypted. If it does change, then there is a period of time during which it's ambiguous how to uncompress a compressed key id (with the old URI or the new URI).

Can we assume that keys will absolutely never migrate between instances? If this is possible then again there are potential issues for users. They would need the key ids using the old domain name to continue to work for the lifetime of those records. We developers cannot put an upper bound on that.

Member Author


> I guess that would imply that the host name cannot be assumed to end with .vault.azure.net.

Correct, it could also differ for other national/sovereign clouds like China's Azure instance, or the US government instance.

> {vault-name} ??

This is the user-defined name for their Key Vault.

> That means that if we are going to "compress" the start of the URL it needs to be uncompressed by prefixing with the configured keyVaultBaseUri

Correct, that's what is proposed.

Some relevant capabilities:

Key Vaults can be backed up and restored within an Azure Geography and Subscription (docs), meaning that within a geo region and subscription they could clone a key/version to a new {vault-name}; potentially the objects can exist in both vaults simultaneously.

Contrasting with AWS: there is no intermediate layer similar to the named Key Vaults; in AWS you only work with keys. You can't back up and restore keys, but AWS has a concept of multi-region keys to use keys across regions. In the AWS EDEK, we encode just the key id, which looks to have some limitations: kroxylicious/kroxylicious#1217.

So maybe we should record more details in the records, like the vault-name? That would let us support the user moving to a new key vault and keys, while continuing to decrypt from the old vault-name. To support a migration of existing keys to a new vault-name, maybe we could do it by configuration: the user could configure something like `alias(from=my-old-kv, to=my-new-kv)` and we'd handle the substitution.

There are only a very limited number of public clouds, so maybe we could get some compression there by using an enum to record which public cloud is in use and infer the host from that. Else, store the whole thing.

    for cloud in $(az cloud list --query "[].name" -o tsv); do
      endpoint=$(az cloud show --name $cloud --query suffixes.keyvaultDns -o tsv)
      echo "$cloud: $endpoint"
    done
    AzureCloud: .vault.azure.net
    AzureChinaCloud: .vault.azure.cn
    AzureUSGovernment: .vault.usgovcloudapi.net
    AzureGermanCloud: .vault.microsoftazure.de
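If we did take the enum route, it might look roughly like this (the wire ids are invented; the DNS suffixes are from the `az cloud` output above):

    // Sketch of the "enum compression" idea: one byte identifies a known public
    // cloud; an unknown host would fall back to storing the full hostname.
    enum AzureCloud {
        PUBLIC((byte) 0, ".vault.azure.net"),
        CHINA((byte) 1, ".vault.azure.cn"),
        US_GOVERNMENT((byte) 2, ".vault.usgovcloudapi.net"),
        GERMANY((byte) 3, ".vault.microsoftazure.de");

        final byte wireId;
        final String keyVaultDnsSuffix;

        AzureCloud(byte wireId, String keyVaultDnsSuffix) {
            this.wireId = wireId;
            this.keyVaultDnsSuffix = keyVaultDnsSuffix;
        }
    }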

Member Author


I think we should serialize the object name, to enable the user to move from one key vault to another within the same cloud. We would still operate on the assumption that there is a single Key Vault in use for encryption purposes, but support decryption from multiple.

It is theoretically possible that a user could have the same vault name in multiple national clouds, but it seems like an unlikely setup. Entra is also tied to a particular cloud, so we can't auth across multiple clouds with our one token. I'm wondering if we are better off saying we are limited to working with a single cloud and externalizing the vault host in proxy configuration.

Are there other considerations? Like is it good to have the vault hostname in the record as some kind of record about where the key originated, even if it's taking up some bytes?

So some options are:

  1. We serialize the object name and vault hostname as 2 fields in the serialized EDEK. We don't attempt to compress the {object-name} or vault hostname.
  2. The same as option 1, except we have some smarts to compress the vault hostname if it's a known public hosted vault address, as above.
  3. We serialize only the {object-name} and use a vault hostname from proxy configuration. Initially we would support only a single fixed vault hostname for all records, but we could in future enable users to configure a hostname per vault object name.

Member Author

@robobario robobario Sep 25, 2025


Have updated the proposal with option 3. The KMS will only support operating in the context of a single cloud for now. We could always evolve towards storing the vault host later if we had some request to support operating across national clouds.

Maybe multi-tenant is more likely, where an organization has some Keys under one Azure tenant and some under another tenant. Either way I think we should wait for a request first.

Member


I think option 3 is the right approach for now. I agree, waiting for user requirements is the right way.

Member Author

@robobario robobario Sep 26, 2025


One more thing here: I think we should also serialize a byte for the key type (which implies the wrapping algorithm). This means at decrypt time we will have everything we need to make the unwrap request, and won't need to fetch the key to get its key type. It feels like a sensible thing to encode anyway, as a record of how it was wrapped.
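For illustration only, the serialized EDEK might then look something like this; the field order, length encodings, and key-type byte values are all invented here, and the real format may also need a key-version field:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    final class AzureEdekCodec {
        // Illustrative layout: [keyType:1][nameLen:2][object-name:utf8][wrapped DEK]
        static byte[] serialize(byte keyType, String objectName, byte[] wrappedDek) {
            byte[] name = objectName.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocate(1 + 2 + name.length + wrappedDek.length);
            buf.put(keyType);                  // implies the unwrap algorithm
            buf.putShort((short) name.length); // {object-name} is resolved against the
            buf.put(name);                     // configured vault host at decrypt time
            buf.put(wrappedDek);               // remainder of the buffer is the EDEK
            return buf.array();
        }
    }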

Member Author


Have pushed that up.

@k-wall
Member

k-wall commented Sep 19, 2025

> Will think about how to document it; in my exploration I used: […] as the policy to give minimum privileges, then, when creating the key, enabled only the minimum operations the key needs: […]

I don't think we need more details in the proposal beyond saying what the minimum permissions are that the proxy will need to operate. Out of curiosity, if the KMS sniffs the key to check its type, does it need `get`?

Member

@k-wall k-wall left a comment


Really thorough job @robobario. This doc will be a great foundation for future KMS proposals.

@k-wall
Member

k-wall commented Sep 19, 2025

This should probably be proposal 009

@k-wall k-wall changed the title Add azure kms proposal 010 - Add azure kms proposal Sep 19, 2025
@k-wall k-wall changed the title 010 - Add azure kms proposal 009 - Add azure kms proposal Sep 19, 2025
@robobario
Member Author

robobario commented Sep 22, 2025

> Out of curiosity, if the KMS sniffs the key to check its type, does it need `get`?

Yes, I believe that's what the `get` action covers, letting us get the key metadata. We also need it to get the latest key version, because the wrap operation appears to require the keyVersion: https://learn.microsoft.com/en-us/rest/api/keyvault/keys/wrap-key/wrap-key?view=rest-keyvault-keys-7.4&tabs=HTTP

I've realised this `get`, `unwrapKey`, `wrapKey` style is the outdated Key Vault access-policy authorization model; under RBAC I think the relevant actions are:

    Microsoft.KeyVault/vaults/keys/read
    Microsoft.KeyVault/vaults/keys/wrap/action
    Microsoft.KeyVault/vaults/keys/unwrap/action

Will try it out.

edit: there's actually a built-in role which contains exactly those permissions:

Key Vault Crypto Service Encryption User: Read metadata of keys and perform wrap/unwrap operations. Only works for key vaults that use the 'Azure role-based access control' permission model. (role id: e147488a-f6f5-4113-8e2d-b22465e65bf6)

edit:

I've confirmed it works with RBAC configured with these permissions. I also found that the API docs for wrap are incorrect: it works even if you don't supply the keyVersion (it uses the latest version). So maybe we could technically avoid fetching the key metadata if the user configured the wrapping algorithm explicitly; the wrap response includes the keyId. But using the key metadata also allows us to check whether the operations are supported by the key and whether it's enabled, and generate more actionable error messages.
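For reference, a sketch of that wrap call with the JDK HTTP client; the endpoint shape follows the REST docs linked above, and as noted the {key-version} path segment can be omitted (token acquisition, response handling, and JSON encoding are elided or simplified):

    import java.net.URI;
    import java.net.http.HttpRequest;
    import java.util.Base64;

    final class WrapKeyRequests {
        // POST {vaultBaseUrl}/keys/{key-name}/wrapkey?api-version=7.4
        // Body: {"alg": "...", "value": "<base64url DEK>"}. Omitting the
        // {key-version} segment wraps with the latest version.
        static HttpRequest wrap(URI vaultBaseUri, String keyName, String alg,
                                byte[] dek, String bearerToken) {
            String value = Base64.getUrlEncoder().withoutPadding().encodeToString(dek);
            String body = "{\"alg\":\"" + alg + "\",\"value\":\"" + value + "\"}";
            return HttpRequest.newBuilder(
                            vaultBaseUri.resolve("/keys/" + keyName + "/wrapkey?api-version=7.4"))
                    .header("Authorization", "Bearer " + bearerToken)
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
        }
    }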

robobario and others added 5 commits September 22, 2025 12:53
Co-authored-by: Tom Bentley <[email protected]>
Signed-off-by: Robert Young <[email protected]>
Co-authored-by: Tom Bentley <[email protected]>
Signed-off-by: Robert Young <[email protected]>
Signed-off-by: Robert Young <[email protected]>
Signed-off-by: Robert Young <[email protected]>
Member

@k-wall k-wall left a comment


spelling nit but otherwise LGTM

Co-authored-by: Keith Wall <[email protected]>
Signed-off-by: Robert Young <[email protected]>
@k-wall
Member

k-wall commented Sep 25, 2025

> Also found that the API docs for wrap are incorrect: it works even if you don't supply the keyVersion (it uses the latest version). So maybe we could technically avoid fetching the key metadata if the user configured the wrapping algorithm explicitly.

Sounds like something worthwhile to flag up to Microsoft. If we can avoid an API call, I think that's worthwhile.

Member

@k-wall k-wall left a comment


LGTM

This enables us to execute the unwrap without having to describe the key
again to obtain its type. It also adds a useful record of how the key
was encrypted, with the downside that it costs a byte per record.

Signed-off-by: Robert Young <[email protected]>
@robobario
Member Author

robobario commented Oct 6, 2025

Nudge to @kroxylicious/developers: keen for a second approval before merging this.

The latest changes added the vault name and key type to the serialized EDEK written into each record, so that we have all the information we need to unwrap the key, without having to make a separate API call to describe the key and learn its type on the decrypt path.
