Flexible groups #61

evrardjp-cagip · 2025-01-10T14:49:30Z

Without this patch, we are filtering LDAP groups and take a decision on what to expose.

This is a problem, as it removes the flexibility of rolebindings later.

We intend to expose custom role bindings for extra services (for example Kong), which requires teams to be entering a different model. In that case, the platform team creates a new cluster role, but the grantee of the role might come from a deployment tool, hence outside the operator.

I fixed it by first exposing the groups directly in the token provider. For that, I had to clean the token provider first, then fixing a few existing code's panics, cleaning the http handlers, adding more tests, simplifying auth, removing useless code, streamlining ldap requests, fixing the config validation, exposing errors into the main loop instead of silently ignoring them, and remove dependencies to rely more on standard library.

Then I focused on getting those groups into the token reviewer.

Finally, I modified the operator to be able to use the new groups from the project's spec.

This PR took the opportunity to clean up some code to make it more idiomatic, simplify the code, improving its logging.

It's a big PR, so I suggest you to review the code one commit at a time.

evrardjp-cagip · 2025-01-13T19:51:16Z

This might need a rebase, but I would like to know where we are heading with prs as NONE of the previous ones were merged

Without this patch, we are filtering LDAP groups and take a decision on what to expose. This is a problem, as it removes the flexibility of rolebindings later. We intend to expose custom role bindings for extra services (for example Kong), which requires teams to be entering a different model. In that case, the platform team creates a new cluster role, but the grantee of the role might come from a deployment tool, hence outside the operator. I fixed it by first exposing the groups directly in the token provider. This means that further commits are required to filter properly + generate the right token groups directly.

Without this, the code was duplicated and the generation of claims was not very readable. For example, it contained steps that are part of the issuer initialisation. This is a problem, as it leads to difficult reviews and difficulty to iterate on the topic. This fixes it by creating a new constructor for the token issuer, which clarifies all what's necessary for it, and fatals if the requirements are not met. This makes sure the code does not error forever when the requirements are not met. On top of that, the signature of the token was separated, allowing easier testing. The testing has shown a lack of handling the errors in the JWT signature checks, which should be fixed in a later commit.

Without this, creating a token provider with no private key will still try to sign the token. This should never happen. However, acting on it, even on tests, means a generic panic, instead of a just an error. This handles the pointer dereferencing to ensure the code does not panic, as it should already fatal on main through the constructor.

Without this, the basic auth is included deep in the call stack. This is a problem, as it means multiple calls have the information about passwords, and need to carry useless data around. This fixes it by ensuring a new middleware for basicAuth was added, directly connecting to ldap. The user information is then filtered to keep only username, userDN, and email. It is then passed in a context for use in the next httpHandlers. At the same time, it allowed me to see that the GenerateConfig would not fail if the generation of the token results in an error. I fixed it by adding the same logic in GenerateConfig and in GenerateJWT http handlers.

Without this we will not be able to refactor GenerateJWT and GerateConfig. This further allows testability for the config generation.

This allows simplification of the code.

This makes it easier to reason around the getUser details.

Without this, the naming is a bit hard to follow, and the indentation is not really matching the usual go patterns. This fixes it to make the code more readable.

Without this, we validate the token format twice: one when doing parsewithclaims, one when we do our own validation. This serves no purposes, and introduces fragile code. This fixes it by removing the useless commits.

We use different but similiar constructor methods for the Has* calls. This is a problem, as it makes the ldap methods unnecessary harder to read. This fixes it by ensuring the code is more readable, and allowed us optimisations, like regrouping code to connect, query, and return results with our defaults.

Without this, one might expect to have groups[] to be always populated, and passed to JWTClaims generation. However, this is not the case, as the groups are empty if the user does not have direct access rights to the cluster. This is a problem, as it will prevent further evolution of the group management. To fix this, I made sure that _ALL THE GROUPS_, including the special groups (appops, customerops, cloudops, containerops) have their groups fetched and generated in the JWT. I also moved the code from ldap to project.go, as none of the code was actually relevant from ldap perspective: All the code was manipulating project objects. Tests were added to ensure the behaviour was intact. The code also took the opportunity to remove incorrectly exported functions back to internal functions (and fixing their tests).

Now that authprovider is merely doing ldap functions (it was already doing that), be explicit and call the package ldap.

Presenting the CA is a very small endpoint, lost in the middle of the "services" internal package. This is a problem, as it makes it annoying to find. On top of that, it needed to be passed global variables instead of having direct access to config data. This fixes it by making sure this endpoint (only used once) is directly readable from the main, as it's a two-liner.

Services does not really means what this code does. This is a problem, as it makes the debugging tedious for a new contributor. This fixes it by moving the middlewares to their own package.

Without this, the pointer to types.User is passed to the context. This is a problem, as we do interface conversion. I fixed it by ensuring we pass values instead of changing the expected type at dest.

Without this, one might wonder where the constants are used. One might even think that Service account is used to provision some parts of kubi, where in fact it is only used for auth. This is a problem, as it could lead to misunderstandings in the code. In other words, I believe that moving the constants will make it more explicit about their scope and maintenance. Therefore it makes clear that the constants moved here are ONLY usable for auth. To avoid issues, this removes the constant from generic "utils" use.

Without this, the helpers become a big mess of functions that are only partially used or are inferior in implementation to what you can find in other parts of the code. This fixes it by: - regroup the config parsing into config.go - inlining functions with little use, which can be replaced by a few idiomatic golang calls. - temporarily moving some implementations until the inlining is safe to implement (needs test coverage) - removing duplicate calls: ldap host parsing in the config, fatal + os.exit(1)...

This commit tidies the go.mod after cleaning the code removing the usage of ozzo validation.

the createAccessToken method is bubbling up errors, but we never show them. This is a problem, as it prevents observability of the errors. This fixes it by logging the issues at error level.

Without this, much code is relying on global variables, or to state in the different functions. This is a problem, as it removes the testability, and code is hard to refactor. This fixes it by slowly getting away from spaghetti code, starting by refactoring the data structures. This prevents leakage of data. It will also allow us later to get rid of the global variables. This makes the code more idiomatic, and split the internal packages based on what they should provide (and how can they interface). Therefore, the code should be more readable in the long run.

The code of the operator was in a massive single file. To improve readability, this was split into multiple files. It also took the liberty to merge the anonymous function with the handler doing approximatively the same work, as none of those were tested.

A few calls are long duplicates. Without this patch, the code for netpols, rolebindings, and projects is duplicated in a few methods. This is a problem, as it makes the code harder to read, and harder to maintain. This fixes it by refactoring the calls, making them more testable (although without adding tests), and making them simpler to read.

While those are not technically necessary, they have helped me in my refactors. Instead of throwing them away, I will keep them here. If necessary, delete them, as they were refactor unit tests.

Without this, we will rely on our legacy group names. This is a problem, as it means we do not gain the flexibility brought by the previous "add group" feature automatically in each project. This fixes it by ensuring the rolebinding adds a new line for the users of the group.

No reason to use it in utils if it is not used in a global way.

The kubeconfig generation does not need to use global variables, if they are passed from issuer/global CA. This simplifies the code further to improve testability.

If the public key is nil, the key validation function from ParseWithClaims should NOT return the issuer public Key. The code snippet checks if the EcdsaPublic field of the issuer (an instance of TokenIssuer) is nil. The EcdsaPublic field holds the public key used to verify the token's signature. If this field is nil, it means that the public key is not available, and the method cannot proceed with the token verification. If the EcdsaPublic field is indeed nil, the method now returns nil and an error. This ensures that the caller of the VerifyToken method is informed about the missing public key, which is crucial for debugging and handling the error appropriately. This check is essential to prevent the method from attempting to verify the token without a valid public key, which would result in a runtime error or incorrect verification results.

Without this, the unittest is caring about the order of the list, which is not relevant for this test. We simply need to make sure we have the same group names in the different cases. This fixes the test by sorting them in advance.

Without this, the groups can only come from the pre-existing bases. This will be a problem, as certain teams (KONG) need to come from a completely different part of the org, and will not be allowed to log in. This fixes it by adding another base for query, which will be at higher level than the previous ones. This has an impact on performance, as LDAP will now query on broader domain groups. This is compensated by NOT doing multiple smaller queries later, but keeping all the results and filtering them in this code. Hence, a sorting/filtering is included in this code. The filtering of all those groups is made with a regexp, which MUST be passed in the config. A sorting of the existing special groups (project, appops, admins...) is provided for compatibility reasons. Instead of doing more queries, they are included based on string comparison of DNs.

After a demo with the team, I realised many people did not understand the value of the filtering, which was explictly asking the operator to fill the regexp correctly. This is a problem, as operators would be adding .* and granting access to everyone with a valid login. This changes the implementation to be more surgical: - The allowList now only includes the group names (CNs) instead of the whole DN. This is a risk in case of duplicated, but it brings readability. - The authorize part is now explicit in the token generation.

I do not fully trust our users to do the right thing and avoid to have the same group names in different parts of the organisation. If I rely on CN, it means a duplicate name could grant access incorrectly. This is a problem, and I do not want this refactor to be causing a security impact. Instead of fully rewriting the code OR ensuring that the org have no duplicate CN names, I make sure the regexp is matching on the full DN. I expect this should be cleaned in the future, as Kubernetes itself does not have this notion.

Without this, it is harder to follow the basic behaviour of some parts of the code. This is a problem for the following cases: - it is harder to debug login issues - we are losing the auditability of invalid users trying to login. - we do not see why the ldap group is not resulting in the creation of a project - we do not see areas for growth. This fixes it by bubbling up errors, or making some logs more explicit, without completely refactoring the logs.

without this, we do not handle ldap pagination. This is a problem, as large queries will return with LDAP Result Code 4 "size limit exceeded". This fixes it by handling the searchwithpaging with the controls from the results.

Without this, the code fails, because the search results exceeds the limit of 1 record.

Without this patch, the rolebindings for appops, customerops, ... are only using the old names, not the LDAP groups. This is a problem, as it will prevent us to clean up the rest of the code later. This fixes it by relying on the config of the group base to know which group to add in the rolebinding. In the future, it would be best to template those from a config map: it would help to not being directly interconnected to LDAP by having a simple config to template for the cluster. It would also be simpler to manage with a gitops tool.

There is no point to have multiple loggers and wrappers: - they need us to browse the code to realise what they do - they have 0 value compared to slog - they encouraged a laziness to not use a good structured output. This fixes it by removing zerolog and simply using slog. To do that, bubbling up errors made it easier to improve logging, as we now only handle only errors in a lot of functions, bubbling up errors to the functions in which the metrics and logs are generated. This made it also clear that the vector for the metrics were also in a weird location, while they could be near their usage for ease of understanding.

evrardjp-cagip changed the title ~~Flexible groups~~ [WIP] Flexible groups Jan 10, 2025

evrardjp-cagip force-pushed the flexible_groups branch 6 times, most recently from 181185b to 31ad7c9 Compare January 13, 2025 14:32

evrardjp-cagip changed the title ~~[WIP] Flexible groups~~ Flexible groups Jan 13, 2025

evrardjp-cagip changed the title ~~Flexible groups~~ [needs rebase, but reviewable] Flexible groups Jan 13, 2025

evrardjp-cagip added 20 commits January 17, 2025 12:50

Extract kubeConfig generation for futureproofing

4459deb

Without this we will not be able to refactor GenerateJWT and GerateConfig. This further allows testability for the config generation.

Use User type for Authenticate user

59f5ede

This allows simplification of the code.

Simplify Authentication flow

862740c

This makes it easier to reason around the getUser details.

Refactor code to be more idiomatic

3dc413f

Without this, the naming is a bit hard to follow, and the indentation is not really matching the usual go patterns. This fixes it to make the code more readable.

Do not double validate token

f35b93d

Without this, we validate the token format twice: one when doing parsewithclaims, one when we do our own validation. This serves no purposes, and introduces fragile code. This fixes it by removing the useless commits.

Rename authprovider

e9cc03b

Now that authprovider is merely doing ldap functions (it was already doing that), be explicit and call the package ldap.

Migrate middlewares outside the services folder

8751833

Services does not really means what this code does. This is a problem, as it makes the debugging tedious for a new contributor. This fixes it by moving the middlewares to their own package.

Pass User type in context, not its pointer

878bdb6

Without this, the pointer to types.User is passed to the context. This is a problem, as we do interface conversion. I fixed it by ensuring we pass values instead of changing the expected type at dest.

Remove ozzo validation

b5bcbc5

This commit tidies the go.mod after cleaning the code removing the usage of ozzo validation.

Extend logging for token issues

2cda1be

the createAccessToken method is bubbling up errors, but we never show them. This is a problem, as it prevents observability of the errors. This fixes it by logging the issues at error level.

evrardjp-cagip added 15 commits January 17, 2025 13:13

Split operator for readability

318957a

The code of the operator was in a massive single file. To improve readability, this was split into multiple files. It also took the liberty to merge the anonymous function with the handler doing approximatively the same work, as none of those were tested.

Add a few unit tests

02ebcee

While those are not technically necessary, they have helped me in my refactors. Instead of throwing them away, I will keep them here. If necessary, delete them, as they were refactor unit tests.

Cleanup TokenCounter

c1bedd5

No reason to use it in utils if it is not used in a global way.

Remove the global variables from kubeconfig

83fcdd9

The kubeconfig generation does not need to use global variables, if they are passed from issuer/global CA. This simplifies the code further to improve testability.

Make unittest more resilient

6644bdc

Without this, the unittest is caring about the order of the list, which is not relevant for this test. We simply need to make sure we have the same group names in the different cases. This fixes the test by sorting them in advance.

Handle ldap pagination

d39a8b7

without this, we do not handle ldap pagination. This is a problem, as large queries will return with LDAP Result Code 4 "size limit exceeded". This fixes it by handling the searchwithpaging with the controls from the results.

Remove sizeLimit on query groups.

b638118

Without this, the code fails, because the search results exceeds the limit of 1 record.

evrardjp-cagip force-pushed the flexible_groups branch from 3241aba to baba898 Compare January 17, 2025 12:29

This was referenced Jan 17, 2025

[DNM without first PR and rebase] Add groups into the rolebindings #60

Closed

Add "expose LDAP groups" feature #49

Closed

evrardjp-cagip changed the title ~~[needs rebase, but reviewable] Flexible groups~~ Flexible groups Jan 17, 2025

evrardjp-cagip force-pushed the flexible_groups branch from 076f86d to f35ae29 Compare January 17, 2025 16:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flexible groups #61

Flexible groups #61

evrardjp-cagip commented Jan 10, 2025 •

edited

Loading

evrardjp-cagip commented Jan 13, 2025

Flexible groups #61

Are you sure you want to change the base?

Flexible groups #61

Conversation

evrardjp-cagip commented Jan 10, 2025 • edited Loading

evrardjp-cagip commented Jan 13, 2025

evrardjp-cagip commented Jan 10, 2025 •

edited

Loading