*: Bump k8s.io and controller-runtime dependencies #10069

Merged
merged 15 commits into from
Sep 24, 2024

Conversation

timflannagan
Member

@timflannagan timflannagan commented Sep 19, 2024

Description

This PR bumps several critical Go direct dependencies including k8s.io, go-control-plane, and controller-runtime.

Code changes

Reverts solo-io#7920 in the kube-based upstream plugin. Previously, we were typecasting our own lister abstraction that sits on top of client-go into concrete corev1 listers. That approach was breaking our regression suite -- likely because upstream refactored lister generation to be generic in 1.31.x -- and required us to revert that PR in order to get the regression suite back online. From my perspective, this type of typecasting violates our own abstraction layer and any performance-related issues should've been tackled closer to the source-of-truth (e.g. solo-kit).
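To illustrate why type-asserting through the lister abstraction is fragile, here is a minimal sketch. All names (`PodLister`, `concreteLister`, `genericLister`) are hypothetical stand-ins and not the actual solo-kit or client-go types; the sketch only assumes that the concrete type behind the interface changed between versions.

```go
package main

import "fmt"

// PodLister is a hypothetical stand-in for an in-house lister abstraction.
type PodLister interface {
	List() []string
}

// concreteLister stands in for the concrete type an older library version returned.
type concreteLister struct{ names []string }

func (c *concreteLister) List() []string { return c.names }

// genericLister stands in for a generics-based replacement in a newer version.
type genericLister[T any] struct{ items []T }

func (g *genericLister[T]) List() []T { return g.items }

// listViaAssertion reaches through the abstraction and only works for one concrete type.
func listViaAssertion(l PodLister) ([]string, error) {
	c, ok := l.(*concreteLister)
	if !ok {
		return nil, fmt.Errorf("unexpected lister type %T", l)
	}
	return c.names, nil
}

// listViaInterface depends only on the abstraction, so it survives the swap.
func listViaInterface(l PodLister) []string { return l.List() }

func main() {
	var upgraded PodLister = &genericLister[string]{items: []string{"pod-a"}}
	if _, err := listViaAssertion(upgraded); err != nil {
		fmt.Println("assertion path:", err)
	}
	fmt.Println("interface path:", listViaInterface(upgraded))
}
```

The assertion path fails as soon as the implementation behind the interface changes, while the interface path keeps working, which is roughly the argument for keeping callers on the abstraction.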

Additionally, updates to pkg/bootstrap/leaderelector/kube/metrics.go were required after bumping controller-runtime to 0.18.x / 0.19.x to get builds back online.

CI changes

  • Update our k8s N-3 matrix from 1.29 to 1.31. Similarly, update our minimum k8s version from 1.25 to 1.27.
  • Update any unit tests that were comparing proto structs via To(Equal(...)) type operations. This is needed because bumping the protobuf dependency changed the default value of the underlying sizeCache field. Further investigation into why this is needed will be tackled as a follow-up.

Context

This is blocking several streams of downstream work including RFE and GW API integration initiatives.

Notes for reviewers

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
Related to solo-io#9683

timflannagan and others added 2 commits September 19, 2024 18:18

Signed-off-by: timflannagan <[email protected]>

Co-authored-by: Sam Heilbron <[email protected]>
Co-authored-by: Tyler Schade <[email protected]>
Signed-off-by: timflannagan <[email protected]>
@timflannagan timflannagan requested a review from a team as a code owner September 19, 2024 19:03
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
@solo-changelog-bot

Issues linked to changelog:
solo-io#9683

Signed-off-by: timflannagan <[email protected]>
Note: these clients were manually generated using a solo-kit
that points to my local fork that implements kgateway-dev#564.

The gateway & gloo clients were updated to adopt the recently added support
for generics in the 1.31 client-go release. Namely, listers
and clients adopt this new approach.

The nested extauth and graphql APIs have updated hack/update-codegen.sh
bash scripts checked in with this commit, but I think we need to update
the solo-kit.json configuration for those directories since we weren't
previously committing their k8s clients.

Similarly, the "gloosnapshot" API doesn't need k8s clients generated either.

Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
Signed-off-by: timflannagan <[email protected]>
-node_version='v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245'
-kubectl_version='v1.29.2'
-kind_version='v0.20.0'
+node_version='v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865'
Member Author

Quick note: go.mod specifies the 1.31.1 patch version, but I didn't see a 1.31.1 sha image in the kind releases, so I left this here. I don't think it matters too much w.r.t. patch version skew between k8s server and client versions.

@@ -1,6 +1,6 @@
-node_version='v1.25.16@sha256:5da57dfc290ac3599e775e63b8b6c49c0c85d3fec771cd7d55b45fae14b38d3b'
-kubectl_version='v1.25.16'
+node_version='v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72'
Member Author

We're jumping from 1.29 to 1.31 in Gloo, so updating the min supported k8s version to 1.27 to maintain the N-3 matrix.

@@ -0,0 +1,12 @@
changelog:
Member Author

TODO: Fix this changelog.

@@ -35,3 +35,7 @@ func (s *switchAdapter) On(name string) {
func (s *switchAdapter) Off(name string) {
	s.gauge.WithLabelValues(name).Set(0.0)
}

func (s *switchAdapter) SlowpathExercised(name string) {
Member Author

Needed for controller-runtime 0.18.x due to client-go leaderelection changes.
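For context on why the adapter needs the extra method: once a metrics interface gains a method, every existing implementation has to add it before the build compiles again. The sketch below uses a hypothetical `leaderMetric` interface and a map-backed `gauge` in place of the real client-go and Prometheus types, and the no-op body is an assumption rather than what this PR necessarily does.

```go
package main

import "fmt"

// leaderMetric is a hypothetical mirror of a metrics interface that gained a
// third method in a newer library release.
type leaderMetric interface {
	On(name string)
	Off(name string)
	SlowpathExercised(name string)
}

// gauge is a tiny map-backed stand-in for a labeled gauge metric.
type gauge map[string]float64

type switchAdapter struct{ g gauge }

func (s *switchAdapter) On(name string)  { s.g[name] = 1.0 }
func (s *switchAdapter) Off(name string) { s.g[name] = 0.0 }

// SlowpathExercised is a no-op in this sketch (an assumption); it exists so
// the adapter satisfies the widened interface.
func (s *switchAdapter) SlowpathExercised(name string) {}

// Compile-time check that the adapter still implements the interface.
var _ leaderMetric = (*switchAdapter)(nil)

func main() {
	a := &switchAdapter{g: gauge{}}
	a.On("gloo")
	a.SlowpathExercised("gloo")
	fmt.Println(a.g["gloo"])
}
```

The `var _ leaderMetric = (*switchAdapter)(nil)` line is a cheap way to surface this kind of breakage at compile time in the adapter's own package.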

@@ -10,10 +10,9 @@ ROOT_PKG=github.com/solo-io/gloo/projects/gateway/pkg/api/v1
CLIENT_PKG=${ROOT_PKG}/kube/client
APIS_PKG=${ROOT_PKG}/kube/apis

# Below code is copied from https://github.com/weaveworks/flagger/blob/master/hack/update-codegen.sh
Member Author

This file is generated by solo-kit. We updated the Go template file in solo-io/solo-kit#560. Note, this file is technically bugged and I have an open PR for fixing this in solo-io/solo-kit#564.

@@ -65,7 +66,7 @@ var _ = Describe("RetryOnUnavailableClientConstructor", func() {
// sanity check
resp, err := client.Validate(rootCtx, &validation.GlooValidationServiceRequest{})
Expect(err).NotTo(HaveOccurred())
-Expect(resp).To(Equal(res))
+Expect(resp).To(matchers.MatchProto(res))
Member Author

Needed due to the protobuf jump that was clumped with these dependency bumps.
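A rough illustration of why a deep-equality assertion can start failing after a protobuf bump: Gomega's `Equal` relies on `reflect.DeepEqual`, which also compares unexported bookkeeping fields. The `response` struct below is a hypothetical stand-in for a generated message, and `semanticEqual` only approximates what a proto-aware matcher is assumed to do.

```go
package main

import (
	"fmt"
	"reflect"
)

// response is a hypothetical stand-in for a generated proto struct: one
// user-visible field plus internal bookkeeping similar in spirit to sizeCache.
type response struct {
	Name      string
	sizeCache int32
}

// deepEqual compares every field, including unexported ones.
func deepEqual(a, b *response) bool { return reflect.DeepEqual(a, b) }

// semanticEqual compares only the user-visible content of the message.
func semanticEqual(a, b *response) bool { return a.Name == b.Name }

func main() {
	want := &response{Name: "ok"}
	// Assumption: some internal operation populated the cache on one side only.
	got := &response{Name: "ok", sizeCache: 4}

	fmt.Println("deep equality:    ", deepEqual(want, got))
	fmt.Println("semantic equality:", semanticEqual(want, got))
}
```

The two messages carry the same content, but only the semantic comparison treats them as equal, which is why swapping to a proto-aware matcher is the safer assertion here.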

Comment on lines +102 to +108
Controller: config.Controller{
	// see https://github.com/kubernetes-sigs/controller-runtime/issues/2937
	// in short, our tests reuse the same name (reasonably so) and the controller-runtime
	// package does not reset the stack of controller names between tests, so we disable
	// the name validation here.
	SkipNameValidation: ptr.To(true),
},
Member Author

The comment calls out why this is needed, but this is due to a recent c-r change that enforces stricter validation for controller names.
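As a simplified model of the behavior described here (not controller-runtime's actual implementation), the sketch below keeps a set of used controller names and rejects duplicates unless validation is skipped. The `registry` type and its `register` method are hypothetical.

```go
package main

import "fmt"

// registry is a hypothetical, simplified model of a process-wide set of
// controller names that is never reset between tests.
type registry struct{ used map[string]bool }

func newRegistry() *registry { return &registry{used: map[string]bool{}} }

// register rejects a duplicate name unless skipValidation is true.
func (r *registry) register(name string, skipValidation bool) error {
	if !skipValidation && r.used[name] {
		return fmt.Errorf("controller name %q already in use", name)
	}
	r.used[name] = true
	return nil
}

func main() {
	r := newRegistry()
	fmt.Println(r.register("gateway", false)) // first test registers the name
	fmt.Println(r.register("gateway", false)) // second test reuses it and is rejected
	fmt.Println(r.register("gateway", true))  // skipping validation allows the reuse
}
```

Because the set outlives each test in this model, a second test that reuses the same name only succeeds when validation is skipped.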

-"gen_kube_types": true,
+"gen_kube_types": false,
Member Author

Moved this to solo-io#10079. For context, solo-kit was generating hack/*-codegen.sh bash scripts for these nested directories that were relevant, so toggling this off / removing this option helps us manage maintenance.

Member Author

@timflannagan timflannagan Sep 20, 2024

Just a quick note on these new uniquehash.go files. This is needed as we had a bug in GME's AccessPolicy caching that required us to introduce a new primitive in this library. See https://github.com/solo-io/gloo-mesh-enterprise/pull/17392 for more information. We aren't using this new method, but still wanted to provide context on these generated files as we're bumping skv2.

Member Author

The diff in these client-gen generated files is a bit confusing. Basically, client-go had a series of improvements in 1.31 to help adopt generics and cut down on the amount of generated code for consumers of this library. The gentype package defines the common Get/List/etc. interfaces now.

Member Author

Similar to the above comment about client-go generated code, client-go refactored the listers implementation to adopt a generics-based approach. See kubernetes/kubernetes#121574 for more information.
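To give a feel for why the generated diff shrinks, here is a minimal sketch of a single generic lister serving several resource types. The `lister`, `upstream`, and `gateway` types are hypothetical and much simpler than the real client-go or solo-kit code.

```go
package main

import "fmt"

// named is the only behavior this sketch requires from a resource.
type named interface{ GetName() string }

type upstream struct{ Name string }

func (u upstream) GetName() string { return u.Name }

type gateway struct{ Name string }

func (g gateway) GetName() string { return g.Name }

// lister is a hypothetical generic lister; one definition replaces the
// per-type List/Get code that would otherwise be generated for each resource.
type lister[T named] struct{ items []T }

func (l lister[T]) List() []T { return l.items }

// Get returns the first item with the given name, or the zero value and false.
func (l lister[T]) Get(name string) (T, bool) {
	for _, it := range l.items {
		if it.GetName() == name {
			return it, true
		}
	}
	var zero T
	return zero, false
}

func main() {
	ups := lister[upstream]{items: []upstream{{Name: "default-petstore"}}}
	gws := lister[gateway]{items: []gateway{{Name: "gateway-proxy"}}}

	u, ok := ups.Get("default-petstore")
	fmt.Println(u.GetName(), ok, len(gws.List()))
}
```

With this shape, per-resource generated files mostly reduce to instantiating the generic type, so a regeneration against the newer generator touches many files but removes more code than it adds.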

Member Author

I wanted to highlight this change in the sea of generated code changes. The primary change here is the removal of typecasting to corev1 listers which was causing the regression suite to fail. IMO, doing this is a violation of our own lister abstraction (that manages corev1 listers under-the-hood) and any net new issues with performance regressions could be handled as a follow-up in solo-kit.

Comment on lines +54 to +56
EnableGatewayController: &wrappers.BoolValue{
	Value: true,
},
Member Author

Need to confirm with Tyler or Sam why this change is necessary.

Contributor

Oh yes, let's chat about this, I have some context and questions

Contributor

We followed up in Slack. We discussed how this is due to some unknown proto changes that cause boolean values to be overridden in our tests. This impacts only this test because we require EnableGatewayController (edge gw) to be true in Settings, but since we define some other values in the same struct, the default true value is not respected; the overriding empty value is used instead, so it ends up false.

Our plan is two-fold:

  1. Keep this temporary solution so we can merge this large change. This way this PR doesn't go out of date.
  2. Immediately after, investigate what proto changes could lead to this and provide an explanation and fix.

cc @timflannagan
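The symptom described above can be modeled in a few lines. This is only a sketch of the observed behavior, not the root cause, which the thread leaves open; `boolValue`, `gatewaySettings`, `OtherOption`, and `override` are hypothetical stand-ins for the real Settings types and merge logic.

```go
package main

import "fmt"

// boolValue is a hypothetical stand-in for a wrapper type such as wrappers.BoolValue.
type boolValue struct{ Value bool }

// gatewaySettings is a hypothetical, trimmed-down settings struct.
type gatewaySettings struct {
	EnableGatewayController *boolValue
	OtherOption             string
}

// enabled mimics a nil-safe getter: an unset wrapper reads as false.
func (g *gatewaySettings) enabled() bool {
	return g != nil && g.EnableGatewayController != nil && g.EnableGatewayController.Value
}

// override models a replace-style merge (an assumption): if a custom struct is
// provided, it wins wholesale and any defaults it does not repeat are lost.
func override(defaults, custom *gatewaySettings) *gatewaySettings {
	if custom != nil {
		return custom
	}
	return defaults
}

func main() {
	defaults := &gatewaySettings{EnableGatewayController: &boolValue{Value: true}}
	partial := &gatewaySettings{OtherOption: "x"}
	explicit := &gatewaySettings{
		OtherOption:             "x",
		EnableGatewayController: &boolValue{Value: true},
	}

	fmt.Println(override(defaults, nil).enabled())      // defaults only
	fmt.Println(override(defaults, partial).enabled())  // default lost
	fmt.Println(override(defaults, explicit).enabled()) // workaround: set it explicitly
}
```

Setting the field explicitly in the test's struct, as the temporary fix does, sidesteps the question of how defaults are merged.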

Member Author

Added this as a TODO on the parent issue as well.

@sam-heilbron sam-heilbron added the work in progress signals bulldozer to keep pr open (don't auto-merge) label Sep 23, 2024
@sam-heilbron
Contributor

Adding the work in progress label to prevent auto-merging, while we confirm this can be pulled into enterprise seamlessly

Contributor

@jmhbh jmhbh left a comment

LGTM, just a small nit and a quick q. Thanks for working on this Tim!

@@ -67,12 +67,12 @@ jobs:

# September 16, 2024: 21 minutes
- cluster-name: 'cluster-three'
-go-test-args: '-v -timeout=25m'
+go-test-args: '-v -timeout=30m'
Contributor

just curious why the timeout bump is necessary?

Member Author

@sam-heilbron Any idea on why this was necessary?

@jmhbh
Contributor

jmhbh commented Sep 24, 2024

Adding the work in progress label to prevent auto-merging, while we confirm this can be pulled into enterprise seamlessly

Created this branch k8s-1.31-bump to test the enterprise integration.

@timflannagan
Member Author

Created this branch k8s-1.31-bump to test the enterprise integration.

Just saw this now. I opened solo-io#10069 earlier today to work through all the potential issues.

@timflannagan timflannagan removed the work in progress signals bulldozer to keep pr open (don't auto-merge) label Sep 24, 2024
@soloio-bulldozer soloio-bulldozer bot merged commit c831171 into kgateway-dev:main Sep 24, 2024
18 checks passed
@timflannagan timflannagan deleted the chore/k8s-1-31 branch September 24, 2024 22:31