
Remove port-forwarding from Kubernetes E2E tests #9575

Closed
wants to merge 68 commits

Conversation

@jbohanon (Contributor) commented Jun 4, 2024

Description

Port forwarding is not a pattern used in user deployments; it is a quick hack to get networking into a cluster. Since the intent of our Kubernetes E2E tests is to imitate user patterns, we opt to eliminate port forwards from the Kubernetes E2E tests altogether.

Note: the specific use case of accessing the Envoy admin endpoints is not particularly user-facing behavior anyway, so for those checks we have chosen to run a curl from within the proxy pod.

The port-forwarding utility did not, by default, wait for the port-forward to be available before returning. This PR adds the dial logic to the utility so that it only returns once the forwarded port is actually reachable; a rough sketch of the idea follows.
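A hedged sketch of the dial-based readiness check, assuming a hypothetical waitForPortForward helper (the package, helper name, and timing values are illustrative, not this PR's actual code):

```go
package portforwardutil

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForPortForward dials the locally forwarded address until it accepts a
// TCP connection or the context expires, so callers only proceed once the
// port-forward is actually usable. (Illustrative helper, not the PR's code.)
func waitForPortForward(ctx context.Context, address string) error {
	for {
		conn, err := net.DialTimeout("tcp", address, 250*time.Millisecond)
		if err == nil {
			return conn.Close()
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("port-forward to %s never became ready: %w", address, err)
		case <-time.After(250 * time.Millisecond):
		}
	}
}
```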

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@solo-changelog-bot

Issues linked to changelog:
solo-io#9353

@github-actions bot added the 'keep pr updated' label (signals bulldozer to keep pr up to date with base branch) on Jun 4, 2024
@jbohanon (Contributor, Author) commented Jun 4, 2024

/kick


github-actions bot commented Jun 4, 2024

Visit the preview URL for this PR (updated for commit a392448):

https://gloo-edge--pr9575-jbohanon-port-forwar-vd8cg4zt.web.app

(expires Fri, 21 Jun 2024 17:00:42 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

Sign: 77c2b86e287749579b7ff9cadb81e099042ef677

@jbohanon jbohanon requested review from sam-heilbron and jenshu June 4, 2024 16:55
sam-heilbron previously approved these changes Jun 4, 2024
jenshu previously approved these changes Jun 4, 2024
@jbohanon jbohanon dismissed stale reviews from jenshu and sam-heilbron via 0dbb5d4 June 4, 2024 18:14
sam-heilbron previously approved these changes Jun 4, 2024
@jbohanon added the 'work in progress' label (signals bulldozer to keep pr open, don't auto-merge) on Jun 4, 2024
@jbohanon (Contributor, Author) commented Jun 4, 2024

work in progress label added to investigate a CI-only failure in the Kubernetes End-to-End cluster-two job

Removal criteria:

  • Revert changes in test selection
  • Remove CI skips in changelog
  • Actually passing tests

Stdout: params.Stdout,
Stderr: params.Stderr,
Env: params.Env,
Args: []string{"describe", "pod", "-l", label},
jbohanon (Contributor, Author) commented:

add namespace

}

func KubeLogs(ctx context.Context, label string, params Params) string {
params.Args = []string{"logs", "-l", label}
jbohanon (Contributor, Author) commented:

add namespace

@sam-heilbron (Contributor) left a comment:

Nice! I haven't looked too in depth at the implementation, but I shared some high-level thoughts. Could we make sure the following is completed:

  • This sentiment around port-forwarding should be captured in our e2e readme
  • The PR body should clarify why we believe this will reduce flakes and what approvers can do to validate that
  • There is a lot of new code; how will we prevent developers from calling the wrong code at the wrong time?

@@ -0,0 +1,193 @@
package kubectl
Contributor commented:

I'm not a huge fan of placing a new "hack" directory into our testutils code. It feels like we're creating the potential for code that will never leave. Why not add the utilities directly to the places we want them, and clarify in the code that we would like to progressively migrate some of these components to a shared library when they are stable?

jbohanon (Contributor, Author) commented:

I feel that having a hack directory in the code is a huge smell that creates an underlying tension indicating this shouldn't be here. I would think that this code would be more likely to assimilate itself into the codebase and never end up being extracted back out if we interspersed it with permanent code. I would rather deal with the pain of inter-repo PRs than risk a bespoke Kubectl function staying in this repo for no reason other than it hasn't been looked at in months.

In either case I think an issue should be opened to track migrating the code back to the shared util.

"strings"

"github.com/solo-io/gloo/pkg/utils/cmdutils"
"github.com/solo-io/gloo/pkg/utils/envoyutils/admincli"
Contributor commented:

How is this different than the admincli in the envoyutils package? If this is a better place for that code, can we just move it all to live in the same place?

jbohanon (Contributor, Author) commented:

This could probably use some rebranding/relocation to indicate that it is a utility for accessing the admin client from inside the cluster. I could possibly refactor the admincli to take a Cmder to avoid so much code duplication, but that seemed like an even deeper bite than I had already committed to.

)

type EnvoyAdminAssertion func(ctx context.Context, adminClient *envoyadmin.Client, clientRefresh func(ctx context.Context, adminClient *envoyadmin.Client, expectedReplicas int) *envoyadmin.Client)
Contributor commented:

I like the idea of dependency injection, though do you think the clientRefresh will have multiple implementations? If anything, it feels like an implementation detail that only exists because of a subtle behavior. It feels like we'd be better suited, in fact, to "hide" this, i.e. not expose it to users, and require that they use the single implementation. wdyt?

jbohanon (Contributor, Author) commented:

The biggest thing that needs to be injected on a per-case basis is the required number of replicas. Since the status reporting for pods that have had a termination event raised against them by the kubelet is garbage (no way to tell without querying events if a pod is "terminating"; it still shows as ready), the way I figured out how to select a truly ready pod was to wait for the number of replicas for the deployment to stabilize on a known value. I agree that it's pretty smelly though; maybe I could look at passing the expected replicas into the assertion.

Though as I am typing this up, I remember that the main reason it was injected was so that the assertion could call the refresh within its Eventually block, keeping the client stabilization as close as possible to the call that needs a stable client.

It is truly awful that a pod in Terminating state does not have a status condition stating such
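To make the injection concrete, here is a rough sketch of how an assertion could call the injected clientRefresh inside its own Eventually block. The import path, the GetConfigDump method name, and the timing values are assumptions for illustration, not this PR's actual assertion code:

```go
package assertions

import (
	"context"
	"time"

	"github.com/onsi/gomega"

	// Assumed import path for the in-cluster envoy admin client added in this PR.
	envoyadmin "github.com/solo-io/gloo/test/kubernetes/testutils/cmd/envoyadmin"
)

// listenerAssertion refreshes the admin client on every attempt so each poll
// talks to a pod from a stabilized (fully ready) set of replicas.
func listenerAssertion(expectedReplicas int) EnvoyAdminAssertion {
	return func(ctx context.Context, adminClient *envoyadmin.Client, clientRefresh func(ctx context.Context, adminClient *envoyadmin.Client, expectedReplicas int) *envoyadmin.Client) {
		gomega.Eventually(func(g gomega.Gomega) {
			// Re-resolve the target pod before each attempt; a pod selected on a
			// previous attempt may have started terminating in the meantime.
			client := clientRefresh(ctx, adminClient, expectedReplicas)

			// GetConfigDump is an illustrative method name, not necessarily the
			// client's real API.
			_, err := client.GetConfigDump(ctx)
			g.Expect(err).NotTo(gomega.HaveOccurred())
		}, 30*time.Second, time.Second).Should(gomega.Succeed())
	}
}
```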

jbohanon (Contributor, Author) commented:

According to kubernetes/kubernetes#102537 this is expected behavior
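For what it's worth, the one reliable signal is metadata.deletionTimestamp, which is set as soon as deletion starts. A minimal sketch of filtering on it with client-go (the function name and selector handling are illustrative):

```go
package kubeutils

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyNonTerminatingPods filters out pods the kubelet is tearing down. A
// terminating pod can still report Ready conditions, but it always carries a
// deletionTimestamp, so that is the field to key off of.
func readyNonTerminatingPods(ctx context.Context, client kubernetes.Interface, namespace, labelSelector string) ([]corev1.Pod, error) {
	podList, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil {
		return nil, err
	}
	var ready []corev1.Pod
	for _, pod := range podList.Items {
		if pod.DeletionTimestamp != nil {
			continue // terminating, even though it may still show as Ready
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = append(ready, pod)
				break
			}
		}
	}
	return ready, nil
}
```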

cmd.Stdout = io.MultiWriter(cmd.Stdout, &combinedOutput)
cmd.Stderr = io.MultiWriter(cmd.Stderr, &combinedOutput)

_, err := kube.ExecFromEphemeralPod(context.Background(), cmd.EphemeralPodParams)
Contributor commented:

I'm not sure I follow this entirely. What is the value in making a kube specific command, versus just invoking the kube specific method on the client?

The original intention of the LocalCmd was to make it so you could have a shared implementation of cmd.Exec that clients could use under the hood. It feels like we've inverted that relationship: now calling code must know a command is being used, and isn't aware that kube.ExecFromEphemeralPod is being called.

jbohanon (Contributor, Author) commented:

I can't argue original intent since you wrote the code, but the shape of the code implies to me that cmdutils.Cmd was abstracting away the implementation of exec.Cmd as LocalCmd. The very existence of LocalCmd to me implies a RemoteCmd implementation was expected eventually.

I think the benefit of this abstraction is that devs can use context-specific Cmds in ways that have similar ergonomics. If you're running in-memory e2e or unit tests, a LocalCmd is what you need, while if you're running in Kube, a RemoteCmd is what you need. The latter requires slightly more setup on the Cmder, but other than that, they're wielded in the same way with regard to setting up the input/output streams and calling Run.

jbohanon (Contributor, Author) commented:

What is the value in making a kube specific command, versus just invoking the kube specific method on the client?

A RemoteCmd, just like a LocalCmd, will, I expect, be used primarily by clients. This code primarily enables the remote envoy admin client in test/kubernetes/testutils/cmd/envoyadmin/envoyadmin.go and the helper wrapper for a remote curl in test/kubernetes/testutils/cmd/curl.go, which to me is analogous to the LocalCmd being used by the admincli and curl helper clients.
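For anyone skimming, a simplified, hypothetical sketch of the relationship being described (this is not the actual cmdutils API, just the shape of it): both implementations satisfy one small interface, so callers wire up streams and call Run the same way regardless of where the command executes.

```go
package cmdsketch

import (
	"context"
	"io"
	"os/exec"
)

// Cmd captures the shared ergonomics: configure output streams, then Run.
// (Hypothetical interface for illustration only.)
type Cmd interface {
	WithStdout(io.Writer) Cmd
	WithStderr(io.Writer) Cmd
	Run(ctx context.Context) error
}

// localCmd executes on the machine running the tests (the LocalCmd case).
type localCmd struct {
	name           string
	args           []string
	stdout, stderr io.Writer
}

func (c *localCmd) WithStdout(w io.Writer) Cmd { c.stdout = w; return c }
func (c *localCmd) WithStderr(w io.Writer) Cmd { c.stderr = w; return c }
func (c *localCmd) Run(ctx context.Context) error {
	cmd := exec.CommandContext(ctx, c.name, c.args...)
	cmd.Stdout, cmd.Stderr = c.stdout, c.stderr
	return cmd.Run()
}

// remoteCmd executes inside the cluster; execInPod stands in for something
// like kube.ExecFromEphemeralPod (the RemoteCmd case).
type remoteCmd struct {
	execInPod      func(ctx context.Context, stdout, stderr io.Writer, args []string) error
	args           []string
	stdout, stderr io.Writer
}

func (c *remoteCmd) WithStdout(w io.Writer) Cmd { c.stdout = w; return c }
func (c *remoteCmd) WithStderr(w io.Writer) Cmd { c.stderr = w; return c }
func (c *remoteCmd) Run(ctx context.Context) error {
	return c.execInPod(ctx, c.stdout, c.stderr, c.args)
}
```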

}
)

func CurlWithEphemeralPod(ctx context.Context, logger io.Writer, kubeContext, fromns, frompod string, args ...string) string {
Contributor commented:

Building from my comment about the introduction of the hack directory, I think if we want to introduce this into the codebase, we can/should just place this on the kube.Cli that we created and perhaps clarify it was copy/pasted from a go-utils@commit-sha

@bewebi (Contributor) left a comment:

Overall makes sense and looks good, just a few nits/questions

pkg/utils/kubeutils/kubectl/cli.go: outdated review thread (resolved)
@@ -55,8 +55,19 @@ type apiPortForwarder struct {
}

func (f *apiPortForwarder) Start(ctx context.Context, options ...retry.Option) error {
return retry.Do(func() error {
if err := retry.Do(func() error {
Contributor commented:

Can we add comments in this function about what each step is?
I think I follow, but I'm not sure, and it will help for others
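For reviewers, here is roughly the shape the two steps take, with comments. The helper names, package, and retry import path are assumptions for illustration; the real code lives on the forwarder types in this PR:

```go
package portforwardutil

import (
	"context"
	"net"

	"github.com/avast/retry-go"
)

// startAndWait sketches the two phases: (1) retry establishing the port-forward
// itself, then (2) retry dialing the local address until it actually accepts a
// TCP connection, and only then report success. start and address stand in for
// the forwarder's own methods.
func startAndWait(ctx context.Context, start func(context.Context) error, address string, options ...retry.Option) error {
	// Phase 1: establish the forward, retrying transient failures (e.g. the
	// selected pod was terminating and the forward died immediately).
	if err := retry.Do(func() error {
		return start(ctx)
	}, options...); err != nil {
		return err
	}

	// Phase 2: the forward can exist before it is usable, so dial until a
	// connection succeeds before returning to the caller.
	return retry.Do(func() error {
		conn, err := net.Dial("tcp", address)
		if err != nil {
			return err
		}
		return conn.Close()
	}, options...)
}
```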

@@ -38,8 +38,19 @@ type cliPortForwarder struct {
}

func (c *cliPortForwarder) Start(ctx context.Context, options ...retry.Option) error {
return retry.Do(func() error {
if err := retry.Do(func() error {
Contributor commented:

Can we also add comments here?

test/kubernetes/testutils/assertions/envoy.go: outdated review thread (resolved)
go func() {
	for {
		select {
		case <-time.After(time.Second):
Contributor commented:

What's the reason for using one second as the polling interval?
It seems high to me, but might make sense depending on the use case

Comment on lines 127 to 134
finished := func(output string) bool {
	return strings.Contains(output, "Running") || strings.Contains(output, "ContainerCreating")
}
for _, label := range labels {
	if err := WaitPodStatus(ctx, interval, namespace, label, "Running or ContainerCreating", finished, params); err != nil {
		return err
	}
}
Contributor commented:

Why do we first wait for Running or ContainerCreating and then wait for just Running?
Logically this seems equivalent to simply waiting for all to be Running?

jbohanon (Contributor, Author) commented:

I can't say for certain, and I'm not even sure if the reason this was originally done is still relevant, but this was a deliberate change made >5 years ago by Marco in https://github.com/solo-io/go-utils/pull/65/files#diff-194e17bf80495c27c08438d8dda9585f21023dfeb92a8910adb4294d58d01b39L103. I'd tag him to ask why, but it was over 5 years ago, so I think it's unlikely he remembers. If you think it's worth pursuing, though, I can.

})
}

func Curl(ctx context.Context, logger io.Writer, kubeContext, ns, fromDeployment, fromContainer, url string) string {
Contributor commented:

Do we want to add the ability to pass additional curl args?

jbohanon (Contributor, Author) commented:

Everything in test/testutils/hack was copied† from other shared repos. The intent is for this code to eventually migrate back to the shared repos, since it is actually used by some other projects in the org. That said, unless we wanted to move the entire curl util we have written over to go-utils or k8s-utils, I don't think we can support that here.

†: There were modifications made to the code, the roots of which can be seen in these two PRs: solo-io/go-utils#527 solo-io/k8s-utils#51

@jbohanon added the 'work in progress' label (signals bulldozer to keep pr open, don't auto-merge) on Jun 14, 2024
@jbohanon jbohanon marked this pull request as draft June 14, 2024 17:55
@jbohanon (Contributor, Author) commented:

This PR is being placed on hold in favor of a less intrusive fix to the linked issue. The fundamental issue with the test that was flaking is that it is possible to select a pod in Terminating state for port-forward. Adding retries to the port-forward and making sure that we aren't selecting a pod in Terminating status should be sufficient to solve the stated problem.
