Remove port-forwarding from Kubernetes E2E tests #9575
Conversation
Issues linked to changelog:
/kick
Visit the preview URL for this PR (updated for commit a392448): https://gloo-edge--pr9575-jbohanon-port-forwar-vd8cg4zt.web.app (expires Fri, 21 Jun 2024 17:00:42 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
…io/gloo into jbohanon/port-forward-safety
work in progress label added to investigate a CI-only failure in the Kubernetes End-to-End cluster-two job. Removal criteria:
Stdout: params.Stdout,
Stderr: params.Stderr,
Env: params.Env,
Args: []string{"describe", "pod", "-l", label},
add namespace
}

func KubeLogs(ctx context.Context, label string, params Params) string {
	params.Args = []string{"logs", "-l", label}
add namespace
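For illustration, the suggested change could take roughly this shape (a sketch only; it assumes the helper is given a namespace parameter, which is not part of the current code):

```go
// kubeLogsArgs sketches the reviewer's suggestion: scope the label selector
// to a namespace. The namespace parameter is assumed for illustration and is
// not part of the existing helper.
func kubeLogsArgs(namespace, label string) []string {
	return []string{"logs", "-n", namespace, "-l", label}
}
```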
nice! I haven't looked too in depth at the implementation, but I shared some high level thoughts. Could we make sure the following is completed:
- This sentiment around port-forwarding should be captured in our e2e readme
- The PR body should clarify why we believe this will reduce flakes and what approvers can do to validate that
- There is a lot of new code; how will we prevent developers from calling the wrong code at the wrong time?
@@ -0,0 +1,193 @@
package kubectl |
I'm not a huge fan of placing a new "hack" directory into our testutils code. It feels like we're creating a potential to add code that will never leave. Why not add the utilities directly to the places we want, and clarify in the code that we would like to progressively migrate some of these components to a shared library when they are stable?
I feel that having a `hack` directory in the code is a huge smell that creates an underlying tension indicating this shouldn't be here. I would think that this code would be more likely to assimilate itself into the codebase and never end up being extracted back out if we interspersed it with permanent code. I would rather deal with the pain of inter-repo PRs than risk a bespoke `Kubectl` function staying in this repo for no reason other than it hasn't been looked at in months.
In either case I think an issue should be opened to track migrating the code back to the shared util.
"strings" | ||
|
||
"github.com/solo-io/gloo/pkg/utils/cmdutils" | ||
"github.com/solo-io/gloo/pkg/utils/envoyutils/admincli" |
How is this different than the admincli in the envoyutils package? If this is a better place for that code, can we just move it all to live in the same place?
This could probably use some rebranding/relocation to indicate that it is a utility for accessing the admin client from inside the cluster. I could possibly refactor the admincli to take a `Cmder` to avoid so much code duplication, but that seemed like an even deeper bite than I had already committed to.
)

type EnvoyAdminAssertion func(ctx context.Context, adminClient *envoyadmin.Client, clientRefresh func(ctx context.Context, adminClient *envoyadmin.Client, expectedReplicas int) *envoyadmin.Client)
I like the idea of dependency injection, though do you think the `clientRefresh` will have multiple implementations? If anything, it feels like an implementation detail that only exists because of a subtle behavior. It feels like we'd be better suited, in fact, to "hide" this, i.e. not expose it to users, and require that they use the single implementation. wdyt?
The biggest thing that needs to be injected on a per-case basis is the required number of replicas. Since the status reporting for pods that have had a termination event raised against them by the kubelet is garbage (there is no way to tell, without querying events, whether a pod is "terminating"; it still shows as ready), the way I figured out how to select a truly ready pod was to wait for the number of replicas for the deployment to stabilize on a known value. I agree that it's pretty smelly though; maybe I could look at passing the expected replicas into the assertion.
Though as I am typing this up, I remember that the main reason it was injected was so that the assertion could call the refresh within its Eventually block, keeping the client stabilization as close as possible to the call that needs a stable client.
It is truly awful that a pod in Terminating state does not have a status condition stating such.
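To make the intended usage concrete, here is a minimal sketch of an assertion that calls the injected refresh inside its Eventually block. The Gomega wiring and the `GetStatsRaw` method on `envoyadmin.Client` are assumptions for illustration, not the actual API.

```go
// Sketch only: the envoyadmin.Client method shown (GetStatsRaw) is a
// hypothetical stand-in for whatever admin-API call the assertion needs.
var envoyIsLive EnvoyAdminAssertion = func(
	ctx context.Context,
	adminClient *envoyadmin.Client,
	clientRefresh func(ctx context.Context, adminClient *envoyadmin.Client, expectedReplicas int) *envoyadmin.Client,
) {
	gomega.Eventually(func(g gomega.Gomega) {
		// Refresh inside the polling loop so the client is re-resolved against
		// a pod that is truly ready, i.e. after the deployment has stabilized
		// on the expected replica count (1 here, chosen per test case).
		client := clientRefresh(ctx, adminClient, 1)

		stats, err := client.GetStatsRaw(ctx) // hypothetical method
		g.Expect(err).NotTo(gomega.HaveOccurred())
		g.Expect(stats).To(gomega.ContainSubstring("server.live"))
	}, "30s", "1s").Should(gomega.Succeed())
}
```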
According to kubernetes/kubernetes#102537 this is expected behavior
cmd.Stdout = io.MultiWriter(cmd.Stdout, &combinedOutput)
cmd.Stderr = io.MultiWriter(cmd.Stderr, &combinedOutput)

_, err := kube.ExecFromEphemeralPod(context.Background(), cmd.EphemeralPodParams)
I'm not sure I follow this entirely. What is the value in making a kube-specific command, versus just invoking the kube-specific method on the client?
The original intention of the LocalCmd was to make it so you could have a shared implementation of cmd.Exec, and clients could use that under the hood. It feels like we've inverted that relationship: now the calling code must know a command is being used, and isn't aware that kube.ExecFromEphemeralPod is being called.
I can't argue original intent since you wrote the code, but the shape of the code implies to me that `cmdutils.Cmd` was abstracting away the implementation of `exec.Cmd` as `LocalCmd`. The very existence of `LocalCmd` implies to me that a `RemoteCmd` implementation was expected eventually.
I think the benefit of this abstraction is that devs can use context-specific `Cmd`s in ways that have similar ergonomics. If you're running in-memory e2e or unit tests, a `LocalCmd` is what you need, while if you're running in Kube, a `RemoteCmd` is what you need. The latter requires a slight bit more setup on the `Cmder`, but other than that, they're wielded in the same way with regard to setting up the input/output streams and calling `Run`.
> What is the value in making a kube specific command, versus just invoking the kube specific method on the client?

A `RemoteCmd`, just like a `LocalCmd`, will I expect be primarily used by clients. This code primarily enables the remote envoy admin client in `test/kubernetes/testutils/cmd/envoyadmin/envoyadmin.go` and the helper wrapper for a remote curl in `test/kubernetes/testutils/cmd/curl.go`, which to me is analogous to the `LocalCmd` being used by the `admincli` and `curl` helper clients.
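As a rough illustration of the ergonomics being described (the interface and type names below are simplified stand-ins, not the actual `cmdutils` API), a local and a remote command are constructed differently but wielded identically:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
)

// Cmd is a simplified stand-in for the shared command abstraction: callers
// set up streams and call Run without caring where the command executes.
type Cmd interface {
	Run(ctx context.Context) (output string, err error)
}

// localCmd runs on the test host (analogous to LocalCmd).
type localCmd struct {
	name string
	args []string
}

func (c *localCmd) Run(ctx context.Context) (string, error) {
	var out bytes.Buffer
	cmd := exec.CommandContext(ctx, c.name, c.args...)
	cmd.Stdout = &out
	cmd.Stderr = &out
	err := cmd.Run()
	return out.String(), err
}

// remoteCmd runs inside the cluster by shelling out to kubectl exec
// (analogous to the RemoteCmd idea; the real implementation differs).
type remoteCmd struct {
	kubeContext, namespace, pod, container string
	args                                   []string
}

func (c *remoteCmd) Run(ctx context.Context) (string, error) {
	kubectlArgs := append([]string{
		"--context", c.kubeContext, "-n", c.namespace,
		"exec", c.pod, "-c", c.container, "--",
	}, c.args...)
	return (&localCmd{name: "kubectl", args: kubectlArgs}).Run(ctx)
}

func main() {
	ctx := context.Background()
	// The call site is identical regardless of where the command runs;
	// the context, namespace, and pod names below are illustrative only.
	cmds := []Cmd{
		&localCmd{name: "hostname"},
		&remoteCmd{kubeContext: "kind-kind", namespace: "gloo-system", pod: "gateway-proxy-0", container: "gateway-proxy", args: []string{"hostname"}},
	}
	for _, c := range cmds {
		out, err := c.Run(ctx)
		fmt.Println(out, err)
	}
}
```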
}
)

func CurlWithEphemeralPod(ctx context.Context, logger io.Writer, kubeContext, fromns, frompod string, args ...string) string {
Building from my comment about the introduction of the `hack` directory, I think if we want to introduce this into the codebase, we can/should just place this on the kube.Cli that we created and perhaps clarify that it was copy/pasted from a `go-utils@commit-sha`.
Overall makes sense and looks good, just a few nits/questions
@@ -55,8 +55,19 @@ type apiPortForwarder struct {
}

func (f *apiPortForwarder) Start(ctx context.Context, options ...retry.Option) error {
-	return retry.Do(func() error {
+	if err := retry.Do(func() error {
Can we add comments in this function about what each step is?
I think I follow, but I'm not sure, and it will help for others
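In the spirit of that request, here is an annotated sketch of the flow as I read it (the forwarder's method names and the retry import path are assumptions; this is not the actual implementation):

```go
package portforward

import (
	"context"
	"net"
	"time"

	// assumed retry library; the repo's actual retry dependency may differ
	"github.com/avast/retry-go"
)

// forwarder is a stand-in for the port-forwarder types in this PR; the real
// structs expose equivalent operations under their own names.
type forwarder interface {
	Forward(ctx context.Context) error // initiate the port-forward
	Address() string                   // local address the forward listens on
	Close() error                      // tear the forward down
}

// start annotates the Start flow discussed above, step by step.
func start(ctx context.Context, f forwarder, options ...retry.Option) error {
	return retry.Do(func() error {
		// Step 1: establish the port-forward. This can fail if the target pod
		// is not ready yet, in which case retry.Do tries again per options.
		if err := f.Forward(ctx); err != nil {
			return err
		}
		// Step 2: a forward that has "started" is not necessarily usable yet,
		// so dial the local side until a TCP connection succeeds. This is the
		// dial logic the PR description refers to: Start should only return
		// nil once traffic can actually flow.
		conn, err := net.DialTimeout("tcp", f.Address(), time.Second)
		if err != nil {
			// Step 3: tear down the half-open forward before retrying so the
			// next attempt starts from a clean slate.
			_ = f.Close()
			return err
		}
		return conn.Close()
	}, options...)
}
```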
@@ -38,8 +38,19 @@ type cliPortForwarder struct {
}

func (c *cliPortForwarder) Start(ctx context.Context, options ...retry.Option) error {
-	return retry.Do(func() error {
+	if err := retry.Do(func() error {
Can we also add comments here?
go func() {
	for {
		select {
		case <-time.After(time.Second):
What's the reason for using one second as the polling interval?
It seems high to me, but might make sense depending on the use case
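If the interval does need to vary by use case, a parameterized form is easy to sketch (illustrative only, not part of this PR; imports are context and time):

```go
// pollEvery sketches making the polling interval configurable instead of
// hard-coding time.Second; a ticker also avoids allocating a new timer on
// every iteration. Illustrative only.
func pollEvery(ctx context.Context, interval time.Duration, check func()) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				check()
			}
		}
	}()
}
```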
finished := func(output string) bool {
	return strings.Contains(output, "Running") || strings.Contains(output, "ContainerCreating")
}
for _, label := range labels {
	if err := WaitPodStatus(ctx, interval, namespace, label, "Running or ContainerCreating", finished, params); err != nil {
		return err
	}
}
Why do we first wait for Running or ContainerCreating and then wait for just Running?
Logically this seems equivalent to simply waiting for all to be Running?
I can't say for certain, and I'm not even sure if the reason this was originally done is still relevant, but this was a deliberate change made >5 years ago by Marco in https://github.com/solo-io/go-utils/pull/65/files#diff-194e17bf80495c27c08438d8dda9585f21023dfeb92a8910adb4294d58d01b39L103. I'd tag him to ask why, but it was over 5 years ago, so it's unlikely he remembers I think. If you think it's worth pursuing though I can
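For what it's worth, the collapsed form being suggested would look roughly like this (a sketch against the existing WaitPodStatus helper, assuming the intermediate ContainerCreating phase really is redundant):

```go
// Sketch of the simplification discussed above: wait directly for Running,
// since any pod that reaches Running has already passed ContainerCreating.
finished := func(output string) bool {
	return strings.Contains(output, "Running")
}
for _, label := range labels {
	if err := WaitPodStatus(ctx, interval, namespace, label, "Running", finished, params); err != nil {
		return err
	}
}
```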
})
}

func Curl(ctx context.Context, logger io.Writer, kubeContext, ns, fromDeployment, fromContainer, url string) string {
Do we want to add the ability to pass additional curl args?
Everything in `test/testutils/hack` was copied† from other shared repos. The intent is for this code to migrate back to the shared repos eventually, since this code is actually used by some other projects in the org. That said, unless we wanted to move the entire `curl` util we have written over to `go-utils` or `k8s-utils`, I don't think we can support that here.
†: There were modifications made to the code, the roots of which can be seen in these two PRs: solo-io/go-utils#527 solo-io/k8s-utils#51
Co-authored-by: Bernie Birnbaum <[email protected]>
This PR is being placed on hold in favor of a less intrusive fix to the linked issue. The fundamental issue with the test that was flaking is that it is possible to select a pod in Terminating state, which still reports as ready.
Description
Port forwarding is not a pattern that is used in user deployments, but is a quick hack to get networking into a cluster. Since our intent with Kubernetes E2E tests is to imitate user patterns, we opt to eliminate the port forwards from the Kubernetes E2E tests altogether.
Note: for the specific use case of accessing Envoy admin endpoints, this is not a particularly user-facing behavior anyway, so we have chosen to run these checks as a curl from within the proxy pod.
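For reference, a sketch of what that in-pod curl amounts to (imports are context and os/exec; the deployment name, container name, and the 19000 admin port are assumptions for illustration, not necessarily what the tests use):

```go
// curlEnvoyAdmin sketches the approach: instead of port-forwarding the admin
// port, exec curl inside the proxy pod itself. Names and port are assumed.
func curlEnvoyAdmin(ctx context.Context, kubeContext, namespace, path string) (string, error) {
	out, err := exec.CommandContext(ctx,
		"kubectl", "--context", kubeContext, "-n", namespace,
		"exec", "deploy/gateway-proxy", "-c", "gateway-proxy", "--",
		"curl", "-s", "http://localhost:19000"+path,
	).CombinedOutput()
	return string(out), err
}
```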
The port-forwarding utility did not, by default, wait for the port-forward to actually be usable before returning successfully. This PR adds that dial logic to the utility.
Checklist: