-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up & deflake tests by disabling bundling #1689
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: evankanderson The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/assign @vagababov since I know this was driving him crazy -- any additional thoughts on tests? $ go test ./metrics -count 40 -timeout 900s 2>&1 > /tmp/test-log
$ cat /tmp/test-log
ok knative.dev/pkg/metrics 208.145s (I didn't push too much past |
Fixing th e races now, forgot to run with |
c40b6ae
to
263b86c
Compare
Note that at head, I've noticed that Prow test runs seem to be much slower than my local machine, so it's possible the timeouts are due to the slower processing time on the Prow cluster. |
/retest In hopes of getting another Prow run on this. |
metrics/resource_view_test.go
Outdated
records = append(records, extracted) | ||
if strings.HasPrefix(ts.Metric.Type, "knative.dev/") { | ||
if diff := cmp.Diff(ts.Resource.Type, metricskey.ResourceTypeKnativeRevision); diff != "" { | ||
t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff) | |
t.Errorf("Incorrect resource type for %q: (-want +got):\n%s", ts.Metric.Type, diff) |
metrics/resource_view_test.go
Outdated
sdFake.srv.GracefulStop() | ||
break loop | ||
} | ||
case <-time.After(5 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should those timeouts be consistent? it's 4s above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There used to be an extra delay on the Stackdriver outputs, but they should be similar now.
|
||
// A variable for testing to reduce the size (number of metrics) buffered before | ||
// Stackdriver will send a bundled metric report. Only applies if non-zero. | ||
TestOverrideBundleCount = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a public variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed like e2e tests in Serving or Eventing might want to be able to disable the batching for the same reasons that the metrics tests would want to do so.
I'd like to add tests for important exports to the Serving and Eventing repos once these tests are no longer flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure then let's specify that, since just looking at the comment I see no reason why it should be public.
Also let's do this in a linter happy format TestOverrideBundleCount is a variable...
Updated! |
The following is the coverage report on the affected files.
|
@@ -225,9 +226,10 @@ func sortMetrics() cmp.Option { | |||
|
|||
// Begin table tests for exporters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Begin table tests for exporters |
break loop | ||
} | ||
case <-time.After(4 * time.Second): | ||
t.Error("Timeout reading input") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if fatal is more appropriate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
select { | ||
case record := <-ocFake.published: | ||
if record == nil { | ||
continue loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not important here?
continue loop | |
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, you need the label, otherwise the continue
applies to the select
, rather than to the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, if it's a lambda you return from the func rather than continue. But since we're keeping the label, it doesn't matter.
labels := map[string]string{} | ||
if record.Resource != nil { | ||
labels = record.Resource.Labels | ||
loop: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically I'd prefer a lambda with return rather than goto label, but up to you. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would look like:
for {
breakErr := error.New("intentional break")
err := func() error {
select {
case record := <-ocFake.published:
if record == nil {
return nil
}
// ...
if len(records) >= len(expected) {
return breakErr
}
case <-time.After(5 * time.Second):
return errors.New("timeout")
}
}()
if err != breakErr {
t.Fatal("Unable to fetch events", err)
}
if err == breakErr {
break
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with the label is better, stylistically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you :-)
Value: ts.Points[0].Value.GetInt64Value(), | ||
} | ||
// Override 'cluster-name' label to reset to a fixed value | ||
if extracted.Labels["cluster_name"] != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it not be set if empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using Golang defaults, so mapVal["doesNotExist"]
will return the default value from the map map[string]string
, which is ""
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly theoretically there may be the value in the map with ""
value. So if this happens due to a bug it won't be caught. But 🤷 .
/lgtm |
Fixes #1672
It turns out that both Stackdriver and OpenCensus have rpc Bundlers that hold reports until at least N have been accumulated or for 1-2s. I had to send an upstream PR to OpenCensus to expose the bundler options in order to speed up reporting, which accounts for the
go.mod
update.It also turns out that gRPC will take (on my machine, going through ??? WSL2 -> Windows DNS -> internet) about 1s to attempt to resolve
external-svc
, so changing that to a name that will resolve sped up theconfig_test
by a second or so. It also turns out that gRPC takes an additional 1-2 seconds to set up "fail-fast" channels on Windows but not on a Linux kernel -- caveat testor if you're attempting to debug timing on Windows.I think this has been de-flaked from a timeout perspective; on my machine (i7 mobile processor, plugged in, running WSL2), the entire suite takes about 5s, compared with 10s before these changes.