-
Notifications
You must be signed in to change notification settings - Fork 605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print traces for missed events in upgrade tests #6249
Print traces for missed events in upgrade tests #6249
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mgencur The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
eceb1b3
to
9da3205
Compare
This log shows printing a Zipkin trace for an event that is missed (the link points exactly to the line where the trace is printed) The problem is that I had to increase the interval between events to 500ms (from 10ms) to prevent dropping trace spans because the in-memory Zipkin database doesn't handle such a big load and goes OutOfMemory several times during the regular test. This was just to demonstrate the missed event and to get the trace from Zipkin. The change will not be part of this PR. We need to discuss possible options to resolve this. |
Codecov Report
@@ Coverage Diff @@
## main #6249 +/- ##
=======================================
Coverage 82.18% 82.18%
=======================================
Files 231 231
Lines 7787 7787
=======================================
Hits 6400 6400
Misses 937 937
Partials 450 450 Continue to review full report at Codecov.
|
The output is meant to be readable from the console, I've converted the http response from Zipkin into a "trace tree" and reused the tree API from conformance tests. No sure if it's necessary to have GUI for reading the trace, it doesn't give you much more except for searching capabilities, but this is already a single trace (no search needed).
Then later, the user would do this:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this PR!
I'm wondering if we can make it a library that would work in any rekt tests out of the box
@pierDipi thanks ! :) Making it a library would probably make sense. Let me think about it. |
I would like to have the span saved as an artifact, to later inspect, as you've described @mgencur. I think reading the span in text format is cumbersome. Loading it to GUI makes more sense to me. |
The message could be:
|
@cardil I guess the logic for getting the trace will need to be moved from Receiver (in-cluster) to Prober (localhost). And so the message will probably don't include anything about the trace. The Prober can later get traces for both Step events (based on |
Pushed changes for exporting traces into files. |
I have figured out the memory settings for Zipkin which works for both upgrade tests and non-upgrade Github tests. The Zipkin pod is now able to store all traces and the upgrade tests properly get the trace at the end as can be seen in this run |
3d97325
to
95ca3c8
Compare
68bca22
to
2fec7c7
Compare
This reverts commit 2fec7c7.
Latest run which shows exporting a trace for a missed event. |
The failure in downstream tests for eventing-rabbitmq seems to be unrelated to this PR. |
test/upgrade/prober/verify.go
Outdated
func (p *prober) getTraceForStepEvent(eventNo string) []byte { | ||
query := fmt.Sprintf("step=%s and cloudevents.type=%s and target=%s", | ||
eventNo, event.StepType, fmt.Sprintf(forwarderTargetFmt, p.client.Namespace)) | ||
trace, err := event.FindTrace(query) | ||
if err != nil { | ||
p.log.Warn(err) | ||
} | ||
return trace | ||
} | ||
|
||
func (p *prober) getTraceForFinishedEvent() []byte { | ||
query := fmt.Sprintf("cloudevents.type=%s and target=%s", | ||
event.FinishedType, fmt.Sprintf(forwarderTargetFmt, p.client.Namespace)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these queries work across source, broker, channel implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They should work because these Tags are added by Wathola sender when generating the events: link.
They're basically added at the beginning of the Trace and the queries return all traces that have these tags.
Right now it should work anywhere where the http sender is used because it properly passes the Context (propagates the trace). I still want to add it to the KafkaSender here (I have the code locally but waiting for the knative/eventing changes to propagate to that repo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cardil The query should return all traces for step/finished events with the given step/type as long as the event was not delivered in the end. For a missed event there might by multiple traces included in the report if the sender was trying to re-deliver them and it failed. And the Zipkin UI should be able to display them all.
I am not saving traces for duplicated events (as long as they were delivered) so I'm sending another commit that should fix it. Testing it now...
34f620d
to
c126521
Compare
@cardil Pushed some minor fixes. The previous run had both a missed event and duplicate events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍 Looks great!
/lgtm
#### Inspecting Zipkin traces for undelivered events | ||
|
||
When tracing is enabled in the `config-tracing` config map in the system namespace | ||
the prober collects traces for undelivered events. The traces are exported as json files | ||
under the artifacts dir. Traces for each event are stored in a separate file. | ||
Step event traces are stored as `$ARTIFACTS/traces/missed-events/step-<step_number>.json` | ||
The finished event traces are stored as `$ARTIFACTS/traces/missed-events/finished.json` | ||
|
||
Traces can be viewed as follows: | ||
- Start a Zipkin container on localhost: | ||
``` | ||
$ docker run -d -p 9411:9411 ghcr.io/openzipkin/zipkin:2 | ||
``` | ||
- Send traces to the Zipkin endpoint: | ||
``` | ||
$ curl -v -X POST localhost:9411/api/v2/spans \ | ||
-H 'Content-Type: application/json' \ | ||
-d @$ARTIFACTS/traces/missed-events/step-<step_number>.json | ||
``` | ||
- View traces in Zipkin UI at `http://localhost:9411/zipkin` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cardil, mgencur The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185.
* Wathola Tracing for upgrade tests (#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (#6289) * Print traces for missed events in upgrade tests (#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (#6331) * NPE fix (#6343) Co-authored-by: Chris Suszynski <[email protected]>
* Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185.
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (#6289) * Print traces for missed events in upgrade tests (#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (#6331) * NPE fix (#6343) Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]> Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]> Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
This builds on top of #6219
Fixes point 3. from #6145 (comment)
Proposed Changes
step-<num>.json
. Finished event is stored infinished.json
undertraces/missed-events/
Pre-review Checklist
Release Note
Docs