
An automated debugging tooling for the Eventing #6145

Closed
cardil opened this issue Feb 14, 2022 · 3 comments

cardil commented Feb 14, 2022

Problem
Unfortunately, we see data loss in Eventing components from time to time. Here are some notable occurrences (knative-extensions/eventing-kafka#649, knative-extensions/eventing-kafka#549, #2357).

Currently, we see failures in the KafkaSource upgrade tests in the CI for the upcoming OpenShift 4.10. Each time we encounter data loss, we struggle to pinpoint where the failure occurred, because Eventing lacks dedicated tooling that could help debug such situations.

Persona:
Developer

Exit Criteria

  • Dedicated tooling should provide a clear signal explaining why the expected events haven't been delivered.
    • We expect to see a report with the events that were dropped, the place where each drop occurred, and timing information. Essentially, the distributed tracing spans of the missed events.
  • It should be easy to set up - "Track this namespace, for events of type XXX and YYY" (a rough sketch of such a setup follows this list).
    • This makes it possible to embed it within test suites and external tools like kn-trace.
  • It needs to support filtering by fields, especially the source field.
    • When running tests in parallel, it is crucial to debug only specific event streams.
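
For illustration only, a minimal Go sketch of what such a setup surface might look like when embedded in a test suite. The type and field names are hypothetical (no such API exists in Eventing today); it only shows the "track this namespace, these event types, this source" shape described above.

```go
// Hypothetical sketch only: TrackerConfig and its fields are made up for
// illustration and are not an existing Knative Eventing API.
package main

import "fmt"

// TrackerConfig describes which event streams the debugging tooling should follow.
type TrackerConfig struct {
	Namespace  string   // namespace to watch for event deliveries
	EventTypes []string // CloudEvent "type" values to track
	Source     string   // CloudEvent "source" to filter on, so parallel test runs debug only their own stream
}

func main() {
	// How a test suite or a tool like kn-trace could declare what to track.
	cfg := TrackerConfig{
		Namespace:  "eventing-upgrade-tests",
		EventTypes: []string{"com.example.type.xxx", "com.example.type.yyy"},
		Source:     "wathola-sender",
	}
	fmt.Printf("tracking types %v from source %q in namespace %q\n",
		cfg.EventTypes, cfg.Source, cfg.Namespace)
}
```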

Time Estimate (optional):
10d

Additional context (optional)
End users who would like to extend Eventing with their own implementations could also suffer from this issue, so such tooling would be valuable for them as well.

Running more tests with continual traffic, as the wathola tooling does, could uncover additional data-loss bugs. In the future, it would be possible to run chaos/soak/failover scenarios with such assertions, and the debugging tooling would be helpful there as well.

cardil commented Feb 14, 2022

My general idea for such tooling would be something like:

  1. Deploy the tracing extension and configure it to collect spans in a central store (Zipkin or Jaeger).
  2. Send tracing information from wathola-sender.
  3. If data loss is reported, automatically search the span store for the missing events and produce a report (a minimal sketch of this lookup step follows below).
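
To make step 3 concrete, a minimal sketch of the lookup against the Zipkin v2 HTTP API. The tag used to correlate spans with CloudEvent IDs (messaging.message_id) and the in-cluster Zipkin address are assumptions, not something this issue defines; given the IDs that wathola reports as lost, it asks Zipkin where each event was last seen.

```go
// Minimal sketch of the missing-event lookup, assuming spans carry the
// CloudEvent ID in a tag such as "messaging.message_id" (assumed tag name).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// span holds the few fields of the Zipkin v2 span model used in the report.
type span struct {
	Name          string `json:"name"`
	Timestamp     int64  `json:"timestamp"` // microseconds since epoch
	LocalEndpoint struct {
		ServiceName string `json:"serviceName"`
	} `json:"localEndpoint"`
}

// reportMissingEvent queries Zipkin for traces tagged with the given event ID
// and prints every recorded hop, or a note that the event was never traced.
func reportMissingEvent(zipkinBase, eventID string) error {
	q := url.Values{}
	q.Set("annotationQuery", "messaging.message_id="+eventID) // assumed tag name
	q.Set("lookback", "3600000")                              // last hour, in milliseconds
	q.Set("limit", "10")

	resp, err := http.Get(zipkinBase + "/api/v2/traces?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// The v2 API returns a list of traces; each trace is a list of spans.
	var traces [][]span
	if err := json.NewDecoder(resp.Body).Decode(&traces); err != nil {
		return err
	}

	if len(traces) == 0 {
		fmt.Printf("event %s: no spans found - lost before the first traced hop\n", eventID)
		return nil
	}
	for _, trace := range traces {
		for _, s := range trace {
			fmt.Printf("event %s: seen in span %q on service %q at %d\n",
				eventID, s.Name, s.LocalEndpoint.ServiceName, s.Timestamp)
		}
	}
	return nil
}

func main() {
	// Hypothetical inputs: Zipkin exposed in the cluster, plus IDs wathola reported missing.
	for _, id := range []string{"step-42", "step-43"} {
		if err := reportMissingEvent("http://zipkin.tracing.svc:9411", id); err != nil {
			fmt.Println("query failed:", err)
		}
	}
}
```

The real tooling would feed such a report back into the test results rather than print it, but the query shape would stay the same.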

cardil commented Apr 1, 2022

I think #6249 and #6219 fix this issue in full. Closing as done.

/assign @mgencur
/close

knative-prow bot commented Apr 1, 2022

@cardil: Closing this issue.

In response to this:

I think #6249 and #6219 fix this issue in full. Closing as done.

/assign @mgencur
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
