
An automated debugging tooling for the Eventing #6145

Closed
cardil opened this issue Feb 14, 2022 · 3 comments

cardil commented Feb 14, 2022

Problem
Unfortunately, we see data loss in Eventing components from time to time. Here are some notable occurrences (knative-extensions/eventing-kafka#649, knative-extensions/eventing-kafka#549, #2357).

Currently, we see failures in the KafkaSource upgrade tests in the CI for the upcoming OpenShift 4.10. Each time we encounter data loss, we struggle to pinpoint where the failure occurred, because Eventing lacks dedicated tooling that could help debug such situations.

Persona:
Developer

Exit Criteria

  • Dedicated tooling should provide a clear signal explaining why the expected events haven't been delivered.
    • We expect to see a report with the events that were dropped, the place where each drop occurred, and timing information. Essentially, the distributed tracing spans of the missed events.
  • It should be easy to set up - "Track this namespace, for events of type XXX and YYY" (a rough sketch of such a setup follows this list).
    • This makes it possible to embed it within test suites and external tools like kn-trace.
  • It needs to support filtering by fields, especially the source field.
    • When running tests in parallel, it is crucial to debug only specific event streams.
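
For illustration only, a minimal Go sketch of what such a setup surface might look like when embedded in a test suite. The type and field names are hypothetical (no such API exists in Eventing today); it only shows the "track this namespace, these event types, this source" shape described above.

```go
// Hypothetical sketch only: TrackerConfig and its fields are made up for
// illustration and are not an existing Knative Eventing API.
package main

import "fmt"

// TrackerConfig describes which event streams the debugging tooling should follow.
type TrackerConfig struct {
	Namespace  string   // namespace to watch for event deliveries
	EventTypes []string // CloudEvent "type" values to track
	Source     string   // CloudEvent "source" to filter on, so parallel test runs debug only their own stream
}

func main() {
	// How a test suite or a tool like kn-trace could declare what to track.
	cfg := TrackerConfig{
		Namespace:  "eventing-upgrade-tests",
		EventTypes: []string{"com.example.type.xxx", "com.example.type.yyy"},
		Source:     "wathola-sender",
	}
	fmt.Printf("tracking types %v from source %q in namespace %q\n",
		cfg.EventTypes, cfg.Source, cfg.Namespace)
}
```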

Time Estimate (optional):
10d

Additional context (optional)
End users who would like to extend Eventing with their own implementations could also suffer from this issue, so such tooling would be valuable for them as well.

Running more tests with continual traffic, as the wathola tooling does, could uncover additional data-loss bugs. In the future, it would be possible to run chaos/soak/failover scenarios with such assertions, and the debugging tooling would be helpful there as well.

cardil commented Feb 14, 2022

My general idea for such tooling would be something like:

  1. Deploy the tracing extension and configure it to collect spans in a central store (Zipkin or Jaeger).
  2. Send tracing information from wathola-sender.
  3. If data loss is reported, automatically search the span store for the missing events and produce a report (a minimal sketch of this lookup step follows below).
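
To make step 3 concrete, a minimal sketch of the lookup against the Zipkin v2 HTTP API. The tag used to correlate spans with CloudEvent IDs (messaging.message_id) and the in-cluster Zipkin address are assumptions, not something this issue defines; given the IDs that wathola reports as lost, it asks Zipkin where each event was last seen.

```go
// Minimal sketch of the missing-event lookup, assuming spans carry the
// CloudEvent ID in a tag such as "messaging.message_id" (assumed tag name).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// span holds the few fields of the Zipkin v2 span model used in the report.
type span struct {
	Name          string `json:"name"`
	Timestamp     int64  `json:"timestamp"` // microseconds since epoch
	LocalEndpoint struct {
		ServiceName string `json:"serviceName"`
	} `json:"localEndpoint"`
}

// reportMissingEvent queries Zipkin for traces tagged with the given event ID
// and prints every recorded hop, or a note that the event was never traced.
func reportMissingEvent(zipkinBase, eventID string) error {
	q := url.Values{}
	q.Set("annotationQuery", "messaging.message_id="+eventID) // assumed tag name
	q.Set("lookback", "3600000")                              // last hour, in milliseconds
	q.Set("limit", "10")

	resp, err := http.Get(zipkinBase + "/api/v2/traces?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// The v2 API returns a list of traces; each trace is a list of spans.
	var traces [][]span
	if err := json.NewDecoder(resp.Body).Decode(&traces); err != nil {
		return err
	}

	if len(traces) == 0 {
		fmt.Printf("event %s: no spans found - lost before the first traced hop\n", eventID)
		return nil
	}
	for _, trace := range traces {
		for _, s := range trace {
			fmt.Printf("event %s: seen in span %q on service %q at %d\n",
				eventID, s.Name, s.LocalEndpoint.ServiceName, s.Timestamp)
		}
	}
	return nil
}

func main() {
	// Hypothetical inputs: Zipkin exposed in the cluster, plus IDs wathola reported missing.
	for _, id := range []string{"step-42", "step-43"} {
		if err := reportMissingEvent("http://zipkin.tracing.svc:9411", id); err != nil {
			fmt.Println("query failed:", err)
		}
	}
}
```

The real tooling would feed such a report back into the test results rather than print it, but the query shape would stay the same.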

cardil commented Apr 1, 2022

I think #6249 and #6219 fix this issue in full. Closing as done.

/assign @mgencur
/close

knative-prow bot commented Apr 1, 2022

@cardil: Closing this issue.

In response to this:

I think #6249 and #6219 fix this issue in full. Closing as done.

/assign @mgencur
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
