Rework tracing slog sender #10799

Open
mhofman opened this issue Jan 4, 2025 · 0 comments
Labels: enhancement, telemetry


What is the Problem Being Solved?

We have an otel-trace slog sender which creates OpenTelemetry traces by inferring relations between slog events. However, it is fragile (#10405), and its heuristics require storing out-of-band information in a SQLite DB. The shape of the traces is also not very conducive to investigation when debugging executions.

We also have a separate causeway tool that does similar processing on a slog file.

Description of the Design

I believe that OpenTelemetry traces are still a decent format to represent causal execution, especially if we can properly use multiple links between traces.

Ideally we would stop relying on the result promise kpid to match calls to deliveries, but that requires kernel changes tracked in #6501. At the very least, we should try to make the association stateless so we don't need a large DB of pending calls (or, if we do need one, make sure we clean up these associations when possible).
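
One possible way to get a stateless association, sketched here only as an illustration: derive the span ID deterministically from the result promise kpid, so the syscall side and the delivery side can each compute the same ID without a table of pending calls. The function names and the `send:` key prefix are hypothetical, the hash is a simple FNV-1a-style stand-in rather than anything the current sender uses, and the sketch assumes a result kpid identifies at most one send (sends without a result promise would need another key).

```typescript
// Hypothetical sketch: not the existing otel-trace slog sender code.

// FNV-1a-style 32-bit hash over the UTF-16 code units of a string.
function fnv1a32(input: string, seed: number): number {
  let hash = seed >>> 0;
  for (let i = 0; i < input.length; i += 1) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Format a 32-bit unsigned value as exactly 8 lowercase hex characters.
function toHex8(value: number): string {
  return ('00000000' + value.toString(16)).slice(-8);
}

// OpenTelemetry span IDs are 8 bytes (16 hex characters).
// Two differently seeded 32-bit hashes are concatenated to fill them.
function spanIdForSend(resultKpid: string): string {
  const key = `send:${resultKpid}`;
  return toHex8(fnv1a32(key, 0x811c9dc5)) + toHex8(fnv1a32(key, 0x9e3779b9));
}
```

With this shape, the span for a send syscall and the link target computed while processing the matching delivery agree by construction, at the cost of a small collision risk that a real implementation would need to evaluate.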

The most important change is to switch from a trace relation where each delivery is linked to the temporally previous one to a causal one, where a notify delivery is linked primarily to a subscribe syscall (and secondarily to a resolve syscall), and a send delivery is linked to a send syscall.
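
As a rough sketch of that causal linking rule, the selection of link targets for a delivery could look like the following. All types and field names are hypothetical simplifications, not the real slog entry shapes, and the `role` field is an assumption about how primary and secondary links might be distinguished (for example as a link attribute).

```typescript
// Hypothetical, simplified shapes; real slog entries carry more fields.
type SpanRef = { traceId: string; spanId: string };
type Link = { target: SpanRef; role: 'primary' | 'secondary' };

type Delivery =
  | { kind: 'send'; sendSyscall: SpanRef }
  | { kind: 'notify'; subscribeSyscall: SpanRef; resolveSyscall?: SpanRef };

// Pick causal link targets for a delivery instead of linking it
// to whichever delivery happened to precede it in time.
function causalLinks(delivery: Delivery): Link[] {
  if (delivery.kind === 'send') {
    // A send delivery is caused by the send syscall that queued it.
    return [{ target: delivery.sendSyscall, role: 'primary' }];
  }
  // A notify delivery is primarily caused by the subscribe syscall...
  const links: Link[] = [
    { target: delivery.subscribeSyscall, role: 'primary' },
  ];
  // ...and secondarily by the resolve syscall, when it is known.
  if (delivery.resolveSyscall) {
    links.push({ target: delivery.resolveSyscall, role: 'secondary' });
  }
  return links;
}
```

The resulting list maps naturally onto span links, which is where using multiple links per span would come in.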

I started down that path while working on #5724 and documented some ideas in mhofman/5724-otel-refactor. I think I may still have some related attempts somewhere.

Security Considerations

None, telemetry

Scaling Considerations

Avoid large tables or complex lookups for correlation

Test Plan

Use mainnet slogs and ingest script.
