hostmetrics featuring ebpf - resource efficient scraping #32446

Open
cforce opened this issue Apr 16, 2024 · 14 comments
Comments

@cforce

cforce commented Apr 16, 2024

Component(s)

receiver/hostmetrics

Is your feature request related to a problem? Please describe.

Create a more hardware-resource-efficient alternative for getting the host metrics (kernel/process) via eBPF.
If you aren't familiar with eBPF, you can read more about it on ebpf.io, but in short, eBPF allows us to execute sandboxed programs that extend the Linux kernel without having to change it. We can use eBPF to attach to a tracepoint event when a specific system call is made by a process.

Describe the solution you'd like

A: Run a native (C++) program externally (as a daemon?) and let it send to a receiver, e.g. https://github.com/Netflix/bpftop/
or
B: Integrate eBPF scraping into Go (might need a target-platform-dependent build), e.g. by running "eBPF trace scripts" supplied as config
see

Our new sensor uses Inspektor Gadget as its instrumentation layer - allowing us to collect events at the Kernel space and analyze them to provide security insights on workloads running in Kubernetes (insights include those from the host as well as at the container level).

Collect metrics via BPF traces and package them as OTLP metric messages.

Additional context

@cforce added the enhancement and needs triage labels Apr 16, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@kernelpanic77
Contributor

kernelpanic77 commented May 1, 2024

@dmitryax @braydonk any thoughts here? This sounds like a pretty cool enhancement.

@braydonk
Contributor

braydonk commented May 1, 2024

This sounds like a cool idea, but I don't think it should be in this receiver or in contrib for a few reasons:

  • Linux is not the only platform the collector needs to run on, and the current implementation has minimal need for platform specific code. Introducing an eBPF mode would make for a lot of Linux-specific code.
  • The Collector Working Group decided that, in general, we don't want any components in the official collector repos to start subprocesses. The best record I can find of that discussion, which took place in a working group meeting, is "Create subprocess extension" #6467. That decision would exclude the idea of running something like bpftop directly from the collector.
  • There are currently no components that I know of installing eBPF scripts in this repository, so it does open up a question with the Collector maintainers whether this is something that they are okay with. I am personally of the opinion that eBPF solutions should be external to the collector, but I don't know what the other maintainers think.
  • The hostmetrics receiver has stayed largely the same for a long time. While a breaking change to the way the metrics look is on the way thanks to the System Semantic Conventions work, that won't change the way the hostmetrics receiver works in general. This would be a large departure.

So I don't think this should be added to this receiver, but it's not a bad idea. This could instead be a receiver that is published independently that people can include in their own collector builds with the OpenTelemetry Collector Builder. Even better if that receiver could implement the Process Semantic Conventions that are nearing stabilization.

@crobert-1 removed the enhancement and needs triage labels May 2, 2024
@kernelpanic77
Contributor

@braydonk Yes, that makes sense. We should not start any subprocesses from the collector repository, but I would be willing to contribute to a custom receiver specifically for eBPF. I think we can use this thread to discuss it.

@braydonk
Contributor

braydonk commented May 6, 2024

@kernelpanic77 Some sort of eBPF receiver could be very cool. As long as:

  • It works with some sort of existing solution in an obvious way (i.e. something like bpftop or another eBPF tool)
  • It does not start subprocesses
  • (Arguably) it does not install its own hooks unless specifically configured to do so (there's no precedent for this yet, but it logically fits with not wanting to start subprocesses)
  • It does not require CGO

Then that sort of receiver could work well in contrib.
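
The "no hooks unless specifically configured" criterion above could be enforced at config-validation time. The following is only a minimal sketch under stated assumptions: the `Config` type, its field names, and the idea of reading a BPF map already pinned by an external tool are all hypothetical and do not come from this thread or from any existing component.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical configuration for an eBPF-based receiver.
// None of these field names exist in a real collector component.
type Config struct {
	// PinnedMapPath points at a BPF map that an external tool has already
	// pinned. Reading it requires no hooks of our own.
	PinnedMapPath string
	// AttachHooks must be set explicitly before the receiver installs any
	// probes itself. The zero value (false) is the safe default.
	AttachHooks bool
	// Programs lists the probes to attach. Only valid with AttachHooks.
	Programs []string
}

// errNoSource is returned when the config names no data source at all.
var errNoSource = errors.New("either a pinned map path or attach_hooks must be set")

// Validate rejects configurations that would attach hooks implicitly.
func (c Config) Validate() error {
	if !c.AttachHooks && len(c.Programs) > 0 {
		return fmt.Errorf("programs %v listed but attach_hooks is false", c.Programs)
	}
	if c.AttachHooks && len(c.Programs) == 0 {
		return errors.New("attach_hooks is true but no programs are listed")
	}
	if !c.AttachHooks && c.PinnedMapPath == "" {
		return errNoSource
	}
	return nil
}
```

With this shape, a default-constructed config never attaches anything, and listing programs without the explicit opt-in is reported as an error rather than silently honored.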

You can find full guidance for introducing new components here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md#adding-new-components

Something that the docs above don't explicitly point out is that you can satisfy those 4 criteria for implementing a component in your own repo and build that into a collector yourself to experiment with implementations, which would help with the process stated in that document of how to add new components to contrib.

I'm not an authority on any of this, so there may be other restrictions I'm not mentioning, or even previous discussions of a component like this that I am not privy to. It may be a good idea to attend the Collector Working Group Meeting on Wednesday at 16:00 UTC. You can find the Zoom link in the OpenTelemetry Calendar. You are welcome to join and add to the agenda.

@cforce
Author

cforce commented May 23, 2024

To implement the Start method for an eBPFReceiver in the OpenTelemetry Collector, we would need to load and attach eBPF programs. What about using the Cilium eBPF library to load eBPF bytecode (or compile it from C) and then attach uprobes and uretprobes to the desired functions?

Once the probes are attached, you'll need to collect the data they generate. This usually involves reading data from a BPF map or receiving events from the kernel.

Then transform the collected data into a format that OpenTelemetry understands and send it to the collector.

To avoid starting subprocesses for loading and compiling eBPF programs, can we embed the eBPF bytecode directly into the Go application to ensure the eBPF bytecode is part of the Go binary and can be loaded directly without external dependencies or subprocesses?
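
The flow described above (start, read from a BPF map, transform, hand off) could look roughly like the sketch below. It is not a working eBPF receiver: `mapReader`, `metricPoint`, `fakeReader`, and the metric name are hypothetical stand-ins so that the example builds with the standard library only. A real implementation would back `mapReader` with an eBPF library and use the collector's own component and metric types. On the embedding question, Go's standard `embed` package is one way to ship a precompiled object file inside the binary, though that part is not shown here.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// sample is one reading taken from a BPF map, e.g. a per-PID counter.
type sample struct {
	PID   uint32
	Count uint64
}

// mapReader is a hypothetical abstraction over "read the BPF map".
type mapReader interface {
	Read() ([]sample, error)
	Close() error
}

// metricPoint is a simplified stand-in for an OTLP data point.
type metricPoint struct {
	Name  string
	Attrs map[string]uint32
	Value uint64
	Time  time.Time
}

// toPoints converts raw samples into metric points.
// The metric name is made up for illustration.
func toPoints(samples []sample, now time.Time) []metricPoint {
	points := make([]metricPoint, 0, len(samples))
	for _, s := range samples {
		points = append(points, metricPoint{
			Name:  "example.process.events",
			Attrs: map[string]uint32{"pid": s.PID},
			Value: s.Count,
			Time:  now,
		})
	}
	return points
}

type ebpfReceiver struct {
	reader   mapReader
	interval time.Duration
	consume  func([]metricPoint) // next consumer in the pipeline
	cancel   context.CancelFunc
	wg       sync.WaitGroup
}

// Start launches the polling loop in a goroutine; no subprocess is started.
// The loop gets its own context so it outlives the Start call.
func (r *ebpfReceiver) Start(_ context.Context) error {
	runCtx, cancel := context.WithCancel(context.Background())
	r.cancel = cancel
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		ticker := time.NewTicker(r.interval)
		defer ticker.Stop()
		for {
			select {
			case <-runCtx.Done():
				return
			case now := <-ticker.C:
				samples, err := r.reader.Read()
				if err != nil {
					continue // a real receiver would log this
				}
				r.consume(toPoints(samples, now))
			}
		}
	}()
	return nil
}

// Shutdown stops the loop, waits for it to exit, and releases the reader.
func (r *ebpfReceiver) Shutdown(_ context.Context) error {
	if r.cancel != nil {
		r.cancel()
	}
	r.wg.Wait()
	return r.reader.Close()
}

// fakeReader is an in-memory mapReader used only to exercise the sketch.
type fakeReader struct {
	samples []sample
	closed  bool
}

func (f *fakeReader) Read() ([]sample, error) { return f.samples, nil }
func (f *fakeReader) Close() error            { f.closed = true; return nil }
```

Keeping the map access behind an interface means the polling and conversion logic can be tested without kernel privileges, and the Linux-specific code stays in one place.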

@cforce
Author

cforce commented Jun 8, 2024

"The continuous profiling agent, that Elastic is donating, is based on eBPF and by that a whole system, always-on solution that observes code and third-party libraries, kernel operations, and other code you don’t own. It eliminates the need for code instrumentation (run-time/bytecode), recompilation, or service restarts with low overhead, low CPU (~1%), and memory usage in production environments."
https://opentelemetry.io/blog/2024/elastic-contributes-continuous-profiling-agent/
How does this eBPF-based profiling ultimately integrate with the collector? Doesn't it face very similar challenges to those mentioned above, so that the solutions already exist?

@braydonk
Contributor

I don't know much about the agent or any particular plans to integrate it. If I had to guess, it's most plausible that this agent won't specifically integrate with the Collector, rather support the Collector as it would any OTLP destination (once the Collector allows OTel Profiles as a signal). There may be other plans I'm not aware of, but that is what would make the most sense to me. And in that scenario, the restrictions we talk about that make eBPF tricky for the OTel Collector wouldn't apply.

@cforce
Author

cforce commented Jun 23, 2024

@cforce
Author

cforce commented Jul 4, 2024

Related CPU optimization: shirou/gopsutil#361

@cforce
Author

cforce commented Aug 2, 2024

Integrating Beyla as an agent could be the way forward for kernel-level, resource-optimized, zero-code instrumentation:
#34321

@cforce
Author

cforce commented Sep 17, 2024

related - profiling overhead - golang/go#57175

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 18, 2024
@cforce
Author

cforce commented Nov 22, 2024

related open-telemetry/community#2406

@github-actions github-actions bot removed the Stale label Nov 23, 2024