
Summary metric #11

Open
wojtekmach opened this issue Dec 6, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@wojtekmach
Contributor

I noticed that summary is not yet supported.

@bryannaegele
Collaborator

Indeed. I started working on it in October but events at work prevented me from finishing. This is at the top of my list.

@bodgix

bodgix commented Apr 21, 2020

Hello @bryannaegele how are you getting on with this feature? I would like to help if you need a hand.

I started studying the code and I think the Aggregator module would need to be updated to support summaries, or perhaps protocols could be used to add polymorphism to the aggregator.

If you could use a hand, do you have any notes or a working branch with the ideas or plan you have for this feature? That would help someone get up to speed and avoid going in a direction you wouldn't like.

@bryannaegele
Collaborator

Hey @bodgix. I have an idea I was working on locally but haven't pushed it up yet. The biggest holdup is that it will require a large bump in the minimum supported version of Elixir to do this efficiently, since the implementation relies on atomics, which are OTP 21.3+. I simply couldn't find a way to efficiently support pValues at scale without relying on atomics or resorting to approximations such as the technique outlined in https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/quantiles.pdf.

I think it's safe at this point to increase the required version, so I'll get it in a reviewable state and pushed up for comment and testing.

@cohawk

cohawk commented May 3, 2020

Thanks Bryan. I'd rather have a working version on a recent release for usability, and then work backwards on compatibility from there. Just my opinion.

@davydog187

davydog187 commented Jun 9, 2020

@bryannaegele I'm curious what minimum supported version this project is targeting. OTP 21.3 is over a year old at this point: http://erlang.org/download/otp_src_21.3.readme

I see that the current supported version is Elixir 1.6, which just misses this OTP 21.3 release. When does the telemetry project decide to deprecate support for older versions of Elixir?

@bryannaegele
Collaborator

The main pieces support back to OTP 19. With the release of 23, that will go to 20 (4 versions). The goal of all the major core libraries in the Elixir ecosystem is to support 4 major versions (OTP and Elixir). Supporting more Elixir versions is less difficult in some respects than OTP versions.

With the reporters, especially this one, I'm less concerned with still supporting OTP 20 if we want to support this feature.

A question for the crowd that is related - to support this as efficiently as possible, I need to use atomics and that comes with the additional restriction that only integer measurements would be supported. That would fit the vast majority of use cases, but it would be a conscious limitation. Otherwise, we're stuck with ETS and the lower performance + higher memory demands. Is anybody opposed to only supporting integer values?
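A rough sketch of the atomics approach, assuming a hypothetical two-slot array holding a count and a running sum (the layout and names here are illustrative, not the library's actual design). `:atomics` only stores 64-bit integers, which is exactly the source of the integer-only restriction — float measurements would have to be pre-scaled, e.g. converted to microseconds:

```elixir
# Slot 1 holds the observation count, slot 2 the running sum.
# :atomics (OTP 21.3+) is fixed-size, mutable, and integer-only.
ref = :atomics.new(2, signed: true)

record = fn ref, measurement ->
  :atomics.add(ref, 1, 1)            # increment the count
  :atomics.add(ref, 2, measurement)  # add the measurement to the sum
end

record.(ref, 120)
record.(ref, 80)

count = :atomics.get(ref, 1)  # => 2
sum = :atomics.get(ref, 2)    # => 200
```

Because the writes are lock-free, many processes can record measurements concurrently without the contention an ETS-based counter table or a single GenServer would see.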

And a progress update - work on this has started but it's taking a while since handling of measurements, aggregating, and storage of the aggregations is quite different than any of the other metric types.

@davydog187

All of the summary metrics I can think of wanting would be based on native units. Admittedly, I haven't thought beyond that too deeply.

@ghost

ghost commented Jul 5, 2020

What would be the outstanding work to be done to have this implemented? My understanding is that the current proposed limitation of integer values is not a problem, because summaries are in most (if not all) cases around numeric values.

@haljin

haljin commented Oct 9, 2020

@bryannaegele How is this progressing? Do you need any help?

@bryannaegele
Collaborator

Hey all. I have been hesitant to implement this directly in this library without finding an efficient way to accomplish it and it will likely involve a great deal of work. The limited time I have available is focused on OpenTelemetry and other projects, so I open this up to the community. There are a number of challenges that must be addressed. I'll lay out the problem-space as I understand it and potential solutions. I'll aim to be exhaustive so that y'all can correct me on my misunderstandings 😄

Prometheus Histograms vs Summaries

The manner in which the aggregator handles Histograms works due to their commutative properties. We can aggregate the newly observed values and add them to the previous summary, removing any need to keep the samples around. We know the buckets and tags ahead of time, so the whole process is relatively efficient.
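As a small illustration of that commutativity, merging two hypothetical bucket-count maps (keyed by bucket upper bound) is just element-wise addition, so partial aggregates can be combined in any order and the samples themselves can be discarded:

```elixir
# Hypothetical histogram bucket counts, keyed by upper bound.
h1 = %{100 => 3, 250 => 5, :infinity => 1}
h2 = %{100 => 2, 250 => 0, :infinity => 4}

merge = fn a, b -> Map.merge(a, b, fn _bound, v1, v2 -> v1 + v2 end) end

merge.(h1, h2)  # => %{100 => 5, 250 => 5, :infinity => 5}
```

Since `merge.(h1, h2) == merge.(h2, h1)`, new observations can be folded into the running aggregate as they arrive.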

The base requirement for a Summary in Prometheus is an aggregation of sum and count for each time series. This is relatively straightforward to implement with the existing code since you could use the implementation for counts and sums under the hood. The trick is you'd then need to store the aggregations in time buckets and use the prior aggregations based on the time windows provided by the user, e.g. store aggregations for the sets and, given a 1 minute window, add the prior aggregations which fall into that window.
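A minimal sketch of that time-bucketing idea, assuming hypothetical 10-second buckets that each hold a `{sum, count}` pair, and a window query that adds up the buckets falling inside the requested span (all names here are illustrative):

```elixir
# Round a timestamp down to its bucket boundary.
bucket_for = fn ts, width -> div(ts, width) * width end

# Fold an observation into the {sum, count} aggregate for its bucket.
observe = fn store, ts, value ->
  Map.update(store, bucket_for.(ts, 10), {value, 1}, fn {s, c} -> {s + value, c + 1} end)
end

# Answer a window query by summing every bucket inside the span.
window = fn store, now, span ->
  store
  |> Enum.filter(fn {bucket, _agg} -> bucket > now - span end)
  |> Enum.reduce({0, 0}, fn {_bucket, {s, c}}, {sum, count} -> {sum + s, count + c} end)
end

store =
  %{}
  |> observe.(40, 9)   # outside the 60s window below, ignored by the query
  |> observe.(130, 5)
  |> observe.(150, 7)
  |> observe.(175, 3)

window.(store, 175, 60)  # => {15, 3}
```

The janitor process mentioned below would simply drop bucket keys older than the longest configured window.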

On top of that, we'd need a janitor process to clean out aggregates which have expired beyond all specified time periods. It is a common practice for users to specify more than one time window for observations, i.e. 1-minute and 5-minute p99s.

All of this is what I would aim for in a base implementation to satisfy Prometheus' requirements and something that could be aimed for in an initial implementation, but likely does not satisfy what users want - pValues.

Data and Duplication Across Tags and Time

Each metric + label + label_value set is a distinct time series requiring a separate calculation. I won't go into depth on accurately calculating a pValue (there's plenty of literature on that), but, in general, we have to sort all of the values and determine the value within the specified bucket, i.e. p99 requires 100 buckets, p99.9 requires 1000, etc.

Since these are sliding windows and both the number of samples and their values can skew heavily within any given part of the window, you must keep all samples around and merge sort them each time you want to calculate it.

There are definitely some tricks around this to make it somewhat more efficient, but you can likely see how this becomes a very intensive set of operations at scale in the BEAM with a strict and naive implementation and a tag or two involved. Assuming a service with 100,000 RPS, a 5-minute window comes to 30,000,000 samples for a single metric with no tags.
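A strictly naive pValue calculation over a retained window of samples looks like this (a sketch for illustration, not the library's implementation). The full sort over every retained sample on every calculation is what makes this so expensive at the scale described above:

```elixir
# Naive quantile: sort the whole window and index into it.
quantile = fn samples, q ->
  sorted = Enum.sort(samples)
  index = min(length(sorted) - 1, trunc(q * length(sorted)))
  Enum.at(sorted, index)
end

samples = Enum.shuffle(Enum.to_list(1..1000))
quantile.(samples, 0.99)  # => 991
```

With 30,000,000 samples in a window, that sort has to happen on every scrape for every time series, which is why the approximation techniques linked earlier exist.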

Potential Solution

The basic design I originally landed on to do this efficiently with a strictly naive implementation likely involves heavy use of atomics, but it could probably be done with ETS set tables. Additionally, each time window should probably require a separate metric definition to ease implementation.

I do not have any solution for how to make any of this work with the current aggregate mechanisms and exporter. I took a stab at this a few times and there's no clear path that I could find. Hopefully somebody else can.

Overall, I'm not convinced that the amount of work and change required is worth it. My inclination is this would be better handled through opentelemetry. :/

@dantswain

Thanks for the update @bryannaegele !

I have a couple ideas/questions:

  • How hard would it be to allow for a plugin approach here? I.e., let the user implement their own statistics collection to fit their needs. It's noble to want to provide something here that would work in the highest-load situations, but for a lot of us it may be sufficient to implement something that would be simple and not need to stand up to such a demanding environment. If this was feasible, I'd be willing to implement and open-source something that would work for more moderate loads.

  • How does OpenTelemetry mesh with prometheus and Erlang's telemetry module? When you say "My inclination is this would be better handled through opentelemetry", it makes me wonder if it's inevitable that this repo will be replaced by something else in the not-too-distant future, which might impact peoples' willingness to contribute.

@bryannaegele
Collaborator

I think a generic approach is fine with caveats as to what users should look for around performance. If you have a design in mind, I'm happy to give input before undertaking it.

For the Otel bit: there are different considerations for what someone's needs are, and there's no clear migration path for Otel metrics anytime soon, so don't worry about this library going away. :)
