Skip to content
This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

stats: almost all of the data is inaccurate? #58

Closed
marwan-at-work opened this issue Nov 13, 2019 · 8 comments
Closed

stats: almost all of the data is inaccurate? #58

marwan-at-work opened this issue Nov 13, 2019 · 8 comments
Labels

Comments

@marwan-at-work
Copy link
Contributor

Right now, there's a confirmed bug in OpenCensus when it comes to the DistributionData's Min/Max fields.

To see what the bug is and how it will be solved, see here: census-instrumentation/opencensus-go#1181

The TL;DR is that OpenCensus never restarts the Min/Max values (or any other values really) in between Flush Intervals and therefore the Min will remain the Min for the entirety of the process (so does the max). Therefore, on the DataDog UI, you'll never see a spike in Min's or a drop in Max's, as long as the process is alive. The Mean is also the Mean of the entire process and not every flush interval...So if you have a big spike in latency, it will be diluted by the many hours of regular traffic that will always be part of the Mean.

However, this one bug brought up a lot of underlying potential inaccuracies that makes almost everything in this exporter (on the stats side), inaccurately represented. To get all of the details, see the discussion starting with this comment: census-instrumentation/opencensus-go#1182 (comment)

It's a long discussion but here's the gist:

The way OpenCensus does Count(), Mean() and Sum() aggregations, are in a cumulative way while if the process restarts then those cumulative numbers are reset back to zero and DataDog backend needs to catch that by doing some magical time/value calculation and adjusting the values accordingly.

The Count() and Sum() can be solved by just remembering the last number and subtracting the current value from the previous value. That said, the OpenCensus maintainer did advice against keeping state in the exporter and I'm not yet sure why.

As for Mean(), I'm not sure if that can be solved and if so how.

That said, I just wanted to bring the maintainers attention to these issues to see if I'm missing something.

@MichaelMure
Copy link

Possibly not the most telling, but here is an example of the same metric (average submetric of a distribution) through OpenCensus and directly through Dogstatd.
image

@gbbr gbbr added the stats label Feb 13, 2020
@AlexandreYang
Copy link
Member

Hi @marwan-at-work, sorry for the late reply.

If I understand correctly, this PR is an attempt to fix this issue ? census-instrumentation/opencensus-go#1182 If that's the case, fixing the issue upstream seems the way to go. Or you mean, something should also be done at opencensus-go-exporter-datadog level?

Seems this upstream PR is going well and just need few minor updates. Beside those minor updates, is there anything missing?

@marwan-at-work
Copy link
Contributor Author

@AlexandreYang that PR would only solve min/max but it would not solve Count/Gauge which is the most important in my opinion :)

The only solution is to maintain a complex state in the exporter itself so that it would subtract/get-rid-of monotonicity

@AlexandreYang
Copy link
Member

@marwan-at-work I just read the quite long discussion census-instrumentation/opencensus-go#1182 (comment)

Just added a comment here census-instrumentation/opencensus-go#1182 (comment)

If it's possible to implement Delta types upstream, that would help solving the issue not only for this exporter but for other exporters too. That will also avoid having to keep a state in this exporter (and potentially other exporter for the same reason).

@marwan-at-work
Copy link
Contributor Author

@AlexandreYang a problem with adding Delta types is that the OpenCensus specification must be updated and all implementations need to do it. The work for OC is paused in favor of OpenTelemetry but if the OC team ends up in favor of doing it, I'm on board.

@AlexandreYang
Copy link
Member

Looks like there are some discussions going on about this topic on OpenTelemetry here: open-telemetry/opentelemetry-specification#467 and open-telemetry/opentelemetry-specification#430 (comment)

@AlexandreYang
Copy link
Member

The work for OC is paused in favor of OpenTelemetry but if the OC team ends up in favor of doing it, I'm on board.

@marwan-at-work If possible, it might be worthwhile if you can summary the issue you had in a general way on this open issue open-telemetry/opentelemetry-specification#467 to improve the chance to get that included in the specifications. (the issue seems quite active, only 5 days ago)

@coignetp
Copy link
Contributor

The best way to solve this issue would probably have been to add the Delta types upstream, and it looks like it won't be done in Opencensus.
This issue seems to be solved in OpenTelemetry which Datadog supports and maintains , so i'm closing this issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants