This repository has been archived by the owner on Nov 7, 2022. It is now read-only.

Fix issues with initial start time and restarts in the Prometheus receiver. #598

Merged
merged 14 commits into census-instrumentation:master on Jul 18, 2019

Conversation

dinooliva
Contributor

Code is complete - ptal. Currently working on updating/adding unit tests.

return true
}

func (ma *MetricsAdjuster) adjustPoints(metricType metricspb.MetricDescriptor_Type, current, initial []*metricspb.Point) bool {
Contributor

You can probably get rid of adjustPoints. In adjustTimeseries:

func (ma *MetricsAdjuster) adjustTimeseries(metricType metricspb.MetricDescriptor_Type, current, initial *metricspb.TimeSeries) bool {
	cPoints := current.GetPoints()
	iPoints := initial.GetPoints()
	if len(cPoints) != 1 || len(iPoints) != 1 {
		// log if needed.
		return false
	}

	if !ma.adjustPoint(metricType, cPoints[0], iPoints[0]) {
		return false
	}
	current.StartTimestamp = initial.StartTimestamp
	return true
}

Contributor Author

I've simplified it - ptal.

// note: sum of squared deviation not currently supported
initialDist := initial.GetDistributionValue()
currentDist := current.GetDistributionValue()
if currentDist.Count < initialDist.Count {
Contributor

It should check whether either the count or the sum is less than the previous value.
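For concreteness, a minimal sketch of that check against the opencensus-proto types (an editorial illustration that assumes the adjuster caches the previously observed distribution per timeseries, not the code in this PR):

package adjuster

import (
	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// distributionReset reports whether a cumulative distribution appears to have
// been reset (e.g. the scraped process restarted): either the count or the sum
// went backwards relative to the previously observed value.
func distributionReset(current, previous *metricspb.DistributionValue) bool {
	return current.GetCount() < previous.GetCount() ||
		current.GetSum() < previous.GetSum()
}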

Contributor Author

Done

func (ma *MetricsAdjuster) adjustBuckets(current, initial []*metricspb.DistributionValue_Bucket) bool {
	if len(current) != len(initial) {
		// this shouldn't happen
		ma.logger.Info("len(current buckets) != len(initial buckets)", len(current), len(initial))
Contributor

should you change this to Error?

Contributor Author

I generally avoid using Error unless something major is going on, but that's based on Google prod convention - what is the oc agent's convention for when Error is appropriate?

receiver/prometheusreceiver/internal/transaction.go (outdated, resolved)
@fivesheep
Contributor

Oops, looks like this PR is addressing the same issue as #597, which I submitted earlier. I had actually tried a similar approach at the very beginning by caching the first generated TimeSeries; however, it's not going to work properly in some cases, including:

  1. as @rghetia also mentioned, to detect if a remote server has restarted, one needs to compare the current value with the last observed value
  2. the program also needs to clean up the cache, as timeseries can come and go. For example, in our use case we mainly use ocagent to scrape cadvisor metrics, which include a lot of container spec metrics whose lifecycles are tied to the containers rather than to cadvisor.
  3. you might observe new labels in subsequent runs. Take a look at the following example:
    scrape run 1
# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760009

from an Appender's perspective, only the labels with non-empty values (failure_type, id and scope here) can be observed, as any labels with empty values are removed before hitting the Add/AddFast method of an appender

then in the 2nd run, you get a new series in which some of the previously empty labels are filled:

# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760010
container_memory_failures_total{failure_type="pgfault",id="/docker/6a26e59b145922c813444ca59b46cc3186e6e64064744f973b59f81dbd913fa8",image="somedocker-image",name="some-name",scope="container"} 1.877625e+06

now the timeseries will have all 4 labels in the 2nd run, and it will be challenging to link them back. Not only the label values but also the label order matters.

  4. for any failed scrape (say the remote endpoint returning some temporary errors), the Prometheus scraper library will still feed data to the appender with the metric labels from its cache and a NaN value. The appender also needs to filter out such data (see the sketch right after this list).
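A minimal sketch of that filtering (the sample type and helper below are hypothetical, not types from the Prometheus library or from this PR; real code would also have to recognize Prometheus's special staleness NaN marker):

package adjuster

import "math"

// sample is a hypothetical stand-in for what an appender receives: a set of
// labels plus a float64 value.
type sample struct {
	labels map[string]string
	value  float64
}

// dropFailedScrapeSamples filters out NaN-valued samples, which the scrape
// loop emits for targets that failed to respond.
func dropFailedScrapeSamples(in []sample) []sample {
	out := make([]sample, 0, len(in))
	for _, s := range in {
		if math.IsNaN(s.value) {
			continue
		}
		out = append(out, s)
	}
	return out
}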

You might want to take a look at the PR I have submitted and build on the work there. I have also added quite a lot of tests trying to capture all these corner cases, as well as a redesigned end-to-end test which produces more stable results and simulates edge cases like the remote server not responding properly.

@rghetia
Contributor

rghetia commented Jul 2, 2019

#597 covers more cases than #598. However, in #598 the MetricsAdjuster is independent of scraping, which is useful for another receiver (the envoy prometheus receiver) that converts from protobuf (instead of text) to opencensus-proto.

@flands flands modified the milestone: 0.1.9 Jul 2, 2019
@dinooliva dinooliva changed the title WIP: Metrics Adjuster WIP: Fixes initial start time/restarts issue with Prometheus receiver. Jul 11, 2019
@dinooliva
Contributor Author

@fivesheep - thank you for the detailed response and my apologies for the delay in responding. As @rghetia mentioned, I took this approach because we'd like to reuse the logic across both the text and proto versions of the prometheus receiver. In response to your particular comments:

  1. You're right that looking at the previous value is probably a better approach to detecting a reset than looking at the initial value - that should be straightforward to add to this approach (see the sketch after this list).

  2. There's not really a perfect solution to this problem, but adding a ttl field (or something similar) should be straightforward.

  3. That's a very interesting point that I need to understand better, so I can't yet say whether it's feasible to add to this approach - I'll take a look over your pr shortly.

  4. That's a great point too - that should also be straightforward to add.
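As a rough sketch of points 1 and 2 (the field names and TTL handling are illustrative only, not the eventual implementation):

package adjuster

import "time"

// timeseriesState is the per-series state the adjuster would cache, keyed by
// the timeseries signature.
type timeseriesState struct {
	startTime time.Time // start time reported downstream
	previous  float64   // last cumulative value observed for this series
	expiresAt time.Time // point 2: entry is dropped if not refreshed in time
}

// adjust updates the cached state for one cumulative series and returns the
// start time to attach to the current point. A value that goes backwards is
// treated as a reset (point 1), so the series restarts at the current scrape.
func (s *timeseriesState) adjust(current float64, now time.Time, ttl time.Duration) time.Time {
	if current < s.previous {
		s.startTime = now
	}
	s.previous = current
	s.expiresAt = now.Add(ttl)
	return s.startTime
}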

@fivesheep
Contributor

@dinooliva it does make sense to have a generic solution that can adjust incoming metric values for different Prometheus protocols. I am not sure if it's possible to move it one more layer up and make it a generic solution for any metrics receiver that needs to adjust values, with a simple interface like:

type MetricsAdjuster interface {
    AdjustMetric(*data.MetricsData) *data.MetricsData
}

and make it an option for the metrics receiver factory.
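A sketch of how that could be wired up, reusing the data.MetricsData type from the snippet above (the import path, Option type and WithMetricsAdjuster helper are assumptions for illustration, not an existing factory API):

package receiver

import "github.com/census-instrumentation/opencensus-service/data"

// MetricsAdjuster mirrors the interface proposed above.
type MetricsAdjuster interface {
	AdjustMetric(*data.MetricsData) *data.MetricsData
}

// Option and WithMetricsAdjuster are hypothetical functional options for a
// metrics receiver factory.
type Option func(*config)

type config struct {
	adjuster MetricsAdjuster
}

func WithMetricsAdjuster(a MetricsAdjuster) Option {
	return func(c *config) { c.adjuster = a }
}

// adjust shows where the adjuster would sit: right before the receiver hands
// the converted data to the next consumer.
func (c *config) adjust(md *data.MetricsData) *data.MetricsData {
	if c.adjuster == nil {
		return md
	}
	return c.adjuster.AdjustMetric(md)
}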

Other than that, in #597 I have also refactored the metricsbuilder and added a layer called metric family, which makes the code cleaner and easier to understand - you might also want to take a look at it.

@dinooliva
Contributor Author

@fivesheep - I spent some time looking over your solution and your notes. I think it may be best to go with a combined solution. My module is strictly for adjusting start times and values based on discovered initial timeseries and resets.

Dealing with newly discovered labels and empty histograms/summaries is not really something the metrics adjuster should handle (I just have to make sure that the signature calculation works appropriately for empty label values), but those issues do need to be dealt with in the metricsbuilder, as you have done.
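For illustration, a small signature calculation that treats an empty label value the same as an absent label and is independent of label order (a sketch of the idea, not the signature code merged in this PR):

package adjuster

import (
	"sort"
	"strings"
)

// signature builds a canonical key for a timeseries from its label map,
// skipping empty values and sorting by name so that label order and empty
// labels do not change the key.
func signature(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k, v := range labels {
		if v == "" {
			continue // an empty label is treated the same as an absent one
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(labels[k])
		b.WriteByte(',')
	}
	return b.String()
}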

I plan to add a couple of commits that deal with issues 1 & 2. Issue 3, I think, should be handled by the metricsbuilder, and issue 4 can also be dealt with using your solution.

@codecov

codecov bot commented Jul 17, 2019

Codecov Report

Merging #598 into master will increase coverage by 0.28%.
The diff coverage is 85.13%.


@@            Coverage Diff             @@
##           master     #598      +/-   ##
==========================================
+ Coverage   68.61%   68.89%   +0.28%     
==========================================
  Files          91       92       +1     
  Lines        5939     6064     +125     
==========================================
+ Hits         4075     4178     +103     
- Misses       1650     1668      +18     
- Partials      214      218       +4
Impacted Files Coverage Δ
receiver/prometheusreceiver/internal/ocastore.go 72.41% <100%> (+0.98%) ⬆️
...iver/prometheusreceiver/internal/metricsbuilder.go 100% <100%> (ø) ⬆️
receiver/prometheusreceiver/metrics_receiver.go 72.13% <100%> (+1.95%) ⬆️
...eceiver/prometheusreceiver/internal/transaction.go 89.06% <76.19%> (-7.31%) ⬇️
...er/prometheusreceiver/internal/metrics_adjuster.go 85.21% <85.21%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 35b1e1c...84db0f8.

@dinooliva dinooliva changed the title WIP: Fixes initial start time/restarts issue with Prometheus receiver. Fixes initial start time/restarts issue with Prometheus receiver. Jul 17, 2019
@dinooliva dinooliva changed the title Fixes initial start time/restarts issue with Prometheus receiver. Fix issues with initial start time and restarts in the Prometheus receiver. Jul 17, 2019
@dinooliva
Contributor Author

PTAL

This pr now addresses the issue with restarts and adds more unit tests. Once this pr has been merged, I will submit another pr to deal with cleaning up the cache (I don't want this pr to get too complicated).

Contributor

@rghetia rghetia left a comment

couple of nits. LGTM otherwise.

latestValue := initialValue
if initial != latest {
	latestValue += latest.GetDoubleValue()
}
Contributor

I find the word 'latest' confusing. Maybe 'previous' would be better.

Contributor Author

Done.

runScript(t, script)
}

func Test_multiMetrics(t *testing.T) {
Contributor

The MultiMetrics test covers everything that the individual metric tests cover. Maybe just keep the multi-metric test.

Contributor Author

I considered this as well - the individual tests are easier for debugging and for documenting what should happen for the different aggregation types.

Unless you object, I propose keeping them for now - if maintaining them is problematic, we can remove them subsequently.

@dinooliva
Contributor Author

@songy23 - ptal

Contributor

@songy23 songy23 left a comment

One minor question, otherwise LGTM.

}

// returns true if at least one of the metric's timeseries was adjusted and false if all of the timeseries are an initial occurrence or a reset.
// Types of metrics returned by prometheus that are supported:
Contributor

How about gauge and cumulative int64 values?

Contributor Author

These kinds of metrics aren't generated by the Prometheus -> OC Metrics translation so I didn't add support for them.

@songy23 songy23 merged commit d4f12f7 into census-instrumentation:master Jul 18, 2019
@dinooliva
Contributor Author

@fivesheep

We've merged this code, which resolves issue 1 of the 4 issues that you originally noted. Of the other 3 issues:

Issue 2 needs to be handled by the metrics adjuster and I'm working on a solution to it. I had been planning to clean up at the job level rather than at the timeseries level, but that doesn't seem to work for your use-case (cadvisor), so I will look into handling individual timeseries.
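As a rough illustration of what per-timeseries cleanup could look like (the names and TTL policy below are placeholders, not the eventual implementation):

package adjuster

import "time"

type cacheEntry struct {
	lastSeen time.Time
	// adjusted start time, previous values, etc. would live here
}

type timeseriesCache struct {
	entries map[string]*cacheEntry // keyed by timeseries signature
	maxAge  time.Duration          // e.g. a few scrape intervals
}

// gc drops entries for timeseries that have not been scraped recently, so
// short-lived series (such as per-container metrics) do not accumulate.
func (c *timeseriesCache) gc(now time.Time) {
	for sig, e := range c.entries {
		if now.Sub(e.lastSeen) > c.maxAge {
			delete(c.entries, sig)
		}
	}
}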

Issue 3 is separate from the metrics adjuster and your code already deals with it.

Issue 4 is also separate from the metrics adjuster and, again, your code deals with it.

Would you be able to update your pr to remove the issue 1 and issue 2 related code while keeping the rest?

@fivesheep
Contributor

@dinooliva sure, will do. Are we still going to have a release with these two PRs, or will the release only happen in the new project?

@dinooliva
Contributor Author

@fivesheep I think it would make sense to make a new release with these changes but I'll have to consult with the project owners to see what their current process is.

@songy23 - any thoughts?

@songy23
Contributor

songy23 commented Jul 19, 2019

We can still make patch releases for small improvements, bug fixes, etc. @flands, could you help?

@pjanotti

There are a few pending PRs that started before we made the announcement that we were in "maintenance mode". Assuming this can wait a bit, let's close those before cutting a new release.
