This repository has been archived by the owner on Nov 7, 2022. It is now read-only.

Fix issues with initial start time and restarts in the Prometheus receiver. #598

Merged
merged 14 commits into census-instrumentation:master on Jul 18, 2019

Conversation

dinooliva
Contributor

Code is complete - ptal. Currently working on updating/adding unit tests.

return true
}

func (ma *MetricsAdjuster) adjustPoints(metricType metricspb.MetricDescriptor_Type, current, initial []*metricspb.Point) bool {
Contributor

You can probably get rid of adjustPoints. In adjustTimeseries:

func (ma *MetricsAdjuster) adjustTimeseries(metricType metricspb.MetricDescriptor_Type, current, initial *metricspb.TimeSeries) bool {
	cPoints := current.GetPoints()
	iPoints := initial.GetPoints()
	if len(cPoints) != 1 || len(iPoints) != 1 {
		// log if needed.
		return false
	}

	if !ma.adjustPoint(metricType, cPoints[0], iPoints[0]) {
		return false
	}
	current.StartTimestamp = initial.StartTimestamp
	return true
}

Contributor Author

I've simplified it - ptal.

// note: sum of squared deviation not currently supported
initialDist := initial.GetDistributionValue()
currentDist := current.GetDistributionValue()
if currentDist.Count < initialDist.Count {
Contributor

It should check whether either the count or the sum is less than the previous value.
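For concreteness, a minimal sketch of that check against the opencensus-proto types (an editorial illustration that assumes the adjuster caches the previously observed distribution per timeseries, not the code in this PR):

package adjuster

import (
	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// distributionReset reports whether a cumulative distribution appears to have
// been reset (e.g. the scraped process restarted): either the count or the sum
// went backwards relative to the previously observed value.
func distributionReset(current, previous *metricspb.DistributionValue) bool {
	return current.GetCount() < previous.GetCount() ||
		current.GetSum() < previous.GetSum()
}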

Contributor Author

Done

func (ma *MetricsAdjuster) adjustBuckets(current, initial []*metricspb.DistributionValue_Bucket) bool {
	if len(current) != len(initial) {
		// this shouldn't happen
		ma.logger.Info("len(current buckets) != len(initial buckets)", len(current), len(initial))
Contributor

should you change this to Error?

Contributor Author

I generally avoid using Error unless something major is going on, but that's based on Google prod convention - what is the oc agent's convention for when Error is appropriate?

receiver/prometheusreceiver/internal/transaction.go (outdated, resolved)
@fivesheep
Contributor

Oops, looks like this PR is addressing the same issue as #597, which I submitted earlier. I had actually tried a similar approach at the very beginning by caching the first generated TimeSeries; however, it's not going to work properly in some cases, including:

  1. as @rghetia also mentioned, to detect if a remote server has restarted, one needs to compare the current value with the last observed value
  2. the program also needs to clean up the cache, as timeseries can come and go. For example, in our use case we mainly use ocagent to scrape cadvisor metrics, which include a lot of container spec metrics whose lifecycles are tied to the containers rather than to cadvisor.
  3. you might observe new labels in subsequent runs. Take a look at the following example:
    scrape run 1
# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760009

from an Appender's perspective, only the labels with non-empty values (failure_type, id and scope here) can be observed, as any labels with empty values are removed before hitting the Add/AddFast method of an appender

then in the 2nd run, you get a new series in which some of the previously empty labels are filled:

# HELP container_memory_failures_total Cumulative count of memory allocation failures.
# TYPE container_memory_failures_total counter
container_memory_failures_total{failure_type="pgfault",id="/",image="",name="",scope="container"} 760010
container_memory_failures_total{failure_type="pgfault",id="/docker/6a26e59b145922c813444ca59b46cc3186e6e64064744f973b59f81dbd913fa8",image="somedocker-image",name="some-name",scope="container"} 1.877625e+06

now the timeseries will have all 4 labels in the 2nd run, and it will be challenging to link them back. Not only the label values but also the label order matters.

  4. for any failed scrape (say the remote endpoint returning some temporary errors), the Prometheus scraper library will still feed data to the appender with the metric labels from its cache and a NaN value. The appender also needs to filter out such data (see the sketch right after this list).
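A minimal sketch of that filtering (the sample type and helper below are hypothetical, not types from the Prometheus library or from this PR; real code would also have to recognize Prometheus's special staleness NaN marker):

package adjuster

import "math"

// sample is a hypothetical stand-in for what an appender receives: a set of
// labels plus a float64 value.
type sample struct {
	labels map[string]string
	value  float64
}

// dropFailedScrapeSamples filters out NaN-valued samples, which the scrape
// loop emits for targets that failed to respond.
func dropFailedScrapeSamples(in []sample) []sample {
	out := make([]sample, 0, len(in))
	for _, s := range in {
		if math.IsNaN(s.value) {
			continue
		}
		out = append(out, s)
	}
	return out
}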

You might want to take a look at the PR I have submitted and build on the work there. I have also added quite a lot of tests trying to capture all these corner cases, as well as a redesigned end-to-end test which produces more stable results and simulates edge cases like the remote server not responding properly.

@rghetia
Contributor

rghetia commented Jul 2, 2019

#597 covers more cases than #598. However, in #598 the MetricsAdjuster is independent of scraping, which is useful for another receiver (the envoy prometheus receiver) that converts from protobuf (instead of text) to opencensus-proto.

@flands flands modified the milestone: 0.1.9 Jul 2, 2019
@dinooliva dinooliva changed the title WIP: Metrics Adjuster WIP: Fixes initial start time/restarts issue with Prometheus receiver. Jul 11, 2019
@dinooliva
Contributor Author

@fivesheep - thank you for the detailed response and my apologies for the delay in responding. As @rghetia mentioned, I took this approach because we'd like to reuse the logic across both the text and proto versions of the prometheus receiver. In response to your particular comments:

  1. You're right that looking at the previous value is probably a better approach to detecting a reset than looking at the initial value - that should be straightforward to add to this approach (see the sketch after this list).

  2. There's not really a perfect solution to this problem, but adding a ttl field (or something similar) should be straightforward.

  3. That's a very interesting point that I need to understand better, so I can't yet say whether it's feasible to add to this approach - I'll take a look over your pr shortly.

  4. That's a great point too - that should also be straightforward to add.
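As a rough sketch of points 1 and 2 (the field names and TTL handling are illustrative only, not the eventual implementation):

package adjuster

import "time"

// timeseriesState is the per-series state the adjuster would cache, keyed by
// the timeseries signature.
type timeseriesState struct {
	startTime time.Time // start time reported downstream
	previous  float64   // last cumulative value observed for this series
	expiresAt time.Time // point 2: entry is dropped if not refreshed in time
}

// adjust updates the cached state for one cumulative series and returns the
// start time to attach to the current point. A value that goes backwards is
// treated as a reset (point 1), so the series restarts at the current scrape.
func (s *timeseriesState) adjust(current float64, now time.Time, ttl time.Duration) time.Time {
	if current < s.previous {
		s.startTime = now
	}
	s.previous = current
	s.expiresAt = now.Add(ttl)
	return s.startTime
}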

@fivesheep
Contributor

@dinooliva it does make sense to have a generic solution that can adjust incoming metric values for different Prometheus protocols. I am not sure if it's possible to move it one more layer up and make it a generic solution for any metrics receiver that needs to adjust values, with a simple interface like:

type MetricsAdjuster interface {
    AdjustMetric(*data.MetricsData) *data.MetricsData
}

and make it an option for the metrics receiver factory.
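A sketch of how that could be wired up, reusing the data.MetricsData type from the snippet above (the import path, Option type and WithMetricsAdjuster helper are assumptions for illustration, not an existing factory API):

package receiver

import "github.com/census-instrumentation/opencensus-service/data"

// MetricsAdjuster mirrors the interface proposed above.
type MetricsAdjuster interface {
	AdjustMetric(*data.MetricsData) *data.MetricsData
}

// Option and WithMetricsAdjuster are hypothetical functional options for a
// metrics receiver factory.
type Option func(*config)

type config struct {
	adjuster MetricsAdjuster
}

func WithMetricsAdjuster(a MetricsAdjuster) Option {
	return func(c *config) { c.adjuster = a }
}

// adjust shows where the adjuster would sit: right before the receiver hands
// the converted data to the next consumer.
func (c *config) adjust(md *data.MetricsData) *data.MetricsData {
	if c.adjuster == nil {
		return md
	}
	return c.adjuster.AdjustMetric(md)
}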

Other than that, in #597 I have also refactored the metricsbuilder and added a layer called metric family, which makes the code cleaner and easier to understand - you might also want to take a look at it.

@dinooliva
Contributor Author

@fivesheep - I spent some time looking over your solution and your notes. I think it may be best to go with a combined solution. My module is strictly for adjusting start times and values based on discovered initial timeseries and resets.

Dealing with newly discovered labels and empty histograms/summaries is not really something the metrics adjuster should handle (I just have to make sure that the signature calculation works appropriately for empty label values), but those issues do need to be dealt with in the metricsbuilder, as you have done.
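For illustration, a small signature calculation that treats an empty label value the same as an absent label and is independent of label order (a sketch of the idea, not the signature code merged in this PR):

package adjuster

import (
	"sort"
	"strings"
)

// signature builds a canonical key for a timeseries from its label map,
// skipping empty values and sorting by name so that label order and empty
// labels do not change the key.
func signature(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k, v := range labels {
		if v == "" {
			continue // an empty label is treated the same as an absent one
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(labels[k])
		b.WriteByte(',')
	}
	return b.String()
}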

I plan to add a couple of commits that deal with issues 1 & 2. Issue 3, I think, should be handled by the metricsbuilder, and issue 4 can also be dealt with using your solution.

@codecov

codecov bot commented Jul 17, 2019

Codecov Report

Merging #598 into master will increase coverage by 0.28%.
The diff coverage is 85.13%.


@@            Coverage Diff             @@
##           master     #598      +/-   ##
==========================================
+ Coverage   68.61%   68.89%   +0.28%     
==========================================
  Files          91       92       +1     
  Lines        5939     6064     +125     
==========================================
+ Hits         4075     4178     +103     
- Misses       1650     1668      +18     
- Partials      214      218       +4
Impacted Files Coverage Δ
receiver/prometheusreceiver/internal/ocastore.go 72.41% <100%> (+0.98%) ⬆️
...iver/prometheusreceiver/internal/metricsbuilder.go 100% <100%> (ø) ⬆️
receiver/prometheusreceiver/metrics_receiver.go 72.13% <100%> (+1.95%) ⬆️
...eceiver/prometheusreceiver/internal/transaction.go 89.06% <76.19%> (-7.31%) ⬇️
...er/prometheusreceiver/internal/metrics_adjuster.go 85.21% <85.21%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 35b1e1c...84db0f8.

@dinooliva dinooliva changed the title WIP: Fixes initial start time/restarts issue with Prometheus receiver. Fixes initial start time/restarts issue with Prometheus receiver. Jul 17, 2019
@dinooliva dinooliva changed the title Fixes initial start time/restarts issue with Prometheus receiver. Fix issues with initial start time and restarts in the Prometheus receiver. Jul 17, 2019
@dinooliva
Contributor Author

PTAL

This pr now addresses the issue with restarts and adds more unit tests. Once this pr has been merged, I will submit another pr to deal with cleaning up the cache (I don't want this pr to get too complicated).

Contributor

@rghetia rghetia left a comment

couple of nits. LGTM otherwise.

latestValue := initialValue
if initial != latest {
	latestValue += latest.GetDoubleValue()
}
Contributor

I find the word 'latest' confusing. Maybe 'previous' would be better.

Contributor Author

Done.

runScript(t, script)
}

func Test_multiMetrics(t *testing.T) {
Contributor

The MultiMetrics test covers everything that the individual metric tests cover. Maybe just keep the multi-metric test.

Contributor Author

I considered this as well - the individual tests are easier for debugging and for documenting what should happen for the different aggregation types.

Unless you object, I propose keeping them for now - if maintaining them is problematic, we can remove them subsequently.

@dinooliva
Contributor Author

@songy23 - ptal

Contributor

@songy23 songy23 left a comment

One minor question, otherwise LGTM.

}

// returns true if at least one of the metric's timeseries was adjusted and false if all of the timeseries are an initial occurrence or a reset.
// Types of metrics returned by prometheus that are supported:
Contributor

How about gauge and cumulative int64 values?

Contributor Author

These kinds of metrics aren't generated by the Prometheus -> OC Metrics translation so I didn't add support for them.

@songy23 songy23 merged commit d4f12f7 into census-instrumentation:master Jul 18, 2019
@dinooliva
Contributor Author

@fivesheep

We've merged this code, which resolves issue 1 of the 4 issues that you originally noted. Of the other 3 issues:

Issue 2 needs to be handled by the metrics adjuster and I'm working on a solution to it. I had been planning to clean up at the job level rather than at the timeseries level, but that doesn't seem to work for your use-case (cadvisor), so I will look into handling individual timeseries.
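As a rough illustration of what per-timeseries cleanup could look like (the names and TTL policy below are placeholders, not the eventual implementation):

package adjuster

import "time"

type cacheEntry struct {
	lastSeen time.Time
	// adjusted start time, previous values, etc. would live here
}

type timeseriesCache struct {
	entries map[string]*cacheEntry // keyed by timeseries signature
	maxAge  time.Duration          // e.g. a few scrape intervals
}

// gc drops entries for timeseries that have not been scraped recently, so
// short-lived series (such as per-container metrics) do not accumulate.
func (c *timeseriesCache) gc(now time.Time) {
	for sig, e := range c.entries {
		if now.Sub(e.lastSeen) > c.maxAge {
			delete(c.entries, sig)
		}
	}
}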

Issue 3 is separate from the metrics adjuster and your code already deals with it.

Issue 4 is also separate from the metrics adjuster and, again, your code deals with it.

Would you be able to update your pr to remove the issue 1 and issue 2 related code while keeping the rest?

@fivesheep
Contributor

@dinooliva sure, will do. Are we still going to have a release with these two PRs, or will the release only happen in the new project?

@dinooliva
Contributor Author

@fivesheep I think it would make sense to make a new release with these changes but I'll have to consult with the project owners to see what their current process is.

@songy23 - any thoughts?

@songy23
Contributor

songy23 commented Jul 19, 2019

We can still make patch releases for small improvements, bug fixes, etc. @flands, could you help?

@pjanotti

There are a few pending PRs that started before we made the announcement that we were in "maintenance mode". Assuming this can wait a bit, let's close those before cutting a new release.
