
Cloud Run example #62

Open · 8 tasks
kintel opened this issue May 15, 2020 · 18 comments
Labels: enhancement accepted (An actionable enhancement for which PRs will be accepted) · enhancement (New feature or request) · good first issue (Good for newcomers) · priority: p3

Comments

kintel commented May 15, 2020

To make sure OpenTelemetry and the JS exporter support Cloud Run, write an example that can be deployed to Cloud Run and includes:

  • Auto-instrumentation of incoming requests
  • Auto-instrumentation of outgoing requests
  • (optional) Manual instrumentation of an outgoing request

Then:

  • Documentation for how to deploy
  • Verification that all spans are visible in Google Cloud Console
  • Verification that canonical labels exist for all spans: SERVICE, REVISION
  • Verification that canonical label naming follows the OpenTelemetry spec

Optional:

  • Automated tests for as much of the above as possible

This ticket can be split into sub-tickets if appropriate.

kintel changed the title from "Cloud Run: example w/docs" to "Cloud Run Support" on May 15, 2020
punya added the enhancement (New feature or request), good first issue (Good for newcomers), and help wanted (Extra attention is needed) labels and removed the help wanted label on Feb 1, 2021
@patryk-smc

Any progress on this? OpenTelemetry doesn't work on Cloud Run / Cloud Functions yet, correct?

aabmass (Contributor) commented Jul 7, 2021

@patryk-smc I don't have an actual example but tracing should work fine. Use a regular BatchSpanProcessor and call TracerProvider.shutdown() before your program ends. For Cloud Run, you can add a SIGTERM handler to call shutdown. Not sure about Cloud Functions.

Metrics are a little more complicated unfortunately.
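
For reference, a minimal sketch of that setup, assuming the @opentelemetry/sdk-trace-node SDK and this repo's Cloud Trace exporter; the exact wiring is illustrative rather than a definitive implementation:

import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new TraceExporter()));
provider.register();

// Cloud Run sends SIGTERM before stopping an instance; shutting the provider
// down here flushes any spans still buffered in the BatchSpanProcessor.
process.on('SIGTERM', () => {
  provider
    .shutdown()
    .catch((err) => console.error('Error shutting down tracer provider', err))
    .finally(() => process.exit(0));
});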


pebo commented Sep 3, 2021

Is there any news on the metrics side of things? I did a quick test run and got errors like:

Send TimeSeries failed: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

Checking the code, a label opentelemetry_task gets added automatically and the value is set to 'nodejs-' + pid + '@' + hostname;

In Cloud Run the label value will always be something like nodejs-1@localhost, so it does not help at all. To avoid problems with out-of-order or too-frequent updates of a single time series, we could assign a custom label, e.g. with the container instance id as the value, but this would lead to an explosion in the number of time series created and, I guess, seriously affect time series query performance.

What's the plan for supporting OpenTelemetry metrics from Cloud Run? Is there a workaround for these problems?

aabmass (Contributor) commented Sep 14, 2021

Thanks for the report, @pebo. I was not aware of that, but our plans should fix it. We are not actively working on the metrics exporter right now because the upstream OTel metrics API is going through many breaking changes.


legendsjohn commented Jul 8, 2022

@aabmass Any update on this? Looking at the documentation here: https://cloud.google.com/trace/docs/setup/nodejs-ot, support seems to be implied.

For anyone else looking at this, switching from a SimpleSpanProcessor to a BatchSpanProcessor, as @aabmass mentioned, was the key to getting tracing to work well for me, although I don't fully understand why this would be the case.


legendsjohn commented Aug 3, 2022

For anyone else having issues with tracing and Cloud Run, there were two primary changes I had to implement for traces to be exported properly:

  1. Use a BatchSpanProcessor with Google's TraceExporter, i.e.:

provider.addSpanProcessor(new BatchSpanProcessor(new TraceExporter()));

  2. Force all traces to be sampled with an AlwaysOnSampler, i.e.:
export const provider = new NodeTracerProvider({
  sampler: new AlwaysOnSampler(),
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: processName,
  }),
  //...
});

Obviously an AlwaysOnSampler might not be the best choice for everyone, but I wanted to ensure all of the traces I expected were being recorded in Cloud Trace.

The second change, forcing all traces to be sampled with an AlwaysOnSampler, was very time-consuming for me to figure out. I had similar issues to open-telemetry/opentelemetry-js#3057 (albeit on a completely different platform).

If I ran my instance locally and exported traces to Cloud Trace, all of them would appear. However, after pushing my instance to Cloud Run, only traces associated with serving HTTP OPTIONS requests would be exported. Changing the sampler fixed this issue, and traces for all requests were exported.
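
Pulling the two changes above together, a sketch of a full provider setup, assuming the standard OTel JS packages and this repo's Cloud Trace exporter (the service name is a placeholder):

import { NodeTracerProvider, BatchSpanProcessor, AlwaysOnSampler } from '@opentelemetry/sdk-trace-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';

export const provider = new NodeTracerProvider({
  // Sample everything; trades export volume for not losing spans to sampling.
  sampler: new AlwaysOnSampler(),
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service', // placeholder name
  }),
});
provider.addSpanProcessor(new BatchSpanProcessor(new TraceExporter()));
provider.register();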


legendsjohn commented Aug 3, 2022

@aabmass shutdown in TraceExporter doesn't seem to be currently implemented.

Regardless, calling shutdown didn't solve or change any of my issues with Cloud Run and Cloud Trace.

aabmass (Contributor) commented Feb 6, 2023

It's interesting that you needed AlwaysOnSampler; that shouldn't be strictly required. The default OTel sampler is ParentBased(root=AlwaysOn), which would only not sample if the parent indicates not to sample (via the incoming request header). Which propagator were you using? I believe Cloud Run supports W3C traceparent propagation (OTel's default) and will actually do adaptive sampling for you.
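
For illustration, a sketch of what those defaults amount to when spelled out explicitly, assuming @opentelemetry/sdk-trace-node and @opentelemetry/core:

import { NodeTracerProvider, ParentBasedSampler, AlwaysOnSampler } from '@opentelemetry/sdk-trace-node';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

// ParentBased(root=AlwaysOn): sample unless the incoming traceparent header
// marks the parent span as not sampled.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({ root: new AlwaysOnSampler() }),
});
provider.register({
  propagator: new W3CTraceContextPropagator(),
});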

@aabmass shutdown in TraceExporter doesn't seem to be currently implemented.

We don't have any buffering in the trace exporter, so the empty implementation is intentional. Calling TracerProvider.shutdown() will also call shutdown() on the BatchSpanProcessor, which will flush the spans it has batched. In cases with very sparse/bursty traffic, many serverless processes may be short-lived, and the buffered spans would never be sent without calling shutdown().

It sounds like at least a minimal example would be useful here, so I'll leave this issue open and up for grabs.

aabmass added the enhancement accepted (An actionable enhancement for which PRs will be accepted) label on Feb 6, 2023
aabmass changed the title from "Cloud Run Support" to "Cloud Run example" on Feb 6, 2023
aabmass (Contributor) commented Feb 7, 2023

@AkselAllas what do you mean by "samples by default"? Afaik it will do some adaptive sampling based on current QPS, which is why the author of the first blog post had missing spans unless they used always-on sampling.


AkselAllas commented Feb 8, 2023

Ok. It might be adaptive sampling. 🤔

I experienced a sampling rate of roughly 0.5 even with very low QPS on Cloud Run.

@eduardosanzb

Hi, just checking whether anyone else is having issues getting the route attribute populated for the Cloud Run request_count metric.

aabmass (Contributor) commented Mar 16, 2023

@eduardosanzb those metrics are not related to this repo or OpenTelemetry. I'd recommend reaching out to support, but I don't think the route label is ever populated.

@eduardosanzb

@aabmass Thanks!

@steve-marmalade

For anyone else having issues with tracing and Cloud Run, there were two primary changes I had to implement for traces to be exported properly:

  1. Use a BatchSpanProcessor with Google's TraceExporter, i.e.:

provider.addSpanProcessor(new BatchSpanProcessor(new TraceExporter()));

@legendsjohn, regarding this point, did you also set CPU allocation to "CPU always allocated"? The documentation states that:

Selecting CPU always allocated allows you to execute short-lived background tasks and other asynchronous processing work after returning responses. For example:

  • Leveraging monitoring agents like OpenTelemetry that may assume they can run in the background.

My hope is to run with CPU only allocated during request processing, as this has major cost implications, but it is not clear to me if the solution @aabmass proposed of trapping SIGTERM and calling TracerProvider.shutdown() handles the case when the container is still "Serving" but CPU has been deallocated, per:

[screenshot from the Cloud Run documentation referenced above]

aabmass (Contributor) commented Jul 5, 2023

it is not clear to me if the solution @aabmass proposed of trapping SIGTERM and calling TracerProvider.shutdown() handles the case when the container is still "Serving" but CPU has been deallocated, per:

That's fair; I don't think that alone would help. However, with retries (#523), hopefully the request would succeed in the background when CPU is next allocated. Another option is to call await TracerProvider.forceFlush() before responding to your request so the spans are flushed while CPU is allocated.
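
A sketch of that per-request flush, assuming an Express handler and a tracer provider exported from a hypothetical ./tracing module:

import express from 'express';
import { provider } from './tracing'; // hypothetical module that builds the NodeTracerProvider

const app = express();

app.get('/work', async (_req, res) => {
  // ... handle the request, creating spans along the way ...
  // Flush before responding so spans are exported while CPU is still allocated.
  await provider.forceFlush();
  res.json({ ok: true });
});

app.listen(Number(process.env.PORT) || 8080);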

aabmass (Contributor) commented Jul 5, 2023

Anyone else coming to this issue: have you tried the OTLP exporter pointed at an OpenTelemetry collector sidecar in Cloud Run? The collector has more robust retry and batching logic and could solve your issues. You can flush data from the OpenTelemetry SDK to the collector at the end of each request, which should be very fast.
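
For anyone trying that route, a sketch of pointing the SDK at a collector sidecar over OTLP/HTTP; the localhost URL is the collector's default OTLP/HTTP endpoint and is an assumption about how the sidecar is configured:

import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    // The collector sidecar listens on localhost inside the same Cloud Run instance.
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  )
);
provider.register();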

@JonathanHope

I had a ton of trouble trying to get Cloud Trace exports working on Cloud Run. The suggestions in this thread were super helpful, and @aabmass in particular was right on the money a couple of times. I pushed up a minimal working example here in the hope that it will save someone else some time.
