Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/opentelemetry: Introduce Tracing API #7852

Open
wants to merge 17 commits into
base: master
Choose a base branch
from
Open

Conversation

aranjans
Copy link
Contributor

@aranjans aranjans commented Nov 18, 2024

This pull request adds the OpenTelemetry tracing API to the grpc-go opentelemetry plugin as outlined in proposal A72.

RELEASE NOTES:

  • stats/opentelemetry: Introduces an experimental API for enabling and configuring OpenTelemetry tracing within gRPC-go opentelemetry plugin under stats/opentelemetry/experimental. This includes the addition of TraceOptions in the Options struct to allow users to specify the TraceProvider and TextMapPropagator.

@aranjans aranjans added this to the 1.69 Release milestone Nov 18, 2024
@aranjans aranjans added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Nov 18, 2024
Copy link

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 85.56701% with 28 lines in your changes missing coverage. Please review.

Project coverage is 82.09%. Comparing base (56a14ba) to head (b6503f7).

Files with missing lines Patch % Lines
stats/opentelemetry/trace.go 77.50% 13 Missing and 5 partials ⚠️
stats/opentelemetry/client_metrics.go 83.87% 7 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7852      +/-   ##
==========================================
- Coverage   82.09%   82.09%   -0.01%     
==========================================
  Files         379      380       +1     
  Lines       38261    38405     +144     
==========================================
+ Hits        31409    31527     +118     
- Misses       5549     5568      +19     
- Partials     1303     1310       +7     
Files with missing lines Coverage Δ
clientconn.go 92.15% <100.00%> (-0.27%) ⬇️
...elemetry/experimental/grpc_trace_bin_propagator.go 87.50% <ø> (ø)
stats/opentelemetry/opentelemetry.go 76.82% <100.00%> (+0.95%) ⬆️
stats/opentelemetry/server_metrics.go 90.17% <100.00%> (+0.79%) ⬆️
stream.go 81.72% <100.00%> (+0.03%) ⬆️
stats/opentelemetry/client_metrics.go 85.64% <83.87%> (-2.29%) ⬇️
stats/opentelemetry/trace.go 77.50% <77.50%> (ø)

... and 23 files with indirect coverage changes

@aranjans aranjans requested a review from purnesh42H November 18, 2024 08:22
@purnesh42H purnesh42H self-assigned this Nov 19, 2024
examples/features/opentelemetry/client/main.go Outdated Show resolved Hide resolved
gcp/observability/go.mod Outdated Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
interop/xds/go.sum Outdated Show resolved Hide resolved
@@ -119,22 +134,46 @@ func (h *clientStatsHandler) streamInterceptor(ctx context.Context, desc *grpc.S
}

startTime := time.Now()
ctx, span := h.createCallTraceSpan(ctx, method)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should just not call createCallTraceSpan if tracing is disabled

Copy link
Contributor Author

@aranjans aranjans Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
stats/opentelemetry/opentelemetry.go Outdated Show resolved Hide resolved
@purnesh42H
Copy link
Contributor

@aranjans please update this with latest main branch to get rid of all go.mod and go.sum changes

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 20, 2024
@aranjans aranjans force-pushed the a72 branch 3 times, most recently from 378ddd3 to 8cb8222 Compare November 21, 2024 17:44
@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 22, 2024
@aranjans
Copy link
Contributor Author

@purnesh42H Thanks for your review, I have addressed all your comments and this PR is ready for another pass.

Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more non-test comments. Will review tests in next pass

clientconn.go Outdated Show resolved Hide resolved
stats/handlers.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stream.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more non-test comments. Will review tests in next pass

@purnesh42H purnesh42H assigned aranjans and unassigned purnesh42H Nov 22, 2024
@purnesh42H purnesh42H changed the title Implement A72: OpenTelemetry Tracing stats/opentelemetry: Introduce Tracing API Nov 22, 2024
@purnesh42H
Copy link
Contributor

@aranjans you should link the gRFC and concise your description

@aranjans
Copy link
Contributor Author

aranjans commented Nov 23, 2024

@purnesh42H I have addressed all the comments, and updated the description to link the grfc proposal.
Feel free to close the thread which are resolved now.

@aranjans aranjans assigned purnesh42H and unassigned aranjans Nov 23, 2024
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/client_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/opentelemetry.go Outdated Show resolved Hide resolved
stats/opentelemetry/server_metrics.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
stats/opentelemetry/trace.go Outdated Show resolved Hide resolved
@aranjans
Copy link
Contributor Author

@purnesh42H I have addressed all your comments, and updated the PR. Kindly review the PR.

@aranjans aranjans removed their assignment Dec 18, 2024
Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. 3 more documentation comments

// TestMetricsAndTracesOptionEnabled verifies the integration of metrics and traces
// emitted by the OpenTelemetry instrumentation in a gRPC environment. It sets up a
// stub server with both metrics and traces enabled, and tests the correct emission
// of metrics during a Unary RPC and a Streaming RPC. The test ensures that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: correct emission of metrics and traces

defer cancel()

// Make two RPC's, a unary RPC and a streaming RPC. These should cause
// certain metrics to be emitted, which should be observed through the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is verifying traces. Please mention traces everywhere in tests which are testing traces and they are observed through span exporter

defer cancel()

// Make two RPC's, a unary RPC and a streaming RPC. These should cause
// certain metrics to be emitted, which should be observed through the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. We are verifying the traces and verifying using span exporter

// of metrics during a Unary RPC and a Streaming RPC. The test ensures that the
// emitted metrics reflect the operations performed, including the size of the
// compressed message, and verifies that tracing information is correctly recorded.
func (s) TestMetricsAndTracesOptionEnabled(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aranjans may be one more tests to verify metrics and traces are not emitted if they are not enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@purnesh42H I have added the test. Please note: I have added the test which verifies the current setup works as expected when traces and metrics are disabled.

@purnesh42H purnesh42H self-requested a review December 18, 2024 17:10
@purnesh42H
Copy link
Contributor

@aranjans looks like there is some segmentation fault in stats package https://github.com/grpc/grpc-go/actions/runs/12372209143/job/34529917692?pr=7852. It looks like some test is written wrong

@purnesh42H purnesh42H requested a review from dfawley December 19, 2024 03:28
stream.go Outdated
@@ -213,13 +213,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
if err := cc.waitForResolvedAddrs(ctx); err != nil {
cc.nameResolutionDelayed = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this trying to capture? I don't think this is what you want.

For example, if name resolution takes 30 seconds to occur, and there is no deadline on the RPC, then this will not ever be set.

Also, making this a boolean field on the ClientConn can't be correct.

Copy link
Contributor

@purnesh42H purnesh42H Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its for capturing the delay in name resolution https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information

I think we need to specifically look for context timeout error to set this? waitForResolvedAddrs doesn't directly calculate delay; it waits for resolved addresses, and if resolution takes too long, it will return a timeout error?

What is the expected behavior when there is no deadline?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of the feature is to determine, if a name resolution delay occurred during the RPC's lifespan, how long that took.

So basically you want to add events to the lifecycle of the RPC when these things occur:

grpc-go/clientconn.go

Lines 682 to 684 in e957825

select {
case <-cc.firstResolveEvent.Done():
return nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dfawley and @purnesh42H for your thoughts, I will create a separate PR for this as a bonus feature.

@aranjans
Copy link
Contributor Author

@dfawley As per our offline discussion to exclude changes of adding event for name resolution delay, and create a separate PR. I have updated the PR.

@@ -215,7 +215,6 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
if err := cc.waitForResolvedAddrs(ctx); err != nil {
return nil, err
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just revert this file.

Comment on lines +30 to +31
// TextMapPropagator propagates span context through text map carrier.
TextMapPropagator propagation.TextMapPropagator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the user supposed to be able to override this?

We have a default for them, though, right? That should get documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into experimental/opentelemetry, or even experimental/stats/opentelemetry? Either way there should be comments indicating that it will be moved (and where it will be moved to, and when).

Comment on lines +50 to +55
func init() {
otelinternal.SetPluginOption = func(o *Options, po otelinternal.PluginOption) {
o.MetricsOptions.pluginOption = po
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this back to where it was to avoid the code churn please?

@@ -66,6 +68,15 @@ func (h *serverStatsHandler) initializeMetrics() {
rm.registerMetrics(metrics, meter)
}

func (h *serverStatsHandler) initializeTracing() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go into a new file? server_tracing.go?

@@ -171,6 +176,14 @@ func removeLeadingSlash(mn string) string {
return strings.TrimLeft(mn, "/")
}

func isMetricsDisabled(mo MetricsOptions) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code now looks like:

if !isMetricsDisabled() {
}

It should instead be:

if isMetricsEnabled() {
}

to avoid a double negative.

Comment on lines +76 to +77
otel.SetTextMapPropagator(h.options.TraceOptions.TextMapPropagator)
otel.SetTracerProvider(h.options.TraceOptions.TracerProvider)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, these are global settings. I don't think that's what we want... Only the user should ever be calling these kinds of function, not our library.

return mo.MeterProvider == nil
}

func isTracingDisabled(to experimental.TraceOptions) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss options, but requiring these to be called multiple times from the stats handler feels problematic. Maybe there's a tracing stats handler and a metrics stats handler and we combine them based on which ones are enabled?

At the very least, these should become methods so we don't have to keep reaching into our data structures when calling them to find the options fields.

}
ri := &rpcInfo{
ai.startTime = startTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change how this is set?

Why are you setting a local 3 lines above to time.Now() and then assigning this field to that?

Comment on lines +211 to +212
ai := &attemptInfo{}
startTime := time.Now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments as above.

@dfawley dfawley assigned aranjans and unassigned dfawley Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants