Why so many interfaces? #23

Closed
freeformz opened this issue Jun 25, 2019 · 30 comments

@freeformz
Contributor

My expectation when I started looking at the code base was that I would find mostly concrete implementations of the basic data types (Traces, Spans, SpanContexts, Logs, Measurements, Metrics, etc) with interfaces for vendors to hook into for enriching that data and exporting it either in batch or streaming forms.

Looking through the code base I'm sort of astonished by the number of interfaces that exist over concrete data types.

I realize that vendors want some control over various aspects of the implementation, but starting with making everything an interface seems like a poor design choice to me.

Can anyone explain this or point me to the docs that explain the motivation behind this?

@abraithwaite

abraithwaite commented Jun 25, 2019

Not that I have a horse in this race, but the excessive interfaces and lack of concrete types made me hate working with the opentracing library. I would probably suggest against using opentelemetry in my company if it goes the same way.

In particular, we had a requirement to serialize a span onto an internal object, but we couldn't, because doing so meant implementing and instantiating our own tracer, which was itself too abstract to really be of use.

opentracing/opentracing-go#184
opentracing/opentracing-go#127

@iredelmeier
Member

I agree, at least for most of the mentioned types.

My understanding is that the initial motivation is primarily historical, and that the current code is very much intended as a prototype.

I'm happy to delve into this some more and/or be assigned, albeit I don't know how much capacity I'll have to investigate this week.

@bogdandrutu
Member

cc @yurishkuro @tedsuo I think the requirement to have an api package that allows re-implementing the majority of the classes came from OpenTracing. There are a lot of benefits to having this: vendors can re-implement the whole API if the SDK does not satisfy their requirements, but all the instrumented code stays the same. Also, OpenTelemetry can decide to completely change the implementation of the API if needed.

We will have an SDK package that will be very close to what you describe (plugins, exporters, etc.) that vendors can use to send data to their backend.

But based on the merger agreement we need to have an api package with "so many interfaces" :)

@iredelmeier
Member

I believe this is also tied in some ways to #20. The API and SDK do not inherently need the same types, e.g., the API can provide structs and the SDK can accept interfaces.

@freeformz
Contributor Author

When I read @bogdandrutu's and @iredelmeier's comments they don't seem to align.

According to https://github.com/open-telemetry/opentelemetry-specification/blob/be15b9ceff11bea36dd9d50dc95e37eb998b6adc/specification/library-guidelines.md#language-library-generic-design:

  • The API is what I would use in my code or in a library that I'm writing, and it provides a "minimal implementation"
  • The SDK is "plugged" into the API to substitute for that minimal implementation, among other things.

When instrumenting code, I prefer to use concrete unboxed types to avoid interface overhead and to make it simpler to inspect what is actually happening in my code.

vendors can re-implement the whole API if the SDK does not satisfy their requirements, but all the instrumented code stays the same.

This is a confusing statement to me. It also implies that the API has to be a bunch of interfaces, as there would be no other way to maintain the "instrumented code stays the same" invariant.

I am not a vendor, but I don't really understand the need to "re-implement the whole API". Why would they want to re-implement basic building blocks?

There are ways to enhance basic building blocks w/o re-implementing them and w/o using interfaces.

Maybe some of this (again) goes down to talking about different things when attempting to talk about the same thing?

@freeformz
Contributor Author

PS: I can stop "complaining" (what I feel like I'm doing) about this for a bit and wait until work on #20 starts in the hopes that will clarify my confusion.

@tigrannajaryan
Member

tigrannajaryan commented Jun 26, 2019

When you have concrete types in the API, you either have to use mostly the same types in the implementation, or you have to perform data translation for API calls. The former places more restrictions on the implementation; the latter is a performance penalty.

I am also not a big fan of interface proliferation; however, if we want concrete types in the API, it means we need to put much more effort into getting those types right from an implementation perspective. I do not know how feasible that is given our timeframes.

I am not a vendor, but I don't really understand the need to "re-implement the whole API". Why would they want to re-implement basic building blocks?

There has been a discussion around this. An example is a streaming implementation, where most span operations result in events that immediately get streamed to the instrumentation target (e.g., StartSpan or AddAttribute would each be an event). Compare this to the more traditional implementation, which accumulates span data and ships it when the span is complete.

These two approaches are so significantly different that it makes sense to place the boundary between API and implementation very close to the surface of the API, to give more flexibility to the implementation. This also dictates the need to have everything behind interfaces.

So the interface-heavy design and the particular placement of the API<->SDK boundary are primarily a consequence of the requirements.
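
To make the difference concrete, here is a minimal sketch of the two boundary placements; the type and method names are hypothetical, not the actual OpenTelemetry API:

package boundary

// Hypothetical placeholder types, for illustration only.
type SpanContext struct{ TraceID, SpanID uint64 }
type KeyValue struct{ Key, Value string }
type SpanData struct {
	Context    SpanContext
	Name       string
	Attributes []KeyValue
}

// Boundary close to the API surface: the implementation sees every
// operation the moment it happens, so nothing has to accumulate in-process.
type StreamingBackend interface {
	OnStartSpan(sc SpanContext, name string)
	OnSetAttribute(sc SpanContext, kv KeyValue)
	OnFinishSpan(sc SpanContext)
}

// Boundary further down: the implementation only ever sees complete,
// finished spans, so the SDK must accumulate them first.
type BatchBackend interface {
	ExportSpan(data SpanData)
}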

@tigrannajaryan
Member

There are ways to enhance basic building blocks w/o re-implementing them and w/o using interfaces.

@freeformz given the requirements that OpenTelemetry has how would you approach this? I'd be interested to discuss alternatives.

@iredelmeier
Member

When instrumenting code, I prefer to use concrete unboxed types to avoid interface overhead and to make it simpler to inspect what is actually happening in my code.

vendors can re-implement the whole API if the SDK does not satisfy their requirements, but all the instrumented code stays the same.

This is a confusing statement to me. It also implies that the API has to be a bunch of interfaces, as there would be no other way to maintain the "instrumented code stays the same" invariant.

I am not a vendor, but I don't really understand the need to "re-implement the whole API". Why would they want to re-implement basic building blocks?

There are ways to enhance basic building blocks w/o re-implementing them and w/o using interfaces.

Maybe some of this (again) goes down to talking about different things when attempting to talk about the same thing?

Agreed on all these fronts!

I have a hypothesis that there's still some work to be done around the API<->SDK boundary in order to solve this. I think that this is why my and @bogdandrutu's points aren't aligned: his is (with good reason!) based on the current state of things, i.e., the spec, whereas mine was based on some handwavey future vision.

PS: I can stop "complaining" (what I feel like I'm doing) about this for a bit and wait until work on #20 starts in the hopes that will clarify my confusion.

@freeformz I don't see your point (or @abraithwaite's) as "complaints". This is, in my opinion, the sort of feedback that, if anything, we should be addressing before it's too late. Then again, I may be biased based on my own experience tracing Go projects :)

Would you be interested in collaborating on a relevant RFC around the API<->SDK boundary? I've had some relevant ideas bouncing around in my head, albeit they're currently very immature.

@abraithwaite

given the requirements that OpenTelemetry has how would you approach this?

What about a hooks interface? It could be an optional component of a bigger framework that allows that level of instrumentation.

@jmacd
Contributor

jmacd commented Jun 26, 2019

OpenTracing had these "basictracer" packages, which were not fully developed but were approximately like the SDKs we're building for OpenTelemetry. They provided a simple way to support the complete API while only having to implement a narrow interface. In OpenTracing/basictracer this was called "SpanRecorder".

The default v1 SDK for OpenTelemetry is similar to that basictracer concept, where the expectation is that vendors will plug in Span and Metric exporters to provide customization for their platform.

So, why so many interfaces? I see this as allowing for radically different SDKs that support different exporter models. You should only need to implement those interfaces if you're trying to invent a new exporter. The one under discussion presently is a streaming implementation--so imagine there were going to be two SDKs, one that provides everything in the process, and one that assumes everything is done outside the process. Those implementations are so different that they will require different implementations of the interfaces. If you're only trying to implement variations on a span or metrics exporter, you can use the provided SDK. In the current repository, there is an experimental exporter named observer which supports a streaming implementation. It will remain experimental, but should demonstrate the concept as we move towards the Span and Metric exporter model for v1.

For Go specifically, interfaces are only problematic IMO when they are used to represent things that must be allocated on a per-call basis, since interfaces turn into memory allocations when they do not contain pointer values. It's a good thing to use interfaces when you've already allocated an object. On the other hand, if you've got to allocate a new object in order to pass a value as an interface, or if you're passing a struct inside an interface, then the API requires unnecessary memory allocations.
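
As a rough illustration of that allocation behavior (illustrative types only; the exact outcome depends on the compiler's escape analysis):

package boxing

// Illustrative value type, large enough that boxing it into an
// interface is not free.
type KeyValue struct {
	Key   string
	Value [4]int64
}

type Sink interface {
	Record(v interface{})
}

// The pointer already refers to an allocated object; wrapping it in an
// interface adds no new allocation.
func recordPointer(s Sink, kv *KeyValue) {
	s.Record(kv)
}

// The struct is copied onto the heap so the interface can refer to it:
// one extra allocation per call.
func recordValue(s Sink, kv KeyValue) {
	s.Record(kv)
}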

As an example, yesterday this PR was merged with an interface that I don't like: https://github.com/open-telemetry/opentelemetry-go/pull/22/files#diff-f9ed0a11848dce30d7a32470f1bbb3dbR170

This new Event interface requires allocating a new object whenever AddEvent is called. I wouldn't have approved this PR and we'll have to discuss it. (CC: @rghetia @bogdandrutu)

@freeformz
Contributor Author

Would you be interested in collaborating on a relevant RFC around the API<->SDK boundary? I've had some relevant ideas bouncing around in my head, albeit they're currently very immature.

@iredelmeier Yes. Hit me up on Email/Gopher slack/Twitter DM.

@bogdandrutu
Member

Relevant to this discussion open-telemetry/opentelemetry-specification#165

@iredelmeier
Member

I've started spiking on an architecture prototype here, which is intended to address both the interface issue and some other questions.

StartSpan, FinishSpan, and SpanExporter are probably the pieces most relevant to this particular conversation. The OpenTracing and OpenCensus bridges as well as the examples might also be worth a look.

Note: My "prototype" is NOT intended as a foil to opentelemetry-go, let alone as a replacement for it. Rather, my intention was to spike on some ideas which I intend to split out as separate RFCs. The fact that my code is in go is mostly incidental... although it does let me cheat a bit and use context.Context rather than address its absence in another language :)

@freeformz @abraithwaite I'm VERY interested in whether the prototype's architecture is along the lines of what you're looking for, or if it's totally in the wrong ballpark. Admittedly, there are a whole bunch of other ideas intermingled there - sorry about any and all noise 😅

@axw

axw commented Jul 2, 2019

We (Elastic APM) have a couple of things we'd like to be able to do that would be made ~impossible with concrete span types and functions:

  1. For each span, record the "local root span ID" (we call it a transaction ID). I believe Zipkin would also like to use this. This either needs to be recorded in the TraceContext, or otherwise we need control over the Span implementation. See this OpenCensus issue for more details.

  2. We have our own native API, independent of OpenTracing and OpenCensus, which we would also like to bridge to the OpenTelemetry API. It would be easy enough to layer OpenTelemetry on top, but we would ideally also like to be able to extract TraceContext from our own types, so we can mix our own automatic instrumentation with custom OpenTelemetry instrumentation. This would mean controlling the behaviour of TraceContextFromContext.

I realise there are competing concerns (allocations, performance, etc.), but just a bit more food for thought.

@cstockton

@iredelmeier I wanted to take a moment to say that I believe https://godoc.org/github.com/iredelmeier/opentelemetry-playground/trace#Span is a fundamental design flaw that is repeated across every SDK/library of the project. Spans should not exist as a user-facing concept in the SDK at all, but if they must, the finish times have got to go. So many quirks and bugs make instrumenting code more difficult, because you have to be careful about span relationships when spans are these infinitely accumulating buffers.

The SDKs should carry observations out of the process with minimal processing, nothing more. Almost every criticism I've seen in these issues is a direct result of the libraries doing far more than should be happening inside the processes of business-critical services. Here are things that are not important:

  • A span was not finished, was finished ONCE, or was finished TWICE
  • A span has a traceid that DOES resolve to another active span, or it DOES NOT
  • A span stays open for 1 SECOND, or stays open 1 WEEK blocked in a Serve(ctx) error func
  • A span has 0 log calls, or 1,000,000

The theme here is that the SDK doesn't need to validate, make decisions, map objects, or create lock contention for things that can happen later in the pipeline, once it's far clear of my critical core services. I find it concerning that @axw has to worry about the API details of the Go SDK. Elastic APM should be one of an unlimited number of endpoints I can point the opentelemetry-go package to, meaning the extent of the Elastic APM team's integration in my software should be OPENTELEMETRY_ENDPOINT="https://traces.elastic.url". I shouldn't need to import a line of code written by Elastic APM to ship to them; I'm sure the code is fantastic, but experience has shown me that the more code you import, the more that can go wrong.

Alternatively, a vendor-neutral, extensible wire format emerges and each vendor writes only a server endpoint in their preferred language - which may just mean exposing another optional listener on their existing servers. Every "SDK" becomes a "trace client" that is just one of many potential implementations, with its own trade-offs which I may or may not choose to accept. That option is taken from me by the approach being used here and in opentracing-go: it's just not viable to write a competing SDK implementation, given that you would also need to provide the integration for every potential vendor in order to compete.

I know most of this echoes other issues I've raised, like open-telemetry/opentelemetry-specification#20 / #19 - but I am hoping a comment makes someone with some influence reconsider some of these design decisions.

@yurishkuro
Member

yurishkuro commented Jul 12, 2019

@cstockton I don't think anyone is denying the value of a vendor-neutral data format, but it's not the goal of the OpenTelemetry project (it is a goal of the W3C Distributed Tracing working group). We need to think about the audience. If I am an application developer and I want to capture trace events from my code, a formal definition of an exportable data format does me little good - I am not going to implement it myself. I simply need a couple of API methods to indicate "I am starting this work, with this metadata, and now I am finishing it". And then I need to bind that API to some implementation that will take care of exporting the data to whichever backend I happen to favor that week. Once my code is instrumented with the API, I never have to touch it again, while the implementation may change from a very slim one that streams every datum out of process to the pipeline (as you described), to a heavier one that has some smarts about sampling, trimming tags/logs, and other tricks that may reduce the overhead to my app (if implemented correctly). The point is, the API-first design allows for this possibility, whereas data-format-first design does not.

And the second, not less important point, is that if I use libraries A, B, C in my application, I don't want every one of those libraries to be responsible for exporting telemetry in the common data format. It may be ok for telemetry that is completely encapsulated within the library, but tracing clearly doesn't work that way, there needs to be some common context transfer across library boundaries, which again requires API-first design (although with a smaller API surface than the actual trace points).

@cstockton

@yurishkuro I more or less agree with your stated user needs; what I don't see is any logical connection between the current implementation and those needs. You're the first person to make this statement as everyone else has openly stated backwards compatibility and software-reuse trumps ideal design. I also don't actually see any supporting argument for your first statement: you stated a detailed user story but provided no arguments as to why a common data format is harmful to that goal:

The point is, the API-first design allows for this possibility, whereas data-format-first design does not.

The next paragraph I find to be entirely an argument for a common wire format:

And the second, not less important point, is that if I use libraries A, B, C in my application, I don't want every one of those libraries to be responsible for exporting telemetry in the common data format.

I don't want this either: today, when I import opentracing-go, I also have to import jaeger to ship the traces. This is because there is no common vendor-neutral wire format. opentracing-go used an API-first design, and because of it every trace vendor needs to implement a language-specific tracing driver which users must import into their code. This means API changes for better ergonomics are near impossible, as you have to coordinate with all potential drivers in lock-step; hence opentracing-go has close to as many deprecated methods as non-deprecated ones. This is what happens when you have every single vendor integrate in the process space to sidestep designing a vendor-neutral transport protocol.

While I'm not yet entirely sure what opentelemetry is planning, it is a similar paradigm, except it also allows vendors to import binary blobs (Go plugins)... into my process. Completely, utterly unacceptable... I go into this in more depth in "Issues with current design". So now it also has to take the existing ABI into consideration when it goes to add new features.

It may be ok for telemetry that is completely encapsulated within the library, but tracing clearly doesn't work that way, there needs to be some common context transfer across library boundaries, which again requires API-first design (although with a smaller API surface than the actual trace points).

I don't follow; it's almost like you are making an argument FOR a common wire transport protocol? The only thing I can deduce here is that you are calling the current transport API, defined as a gRPC Service (https://github.com/open-telemetry/opentelemetry-java/blob/master/sdk/src/main/proto/trace/v1/trace.proto#L176), the "API" in this "API-First" reference? If this is the case I'll say it yet again, but attempt to clear up the terminology confusion: the current gRPC API has fundamental design issues that make it unsuitable as a foundation to build upon.

Your argument, and the argument of others, has been that it's hidden from the user and we can change it later. This is both central to my argument (the transport API is hidden from the user inside the library API) and entirely false (it will not be changed later). The simple fact is that using this gRPC service as a transport API heavily influences the user-facing library API, because it is, itself, a library API. It's natural, when you have a hidden library API (a gRPC service) that takes a Span with a start and end time, that your library requires... a start and end time. And so once again we arrive at my central argument: the lack of a transport API (a common wire format), in favor of a library API built upon another library API (a gRPC-generated service) which provides a free transport (gRPC), has heavily influenced the design.

I think if I continue it will sound negative in tone, or as though I don't support the efforts here, which is not the case. I admit I am a bit frustrated, since some of the foundational design decisions I vehemently disagree with are not open for technical debate. I find little merit in the oxymoron "we can't change this right now due to backwards compatibility, but we can change it later", yet I am forced to concede to it while watching the user-facing library API be influenced by the rigid constraints it imposes. These are serious technical issues that should be fixed before widespread adoption.

Maybe the "W3C Distributed Tracing working group" comes up with something that supports all of my primary concerns. Maybe spans won't be large bundled data structures. It won't matter, the foundation the API was built upon does. This foundation which the project is headstrong on building upon influences the user facing library API. A more robust wire format incorporated down the road won't change the fact the API was already written on a foundation which imposed a more restrictive design. At which point our hands will be tied due to the massive number of dependencies. This is a simple fact.

@yurishkuro
Member

You're the first person to make this statement as everyone else has openly stated backwards compatibility and software-reuse trumps ideal design.

I largely agree with "everyone else" on that. Both OT and OC have large user bases, and the single most important goal is to converge the two projects into one, not to completely redesign toward an ideal state (which could take multiple years). Yes, once released, things are slow to change, but to me the question is whether the API fundamentally provides the wrong abstractions that do not allow innovation in the implementation space, or whether the API is flexible enough. If I am writing a library or framework in Java, all I need to know about logging is SLF4J and whether it allows me to express my telemetry output needs, not what data format the logs will eventually be written to.

you stated a detailed user story but provided no arguments as to why a common data format is harmful to that goal

Because a data format locks you into an implementation detail. I have seen all kinds of uses of the OpenTracing API that had nothing to do with collecting telemetry (e.g., there was something about using OT to control unit tests; I don't recall the details). There is a difference between a data format and a data model: the API does imply a certain conceptual data model, but it's much more nimble. For example, span.SetTag("a", "b") implies a conceptual data model of a tag as a key-value pair. It does not imply any physical data format that says "there is a Span struct that contains a collection of Tags" - you could implement it that way, or as a form of a standalone event (set-tag, trace-id, span-id, key, value).
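
A small sketch of the two physical shapes that the same SetTag call could map onto (field names are illustrative only, not from any spec):

package datamodel

// Shape 1: tags accumulate onto a span record that is shipped when the
// span finishes.
type SpanRecord struct {
	TraceID, SpanID string
	OperationName   string
	Tags            map[string]string
}

// Shape 2: every SetTag call becomes its own standalone event record and
// can be shipped immediately.
type SetTagEvent struct {
	TraceID, SpanID string
	Key, Value      string
}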

you are calling the current transport API defined as a gRPC Service the "API" in this "API-First" reference?

Absolutely not, I am talking about the language API, interfaces like Tracer, Span, etc.

Despite defending data format, you're making one point that I agree with - a specific SDK, exporter, or data format should not be dictating how the API is defined (although it may require some abstractions at the API level, e.g. the reason why SpanContext cannot be a fixed struct). If you see those serious issues, I would much rather have concrete discussion about them than a philosophical one. In my experience, the OpenTracing API managed to remain very nimble and unassuming about what lies beneath it. It was certainly not driven by any specific proto format, because it fundamentally had to accommodate many different vendors with vastly different implementations.

@cstockton

Replying to your first two comments will just be a regurgitation of positions I've already stated. I'll focus on what appears to be common ground we have found:

you're making one point that I agree with - a specific SDK, exporter, or data format should not be dictating how the API is defined (although it may require some abstractions at the API level, e.g. the reason why SpanContext cannot be a fixed struct). If you see those serious issues, I would much rather have concrete discussion about them than a philosophical one.

First, nothing here is philosophical. I've outlined the issues with requiring Spans to be fully constructed before shipping them in multiple places (open-telemetry/opentelemetry-specification#20 (comment)), but I'll try to articulate them yet again here.

First, consider the spec:
https://github.com/open-telemetry/opentelemetry-java/blob/master/sdk/src/main/proto/trace/v1/trace.proto - I want to focus attention on the fact that Spans must, by design, contain their full set of information; there is no partial span. A span is either sitting in memory accumulating, or it is finished and no longer usable:

  // The start time of the span. On the client side, this is the time kept by
  // the local machine where the span execution starts. On the server side, this
  // is the time when the server's application handler starts running.
  //
  // This field is semantically required. When not set on receive -
  // receiver should set it to the value of end_time field if it was
  // set. Or to the current time if neither was set. It is important to
  // keep end_time > start_time for consistency.
  //
  // This field is required.
  google.protobuf.Timestamp start_time = 8;

  // The end time of the span. On the client side, this is the time kept by
  // the local machine where the span execution ends. On the server side, this
  // is the time when the server application handler stops running.
  //
  // This field is semantically required. When not set on receive -
  // receiver should set it to start_time value. It is important to
  // keep end_time > start_time for consistency.
  //
  // This field is required.
  google.protobuf.Timestamp end_time = 9;

Now, before we leave the proto file and move on to the restrictions this imposes on the API, let's look at the problems we create for ourselves inside our own file first, starting with events:

  // A collection of `TimeEvent`s. A `TimeEvent` is a time-stamped annotation
  // on the span, consisting of either user-supplied key-value pairs, or
  // details of a message sent/received between Spans.
  message TimedEvents {
    // A collection of `TimedEvent`s.
    repeated TimedEvent timed_event = 1;

    // The number of dropped timed events. If the value is 0, then no events were dropped.
    int32 dropped_timed_events_count = 2;
  }

  // The included timed events.
  TimedEvents time_events = 11;

We have to include a count of dropped events, but why exactly is that? Well, it's an infinitely accumulating buffer. There has to be some upper bound on memory, or a simple mistake like starting a span in a location of code that blocks in a for loop delegating work can OOM your program. Anything "repeated" on a span must have an upper bound for safety, and we do define such bounds - either in the proto itself or verbally in the spec.

So by hanging onto a span for its entire duration, not only do we rob users of the very thing tracing is meant for (observability), we also severely limit the use cases by requiring spans to be short-lived objects. I can't start a Span inside a worker and observe that the worker is running as soon as it hits my ingest, because nothing is shipped until the span is over; I can't see "open" spans. I can only observe things that have already happened. This by itself is the single biggest loss I've had adopting opentracing-go versus the internal system I designed, which was strictly event-based. The second an observation was made, I shipped it. We knew what transactions were open at any given time. We could start spans for any use case and let them block for the full duration a server runs; no big deal. Creating thousands of child spans; no big deal. The client library just shipped the events, it didn't care, and it was fantastic. I never ran into a single problem.

OpenTracing has continuously surprised me; "too many child spans", for example, was a hard lesson learned. I have to be so careful not to make a span a child of other spans, even when it is in fact a child, because I have to be careful not to exhaust references, log records, and so on. The bottom line is this: tracing should be frictionless, and your penalty for misuse of a tracing library should not be a loss of observability. We create this problem by using the current transport above, and I find it completely unacceptable.

So how does it impact the client library? Well, it's rather obvious: to start, each of those upper-bound limitations imposed by the transport (attributes, log events, etc.) implicitly affects the library. Let's begin with a concrete example. Eventually, https://github.com/open-telemetry/opentelemetry-go/blob/master/api/trace/api.go#L64 will need to mention that there is an upper bound on the number of events, so people don't make my mistake and find that log records have been silently lost when they matter most (when something suddenly gets retried many more times than usual, that is exactly when you want to know why).

You could say that "behavior" is not actually influencing the API, which I would disagree with, but happily switch to something else concrete - https://github.com/open-telemetry/opentelemetry-go/blob/master/api/trace/api.go#L53

	// Finish completes the span. No updates are allowed to span after it
	// finishes. The only exception is setting status of the span.
	Finish()

Notice: "no updates are allowed to span" - why is that? It's because it's impossible. We are binding attributes and events to the Span, if it was already sent, that's it. There is no buffer to accumulate onto anymore. So what happens? Do we panic? Do we just throw away the data? Who knows. We don't have to go far to find multiple examples of this design decision negatively impacting the API, at this point though I should just print the entire (massive) interface:

type Span interface {
	// Tracer returns tracer used to create this span. Tracer cannot be nil.
	Tracer() Tracer

	// Finish completes the span. No updates are allowed to span after it
	// finishes. The only exception is setting status of the span.
	Finish()

	// AddEvent adds an event to the span.
	AddEvent(ctx context.Context, event event.Event)
	// AddEvent records an event to the span.
	Event(ctx context.Context, msg string, attrs ...core.KeyValue)

	// IsRecordingEvents returns true if the span is active and recording events is enabled.
	IsRecordingEvents() bool

	// SpanContext returns the span context of the span. The returned SpanContext
	// is usable even after the span is finished.
	SpanContext() core.SpanContext

	// SetStatus sets the status of the span. The status of the span can be updated
	// even after span is finished.
	SetStatus(codes.Code)

	// Set span attributes
	SetAttribute(core.KeyValue)
	SetAttributes(...core.KeyValue)

	// Modify and delete span attributes
	ModifyAttribute(tag.Mutator)
	ModifyAttributes(...tag.Mutator)
}

Consequence: upper bounds are needed for safety. I guess it beats losing my entire span like I do in OpenTracing?

	// IsRecordingEvents returns true if the span is active and recording events is enabled.
	IsRecordingEvents() bool

Consequence: more API surface area for control knobs due to the above (https://github.com/open-telemetry/opentelemetry-go/blob/master/api/trace/api.go#L157):

// WithRecordEvents enables recording of the events while the span is active. In the absence of this option, RecordEvent is set to false, disabling any recording of the events. 
func WithRecordEvents() SpanOption

Consequence: because this thing is in memory, we might as well add knobs to mutate the values:

	// Set span attributes
	SetAttribute(core.KeyValue)
	SetAttributes(...core.KeyValue)

	// Modify and delete span attributes
	ModifyAttribute(tag.Mutator)
	ModifyAttributes(...tag.Mutator)

I could continue on, but the above is sufficient evidence of how the current structure of spans imposes serious design issues. What bothers me so much is that something better (streaming observations as they are made, or as they change) is actually less work, would allow much cleaner user-facing ergonomics, would lower the cognitive burden of instrumenting code, would remove some nasty gotchas (dropping), and would open up an entire class of tracing which I dearly miss: long-running spans, with a status of "open" until any point in the future when Finish() is called. I've tried to mimic this behavior, but you don't see the span until it exits, and it's risky: if you accidentally start a span that references the parent span (yes, it's silly that you have to avoid expressing causality due to this design decision, but let's accept that for a second) or make log calls without starting a new root span, then eventually the buffered span will overflow and be dropped.

Tracing as designed now is a minefield and a user burden due to the design constraints imposed by the underlying gRPC transport (a monolithic, one-shot data structure). I don't know how else I can show or explain it; it's announced all throughout the code.

@yurishkuro
Member

@cstockton I disagree with your main statement that there is a requirement somewhere, implied by the API, that a span needs to be kept in memory and no telemetry can leave the app until the span is finished. A Span is supposed to be a write-only interface, and the API does not prevent the implementation from shipping the data as soon as individual write methods are called on the span. Most of your arguments are based on the proto definition, not the API. And yes, the SDK implementation does hold onto spans until they are finished, but that's not the only possible implementation, just the one that has been most commonly used in both the OT and OC worlds, in large part because that's how most tracing backends are implemented.
This is the case where the argument "we can fix it later" very much applies, as long as the API doesn't lock you into the accumulation behavior, which I don't think it does.
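
A minimal sketch of what such a write-forwarding Span could look like; the sink interface and type names here are hypothetical, not the actual SDK:

package writeonly

import "context"

// Hypothetical minimal types for illustration.
type SpanContext struct{ TraceID, SpanID uint64 }
type KeyValue struct{ Key, Value string }

// The backend sees each write the moment it happens.
type EventSink interface {
	SpanStarted(sc SpanContext, name string)
	AttributeSet(sc SpanContext, kv KeyValue)
	EventAdded(sc SpanContext, msg string)
	SpanFinished(sc SpanContext)
}

// span satisfies a write-only Span API without accumulating any state
// beyond its identity: every write method forwards immediately.
type span struct {
	sc   SpanContext
	sink EventSink
}

func (s *span) SetAttribute(kv KeyValue)                 { s.sink.AttributeSet(s.sc, kv) }
func (s *span) AddEvent(ctx context.Context, msg string) { s.sink.EventAdded(s.sc, msg) }
func (s *span) Finish()                                  { s.sink.SpanFinished(s.sc) }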

Another thing is, you keep referring to the Go API, which I think is not even in an alpha state. For example, ModifyAttribute() is completely surprising to me, as it implies read/write access to the span (I opened ticket #53 to remove it). There is no such method in the Java API. The comments/docs of the Finish() method are similarly weird (why is status allowed to be set after finish?) and do not match the Java API. I have not been following the Go API closely, but I expect it to conceptually converge to what the Java API has already defined.

In summary,

  • if you think there are places in the API that lock the SDK into holding onto the span, then I think they need to be fixed
  • if you argue for implementing the SDK as streaming rather than accumulating, then I'm afraid you're in the minority for the time being, because that's not just an SDK concern: the whole architecture of the tracing backends may need to be changed, so making it a requirement for OpenTelemetry v1 would massively undermine its usefulness to existing vendors and OSS tracing platforms, many of which expect to receive complete spans from the SDK.

@cstockton

@yurishkuro I guess we can agree to disagree; I just don't know how else to explain this. Perhaps it's because I implemented distributed tracing as an event system years ago, before there were any standardization efforts, where everything was just a stream of small events with different identifiers (log lines, span starts/ends, and so on), that I can see how the current system was influenced. If I had to sum it up, I would say that Spans currently feel like a container (because they are), and when you have a container you are forced to think about upper bounds for the data it holds as well as all the nuances of object lifetimes.

Even so, it's not just about what I see in the API that prevents streaming of events; it's about the fact that it's not the ideal API for a stream-based system. In such a system you wouldn't see Span be such a predominant object, because the coupling is no longer in the container; it's referential, freeing you from all the bounds constraints and lifetime rules. A Span is just a conceptual moment in time, available to observe the moment it starts and again when it finishes. In other words, this API is not the API we would collectively settle on given a more robust streaming backend, and the way people instrument their code would be much different. I know this because I've seen the loss of visibility in my own code bases before OpenTracing and after.

I think I've provided several other indicators of the API being influenced by the design of Spans, and more importantly of how that design impacts the way you instrument code and how much value the instrumented code provides. I see no mention of my dropped-span complaints, which are a very real obstacle: losing spans due to a simple mistake, one that only surfaces under certain conditions (when you need them the most), is a major issue to me. Tracing should be rock solid, something you can rely on to provide insights when you really need them. The system is designed to be lossy from the ground up because of the underlying transport. Even if you change the transport to support streaming down the road, from now until that day every single line of code will be written with the cumbersome constraints imposed by span lifetimes and silent upper-bound truncation.

But your second point is taken, to a degree: why waste effort on something no backend supports? I would ask you, though: will such backends ever exist if there is no frontend for them to ingest from? Probably not. However, this would have been a fantastic time (the beginning of convergence, fresh code bases, etc.) to have done it, or at least to have provided the option to do it: just move the span buffering downstream and drop spans, if need be, at the local collectors. I get it though; that ship has sailed at this point.

We are stuck with a less-than-ideal solution with limited use cases in order to rush a convergence effort driven by heavy commercial interest, at the cost of community churn, essentially to delete one of two completely working ecosystems. Good luck telling me, in anything less than four years, to change to OpenTelemetry v2 to support basic features and use cases I had access to eight years ago, after I have just spent two years replacing OpenTracing with OpenTelemetry, after spending two years advocating that we use OpenTracing. What a mess. I'll just be glad when it's over and settle for whatever I get.

Thanks for listening to my concerns though, even if we disagree, it helps to be heard and thanks for the efforts you and all the other contributors are spending on this.

@jmacd
Contributor

jmacd commented Jul 16, 2019

@cstockton I agree with you on the goals you set above, but there seems to be some misinterpretation happening. My original prototype for this repository was a streaming SDK; there has never been any accumulation of state inside a Span object in that implementation, and the APIs we have today are compatible with those goals. Given that I think we agree, and that the current API meets our stated goals, I'm not sure the long issue thread here is productive.

I just posted PR #55 for this SDK demonstrating your point that the API should not be concerned with detecting duplicate Finish() calls.

I would like to hear what you think about #52, which relates to one of your other points in this thread. I understand your objection to injecting a dependency via Go's plugin mechanism, but that issue and the associated PR were made to explore whether "zero touch instrumentation" is a requirement, because if we have that requirement and a desire to support alternative SDKs, we need some way to perform automatic dependency injection, which is not easy to do in Go.

@cstockton

@jmacd I explained in detail why I don't feel the API meets my goals, but I have come to accept that it does meet the goals of the OpenTelemetry project.

As for pull request #55, I think the work done in *span is great overall; it removes a lot of bloat. However, I think it actually works against my underlying concern about the double Finish, which is that an observation could panic because the scope is missing. Once a span is finished it is eligible to be removed in https://github.com/open-telemetry/opentelemetry-go/blob/master/experimental/streaming/exporter/reader/reader.go#L296, which deletes it from the scope, but a Finish done later will panic if it can't find a scope at https://github.com/open-telemetry/opentelemetry-go/blob/master/experimental/streaming/exporter/reader/reader.go#L170. These dangers are repeated everywhere. This just goes back to my concerns about state management and the complexity the current design imposes.

Here is what I have been doing periodically for the project: egrep -r -A5 -B5 -- '(<-|->|panic|recover|sync.Map|map|[Dd]elete|close|append)' [a-z]* - as those are some good prime indicators of race conditions, bugs, and anti-patterns. Running it reveals many dangerous ingredients: panics, background goroutines, sends without context, potential double closes, dangerous use of primitives that could cause deadlocks or panics, and race conditions with append(), all wrapped up in a global registry. Here are some things that I see in my latest grep:

As for #52, I think the entire notion of "touch free" is just completely unreasonable; OpenTelemetry is a project in service of software engineers. We make a career of writing software; I think we can manage importing a few more lines if it means we get more reliable and carefully designed libraries. I guess I can't really comment here beyond that: you're doing the best you can given the requirements. Thanks for the work you're doing, and feel free to tag me for code reviews.

@jmacd
Contributor

jmacd commented Jul 16, 2019

@cstockton The implementation you're looking at has "experimental" in the name because it's not meant for use. It establishes that the API does not restrict us from having a streaming implementation; that is all.

In a production-ready streaming exporter, everything behind the observer.Observer interface would be replaced by a small library (*) to encode events and write them to the wire, which I think is what you've been asking for. I, like you, do not want to see diagnostics data handled in-process, and would have the exporter/reader logic, in particular, implemented by another process.

Not sure what you meant about "foreign" packages.

How concerned are you about memory allocations? In issue 52, I referred to a current proposal to eliminate the API interfaces and SDK concept, to leave behind a concrete API and exporter interfaces (https://github.com/iredelmeier/rfcs/blob/sdk-as-exporter/text/0000-sdk-as-exporter.md). If we took that approach, the Span interface could become a plain struct, and then we wouldn't allocate a span object at all. I sometimes think we carry this type of argument too far--there's still an allocation behind the context.Context when starting a span and at some point I think it's better to be idiomatic than to be efficient.

Well, this issue began as being about "so many interfaces", but most of the concerns I'm seeing are about the implementation behind the interfaces. I believe in these interfaces, since they allow more than one implementation to exist, which means OpenTelemetry can develop streaming SDKs without changing the interfaces.

(*) It would be hand-coded, e.g., a typical protobuf library is too expensive/risky for the same reasons you mention.

@cstockton

cstockton commented Jul 16, 2019

@jmacd I understand it's experimental, but looking at the code was necessary to understand your pull request.

Not sure what you meant about "foreign" packages.

The streaming reader defines Event, which contains references to tag.Map, which in turn accepts MapUpdate structs. The streaming reader's Observe implementation makes calls to the tag.Map interface concurrently, thus relying on the concrete implementation of an external interface for thread safety. Since I assume the reason it is an interface at all is to allow user/vendor-declared tag.Map implementations, I don't think the design will age well.

p.s. found another race condition while writing this.

How concerned are you about memory allocations?

Very, but much of it is easily fixable little stuff, like reducing autotmp allocations caused by placing values into interfaces, etc. I am also concerned about copy pressure due to the pass-by-value of large structs (e.g., observer.Event), but that is also fixable with minor changes.

In issue 52, I referred to a current proposal to eliminate the API interfaces and SDK concept, to leave behind a concrete API and exporter interfaces (https://github.com/iredelmeier/rfcs/blob/sdk-as-exporter/text/0000-sdk-as-exporter.md). If we took that approach, the Span interface could become a plain struct, and then we wouldn't allocate a span object at all. I sometimes think we carry this type of argument too far--there's still an allocation behind the context.Context when starting a span and at some point I think it's better to be idiomatic than to be efficient.

At first glance I thought this was actually close to what I've been proposing, but a couple of things are throwing me off. The API portion here is defined outside the end-user process, which made me think it was a network transport for observation data. But if that is the case, what need is there for third-party library code in the end-user process? It seems to still have an interface for exporters in the library; e.g., does vendor Acme Inc provide a FileExporter, or does OpenTelemetry provide one?

Basically my unwavering position here is this:

  • OpenTelemetry defines a common structural representation of observation data
    • I think all observations, metrics, spans, and "user defined" should all fit in a single data model if we are clever
  • OpenTelemetry defines an initial streaming protocol to carry observation data
    • this should be stateless, with simple batching/ACK cycles
    • meaning it can work over TCP or by writing to files; i.e., any I/O stream serves as a valid backing
    • It can be simple JSON to start. Remember, it is much easier to add transports in the future because they are transparent to the user: as long as the structural representation is extensible and the semantics of streaming are sane, the user-facing API does not change if you provide a more efficient codec. Users may simply update the client library and gain access to a shiny new binary encoding over QUIC, who knows.

This allows an extremely efficient and simple client library. When you rely on sending things to the background for performance, because you have complex in-process pipelines, you are left with a single choice as buffers fill up: drop events. But in some cases I would rather create back pressure than lose an important insight into my software. If we are on the same page here then that is awesome, and I hope your proposal is accepted! Otherwise I'll need some time to read it over again to understand the trade-offs you selected.
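
For what it's worth, a minimal sketch of the kind of client I have in mind: newline-delimited JSON observations written straight to any io.Writer. The field names and event kinds are made up purely for illustration:

package traceclient

import (
	"encoding/json"
	"io"
	"time"
)

// Observation is one self-contained record; span starts/ends, logs, and
// metrics would all map onto this one illustrative shape.
type Observation struct {
	Kind    string            `json:"kind"` // e.g. "span.start", "span.end", "log"
	TraceID string            `json:"trace_id"`
	SpanID  string            `json:"span_id"`
	Time    time.Time         `json:"time"`
	Attrs   map[string]string `json:"attrs,omitempty"`
}

// Client writes each observation as soon as it is made; the backing
// stream can be a TCP connection, a file, or anything else.
type Client struct {
	enc *json.Encoder
}

func New(w io.Writer) *Client { return &Client{enc: json.NewEncoder(w)} }

// Emit ships a single observation immediately; if the writer blocks, the
// caller gets back pressure instead of silently dropped data.
func (c *Client) Emit(o Observation) error {
	return c.enc.Encode(o) // Encode appends a trailing newline
}

A batching layer or binary codec could replace the JSON encoder later without the instrumentation changing.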

Span interface could become a plain struct, and then we wouldn't allocate a span object at all. I sometimes think we carry this type of argument too far--there's still an allocation behind the context.Context when starting a span and at some point I think it's better to be idiomatic than to be efficient.

I think I am following you and like where this is going. I can share with you what our internal telemetry libraries manifested as in Go:

This is the main instrumentation library; Option is implemented by a variety of types, for example:

  • Fields akin to opentracing/log fields may be passed as an Option
  • Forming additional span relationships beyond the current context. For example, I have an api.TraceMeta embedded in many core API types, which returns its original request span ID; I can simply pass any API object to an emit func and get causality.
  • Many API types that are valid options also provide one or more fields so they may log themselves, meaning instrumentation is simple, I just do something like emit.Log(ctx, "api.typename.updated", prev, new) and move on.
  • Note: Begin is "a new root span", Start is "FollowsFrom", Enter is "ChildOf" - these are just artifacts of nomenclature from before the tracing specs existed and before I knew what a "Span" was.

I make heavy use of context values and options throughout my code base to change the current behavior of derived spans; for example, you can instrument operation or field names with additional gauges/counters via options (i.e., WithCounter, WithInFlight). As you can see, everything is contextual; this prevents a lot of problems and meshes well with Go, because context.Context has adapted well into the ecosystem since its introduction. I set up a base context value (package internal/ctxval in most repos), which contains the common context values along with Tracer and Logger, at the very start of my app in main(), then extract and pass them along as needed to inject into request paths. For producer/consumer scenarios, since API types implement api.TraceMeta, I can just pass the API object itself as an option to Begin() in worker scenarios and the span reference is captured.

In summary, you will notice I strive to have a very minimal interface for tracing. Rather than creating a "complex system" that automagically does everything, I leave that to the programmer. So I give them a simple job:

package emit

// Begin a new root span unrelated to any prior operation.
func Begin(ctx context.Context, operation string, option ...Option) context.Context { ... }

// Start a new span with a lifetime that may exceed the current span.
func Start(ctx context.Context, operation string, option ...Option) context.Context { ... }

// Enter a new operation which should not exit before all child calls to Enter have completed.
func Enter(ctx context.Context, operation string, option ...Option) context.Context { ... }

// Close the current span, no future spans related to this one will begin.
func Close(ctx context.Context)  { ... }

// Error will return nil immediately if err is nil, or report err to logs,
// tracing system and return the same error. It records the typ by checking the
// optional `interface Typer { Type() string }` or the Cause() underlying type
// name, if neither is found it uses reflection for %T.
func Error(ctx context.Context, err error, option ...Option) error  { ... }

// Log an event to the currently active span within ctx. 
func Log(ctx context.Context, event string, option ...Option)  { ... }

Package emit uses package tracer & logger under the hood, which both provide simple methods to extract the active logger & tracer:

package tracer
func FromContext(ctx context.Context) Tracer  { ... }
func WithContext(ctx context.Context, tr Tracer) context.Context  { ... }

package logger
func FromContext(ctx context.Context) Logger  { ... }
func WithContext(ctx context.Context, lr Logger) context.Context  { ... }

Usage is simple; all throughout my software I just copy-paste:

func Handle(ctx context.Context, req *api.AcmeRequest) error {
  ctx = emit.Begin(ctx, "acme.Handle", req, metrics.WithCounter())
  defer emit.Close(ctx)

  err := nextpkg.process(ctx, req)
  return emit.Error(ctx, err, req) // last touch point calls emit.Error to use the "error tag extension"
}

func (t *T) process(ctx context.Context, req *api.AcmeRequest) error {
  ctx = emit.Enter(ctx, "nextpackage.Process", 
    t.endpoint, // (*api.Endpoint implements option)
    field.Prefix("backend", t.endpoint), // (which makes it composable, add "backend" to field names)
    field.Int("processor.id", t.id),
    field.String("processor.name", t.name))
  defer emit.Close(ctx)
  return andOnAndOnWeGo(ctx, req)
}

Maybe this provides some inspiration, or maybe it seems unreasonable. But maybe it gives some insight into why I really believe tracing doesn't need so much code, and why trying to hide it from the developer doesn't resonate with me. Providing patterns, wrappers, middleware, and utilities to translate request paths into observations is great. Trying to hide the entire API from developers with zero touch... we are software engineers; we will be okay with some basic responsibility, I promise :- ) Let's not forget what our user base is; tracing SDKs do not have the diverse user base of an iPhone.

Small note: "emit" is an artifact of internal nomenclature. From scratch in the public space, I would have an API just the same, but named more familiarly to project:

ctx = span.Start(ctx, "my.op", ...)
defer span.Close(ctx)

I would then heavily debate whether more top-level span factories are even required. I have Begin, Start, and Enter as a consequence of the old internal tracing, but relations can easily be passed via options, so a single span.Start may cover all three roles very nicely:

ctx = span.Start(ctx, "my.op", span.ChildOf, span.FollowsFrom, ...)
ctx = span.Start(ctx, "my.op", span.ChildOf(someCtx)) // or maybe ?
defer span.Close(ctx)

Haven't thought it through carefully but that is where I would start my thought exercise.

@jmacd
Contributor

jmacd commented Jul 16, 2019

@cstockton Thanks for sharing this example API. That's really helpful. @iredelmeier <- Note.

I think we're pretty well aligned on the ideal of a streaming SDK implementation. It's a good direction, but it requires a lot more support than it has for it to interfere with the current proposals and timeline. That is why I encourage everyone to focus on API-level concerns and the specification, not the current implementation in any of the languages. In other words, if you have a better implementation idea, all that matters right now is whether it can be done without changing the API. We should be talking about which interfaces trigger unnecessary allocations. For example, I'm less concerned with the allocation of a *span to satisfy a Span interface and very concerned with the allocation of a *event to satisfy an Event interface (to support laziness; see #57).

I'm glad you're pushing me to make the experimental/streaming/sdk reach production quality. It needs to be developed alongside an agent or sidecar that can process the data and export it in a way compatible with non-streaming exporters, in order to become practical. That's why I still view it as a proof-of-concept, and am not worried about the reader package. If I could choose a language to implement the reader in, I'm not sure I'd choose Go.

When I first conceived this code, I was thinking that for the proof-of-concept I would simply (JSON- or otherwise) encode that type and read it in another process; that was my rationale for using "one large struct" instead of a bunch of more tailored virtual interfaces. My take now, especially considering https://github.com/iredelmeier/rfcs/blob/sdk-as-exporter/text/0000-sdk-as-exporter.md, is that it would be best if the exporter supported a number of low-level virtual interfaces (e.g., Start, Finish, Log, ...) instead of "one large struct". As you've said, it would be great to specify a low-level event structure from which the higher-level exporters are defined. I'm not saying observer.Event is a good solution to this need, but it is a solution to this need.
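
To make the contrast concrete, here is a rough sketch of the two exporter shapes; the names are illustrative, not the actual observer package:

package exporter

import "time"

// "One large struct": every kind of operation is squeezed into a single
// event type and dispatched through one method.
type Event struct {
	Type            int // start, finish, log, ...
	TraceID, SpanID uint64
	Time            time.Time
	Name, Message   string
}

type Observer interface {
	Observe(e Event)
}

// Tailored virtual interfaces: each low-level operation gets its own
// narrowly typed method.
type LowLevelExporter interface {
	Start(traceID, spanID uint64, name string, at time.Time)
	Log(traceID, spanID uint64, message string, at time.Time)
	Finish(traceID, spanID uint64, at time.Time)
}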

You raised a good question about memory pressure vs "copy pressure" and the observer.Event type. I made everything in observer.Event a plain struct to avoid memory allocations. I do not believe the CPU cost of this copying is something to be concerned about, but maybe you're thinking about extra blocks of stack allocation? I think this is addressed by more-tailored virtual interfaces.

@cstockton

@jmacd Glad to help; I also think we are on the same page now in a general sense. It makes me happy to know you have similar underlying design principles, and given your understanding of these efforts at large, I think you were a lucky contact point for me to run into. I look forward to watching the progress and will be happy to participate in any future discussions or reviews.

For the point I made about copy pressure: the event struct is pretty big at ~224 bytes, so the cost of calls can begin to add up. Beyond that, it tends to get worse over time as people build around it, creating []Event slices, autotmp allocations due to storing it in interfaces, and so on. It's not a big deal really, just something I noticed, and it has lots of easy fixes. You could have the user-facing API be observer.Observer(Event) but pass by reference in the internals via an Observe(*Event) interface and still keep any mutability properties you're after. The only issue with that is the call indirection on the interface method, but some rules on the observer interface could possibly allow pooling to amortize the cost to zero. Just food for thought, thanks again.
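
A rough sketch of that shape: value-typed at the user-facing edge, pointer plus pooling internally. The types are hypothetical, and it assumes the observer never retains the pointer beyond the call (the "rules" mentioned above):

package observer

import "sync"

// Event stands in for the large (~224-byte) observation struct.
type Event struct {
	Type            int
	TraceID, SpanID uint64
	Payload         [24]uint64 // padding so the value is genuinely large
}

// The internal interface takes a pointer, so forwarding an event never
// copies the full struct again.
type internalObserver interface {
	Observe(e *Event)
}

var eventPool = sync.Pool{New: func() interface{} { return new(Event) }}

// Observe is the value-typed, user-facing entry point; internally the
// event is copied once into a pooled object and passed by reference.
// This only amortizes to zero if the observer does not keep e after returning.
func Observe(o internalObserver, e Event) {
	p := eventPool.Get().(*Event)
	*p = e
	o.Observe(p)
	*p = Event{} // clear before reuse so stale data cannot leak
	eventPool.Put(p)
}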

@rghetia
Contributor

rghetia commented Jul 16, 2019

@freeformz do you think we can close this?

@freeformz
Contributor Author

@rghetia Yes.

But before I do, I want to state that I generally agree with @cstockton on multiple points.
