OTEP: Define ResourceProvider #4316

Draft
wants to merge 13 commits into base: main
395 changes: 395 additions & 0 deletions oteps/4316-resource-provider.md
@@ -0,0 +1,395 @@
# Resource Provider

Define a mechanism for updating the resources associated with an application.

## Motivation

Resources were originally defined as immutable. This works for the common cases in
server-side application development, because the lifespans of most resources associated
with server-side applications either match or outlive the lifespan of the application.

However, it turns out that not all swans are white, and some resources utilized
by applications change while the application is still running. This is especially
true in client-side applications running in the browser and on mobile devices.

Examples of resources whose availability may change include networking (wifi, cellular, none),
application state (foreground, background, sleeping), and session management (sessions
starting and ending without the application being shut down or the browser being
refreshed).

Tracking these resource changes is critical. Without this tracking, it would be impossible
to segment the telemetry correctly. The lifespan of a session is a far more important
segmentation than the lifespan of an application instance, as the application lifespan
is often somewhat arbitrary. The performance of an application when it is foregrounded
cannot be understood when the foregrounded telemetry is mixed with backgrounded
telemetry. Failure modes may exist when network availability drops due to a switch
in networking – how an application performs when it has access to wifi vs when it
does not is a critical distinction.

## Explanation

Changes to resources and entities are managed via a ResourceProvider. When the resources
represented by an entity change, the telemetry system records these changes by updating
the entity managed by the ResourceProvider. These changes are then propagated to the
rest of the telemetry system via EntityListeners that have been registered with the
ResourceProvider.

The loose coupling provided by a ResourceProvider allows each subsystem to focus
on its own responsibilities without having to be directly aware of the others.
For a highly extensible cross-cutting concern such as OpenTelemetry, this loose
coupling is a valuable feature.

## Internal details

Like the other Providers used in OpenTelemetry, the ResourceProvider MUST allow
for alternative implementations. This means that the ResourceProvider API and
the ResourceProvider implementation we provide MUST be loosely coupled, following
the same API/SDK pattern used everywhere in OpenTelemetry.

### EntityListener

An EntityListener MUST provide the following operations:

* `On EntityState`
* `On EntityDelete`

#### On EntityState

`On EntityState` MUST accept the following parameters:

* `EntityState`: represents the entity that has changed.
* `Resource`: represents the entire set of resources after the entity changes
have been applied.

#### On EntityDelete

`On EntityDelete` MUST accept the following parameters:

* `EntityDelete`: represents the entity that has been deleted.
* `Resource`: represents the entire set of resources after the entity
has been deleted.
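
As a rough illustration of the listener contract above, here is a minimal Python sketch. The type and method names (`EntityState`, `EntityDelete`, `RecordingListener`, `on_entity_state`, `on_entity_delete`) and the use of a plain mapping as the `Resource` are assumptions made for this example; the OTEP does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import Mapping, Protocol

# Hypothetical stand-in: a resource is modelled as a read-only attribute mapping.
Resource = Mapping[str, str]


@dataclass(frozen=True)
class EntityState:
    """Hypothetical event: the state of an entity after a change."""
    entity_id: str
    name: str
    attributes: Mapping[str, str]


@dataclass(frozen=True)
class EntityDelete:
    """Hypothetical event: an entity that has been removed."""
    entity_id: str
    name: str


class EntityListener(Protocol):
    """The two operations an EntityListener provides."""

    def on_entity_state(self, state: EntityState, resource: Resource) -> None: ...

    def on_entity_delete(self, deleted: EntityDelete, resource: Resource) -> None: ...


@dataclass
class RecordingListener:
    """Example listener that only records what it was told."""
    events: list = field(default_factory=list)

    def on_entity_state(self, state: EntityState, resource: Resource) -> None:
        self.events.append(("state", state.entity_id, dict(resource)))

    def on_entity_delete(self, deleted: EntityDelete, resource: Resource) -> None:
        self.events.append(("delete", deleted.entity_id, dict(resource)))
```

In both callbacks the second argument is the full resource after the change, so a listener never has to recompute the merge itself.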

### ResourceProvider

A `ResourceProvider` MUST provide the following operations:

* `Add Entity`
* `Update Entity`
* `Delete Entity`
* `Get Resource`
* `On Change`

For multithreaded systems, a lock SHOULD be used to queue all calls to `Add Entity`,
`Update Entity`, and `Delete Entity`. This helps to avoid inconsistent reads and writes.

The resource reference held by the ResourceProvider SHOULD be updated atomically,
so that calls to `GetResource` do not require a lock.

Calls to EntityListeners SHOULD be serialized, to avoid thread safety issues and
ensure that callbacks are processed in the right order.
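
The threading guidance above (serialized writes, lock-free reads) might look roughly like the following Python sketch. `ResourceHolder` is a hypothetical helper invented for this example, and the sketch assumes that rebinding a single object reference is atomic in the runtime; a language without that guarantee would use an explicit atomic reference instead.

```python
import threading
from types import MappingProxyType


class ResourceHolder:
    """Hypothetical helper: writes are serialized, reads take no lock."""

    def __init__(self, attributes):
        self._write_lock = threading.Lock()
        # Readers only ever see a complete, read-only snapshot.
        self._resource = MappingProxyType(dict(attributes))

    def get_resource(self):
        # No lock: one reference read returns either the old snapshot or
        # the new one, never a partially updated mapping.
        return self._resource

    def update(self, changes):
        with self._write_lock:  # writers queue up behind the lock
            merged = dict(self._resource)  # copy, never mutate in place
            merged.update(changes)
            # Publish the new snapshot by rebinding one reference
            # (ASSUMPTION: this rebinding is atomic in the runtime).
            self._resource = MappingProxyType(merged)
```

Because each snapshot is never mutated after publication, a reader that obtained the resource before an update keeps a consistent view for as long as it holds the reference.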

#### ResourceProvider creation

Creation of a ResourceProvider MUST accept the following parameters:

* `Entities`: a list of entities.

Internally, the entities MUST be merged in the order provided to create the initial
resource.
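
A minimal sketch of merging entities in order is shown below. The `Entity` dataclass and `merge_entities` function are hypothetical, and the conflict rule (an entity later in the list overrides an earlier one on the same key) is an assumption of this example; the text above only states that the merge happens in the order provided.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping


@dataclass
class Entity:
    """Hypothetical entity: an ID, a name, and a set of attributes."""
    entity_id: str
    name: str
    attributes: Mapping[str, str]


def merge_entities(entities: List[Entity]) -> Dict[str, str]:
    """Merge attributes in list order.

    ASSUMPTION: on a key conflict, the entity that appears later wins.
    """
    resource: Dict[str, str] = {}
    for entity in entities:
        resource.update(entity.attributes)
    return resource


# Illustrative attribute keys; "env" is deliberately set by both entities.
initial = [
    Entity("svc", "service", {"service.name": "shop", "env": "prod"}),
    Entity("sess", "session", {"session.id": "s1", "env": "canary"}),
]
merged = merge_entities(initial)
```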

#### Add Entity

`Add Entity` appends a new entity on to the end of the list of entities.

Add Entity MUST accept the following parameters:

* `ID`: the ID of the Entity being created.
* `name`: the name of the Entity being created.
* `attributes`: the set of attributes associated with the entity.

After an entity is created, it MUST be appended to the list of current entities.
A new resource object MUST be generated by merging the list of entities together in order.

`Add Entity` MUST trigger the `On EntityState` operation for all
registered `EntityListeners`.

#### Update Entity

`Update Entity` replaces the resource attributes associated with an entity.

Update Entity MUST accept the following parameters:

* `ID`: the ID of the Entity being updated.
* `attributes`: the new set of attributes associated with the entity.

After an entity is updated, a new resource object MUST be generated by merging
the list of entities together in order.

`Update Entity` MUST trigger the `On EntityState` operation for all
registered `EntityListeners`.

#### Delete Entity

`Delete Entity` removes an entity from the list of entities.

Delete Entity MUST accept the following parameters:

* `ID`: the ID of the Entity being deleted.

After an entity is deleted, a new resource object MUST be generated by merging
the list of entities together in order.

`Delete Entity` MUST trigger the `On EntityDelete` operation for all
registered `EntityListeners`.
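
The following sketch walks through how the merged resource could evolve across an add, an update, and a delete. It is only an illustration: the attribute keys (`app.state`, `app.screen`) are made up, and the rule that later entities override earlier ones on a key conflict is an assumption.

```python
from collections import OrderedDict


def merge(entities):
    """ASSUMPTION: later entities override earlier ones on key conflicts."""
    resource = {}
    for attributes in entities.values():
        resource.update(attributes)
    return resource


# Entity ID -> attributes, kept in insertion order.
entities = OrderedDict()
entities["service"] = {"service.name": "shop", "app.state": "unknown"}

# Add Entity: the new entity is appended to the end of the list.
entities["app"] = {"app.state": "foreground", "app.screen": "home"}
after_add = merge(entities)

# Update Entity: the new attribute set replaces the old one entirely,
# so "app.screen" disappears instead of lingering.
entities["app"] = {"app.state": "background"}
after_update = merge(entities)

# Delete Entity: the resource is regenerated from the remaining entities,
# so the value supplied by "service" becomes visible again.
del entities["app"]
after_delete = merge(entities)
```

Regenerating the resource from the full list after every change keeps the result independent of the history of edits: only the current entities and their order matter.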

#### Get Resource

`Get Resource` MUST return a reference to the current resource held by the ResourceProvider.

#### On Change

`On Change` registers an `EntityListener` to be called every time an entity is added,
updated, or deleted.

### SDK Changes

NewTraceProvider, NewMetricsProvider, and NewLoggerProvider now take a ResourceProvider

Please add links; I do not really understand what you want here.
Also, that sounds to me as if a lot of implementations would break since this cannot be done gracefully in languages lacking proper overload resolution.

Contributor Author

We are allowed to issue new major versions of the SDK constructors, but it is best if we can avoid it. Implementations lacking overload are architected to use other forms of extensibility to allow for additional config parameters, so that part is not really a problem. But we have identified that the Go SDK may break for other reasons, and we need to be careful.

If you're asking about Providers in general, I would suggest reading the tracing and logging portion of the spec, it will give you a good idea of the patterns that we use in the SDK design. Looking at metrics is also a good idea, but it's a bit more complicated than the other signals.

as a parameter. How SDKs handle resource changes is listed under [open questions](#open-questions).

## Trade-offs and mitigations

This change should be fully backwards compatible, with one potential exception:

This change is so fundamental that I would not expect it to be compatible in any way. Hence, I will not even start discussing your use case below.

Contributor Author

From the perspective of the data model and the wire protocol, it is a hard requirement that this change is 100% compatible. That is why any proposal must preserve the Resource section of the data model in a way that allows older systems to continue to work.

When we look at compatibility in Otel, there are three specific areas:

  • The Instrumentation API must remain 100% compatible to prevent dependency conflicts.
  • OTLP and the data model must remain 100% backwards compatible from the perspective of existing systems that want to continue to consume the old data model.
  • The SDK can have compatibility broken, but it should be avoided. If breaking changes are implemented, it is best that it continue to work in a hybrid mode for at least some time before the deprecated components are removed.

fingerprinting. It is possible that an analysis tool which accepts OTLP may identify
individual services by creating an identifier by hashing all of the resource attributes.
Comment on lines +160 to +162
Contributor

Does this mean this will be in a v2 of the SDK given this is backwards incompatible?

Contributor Author

Yes I believe it would, especially if all of the details about provider setup are exposed (which they are in v1). In a world where users have a config file and a NewSDK constructor that encapsulates all of these details, I think it would break far fewer users. But it would still be a major version bump, I don't see how it couldn't be.

In general, it seems like switching to entities would change how resources are detected, and I imagine that alone would probably create a breaking change to SDK setup.

Contributor @jsuereth, Dec 12, 2024

We had been going through contortions to try to avoid a version bump.

If we think it's easier to just bump the sdk version - we need to discuss that with sdk maintainers but it would give us a lot of options in API design.

Contributor Author

I think that Go may be in a special situation here.

Breaking the SDK is not great, it would be better if there is overlap and a grace period rather than a hard break.

For reference, this is the upgrade strategy we use, which depends on users being able to push to the latest version of the SDK. We don't want to create a situation where maintainers have to maintain two branches.


In practice, the only implementation we have discovered that does this is the OpenTelemetry

Prometheus target_info

Contributor Author

Prometheus target_info makes a hash of the entire resource object? It doesn't have a list of specific identifying attributes that it uses?

it uses service name/namespace/instance.id and is, hence, equally not compatible.

Contributor Author

I'm not sure I follow. How would adding a transient attribute such as session.id or user.name to the resource object affect target_info?

The way target_info is currently implemented requires that the set of resource attributes does not change. If some software relies on that property and you change it, that change breaks that software.

Contributor Author

So I checked with the Prometheus maintainers about target_info. They say it only uses a small number of specific attributes such as service.name, etc. That makes target_info a good example of something that would not be affected by the changes being proposed, as those keys and values would not be perturbed in any way.

Go SDK. But the Go SDK is not a backend; all analysis tools have a specific concept
of identity that is fulfilled by a specific subset of resource attributes.

Since we control the Go SDK, we can develop a path forward specific to that particular
library. That path should be identified before this OTEP is accepted.
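
To make the fingerprinting concern concrete, the sketch below contrasts hashing every resource attribute with hashing only a fixed subset. The `fingerprint` function, the choice of SHA-256 over JSON, and the list of identifying keys are assumptions made for illustration; they do not describe how any particular backend or SDK computes identity.

```python
import hashlib
import json

# ASSUMPTION: these identifying keys are chosen for illustration only.
IDENTIFYING_KEYS = ("service.name", "service.namespace", "service.instance.id")


def fingerprint(attributes, keys=None):
    """Hash every attribute (keys=None) or only the selected keys."""
    selected = {k: v for k, v in attributes.items() if keys is None or k in keys}
    encoded = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


before = {"service.name": "shop", "service.instance.id": "i-1", "session.id": "s1"}
after = {**before, "session.id": "s2"}  # only a transient attribute changed

# Hashing everything yields a new identity when the session changes.
changed_when_hashing_all = fingerprint(before) != fingerprint(after)

# Hashing only the identifying subset is unaffected by the session change.
stable_when_hashing_subset = (
    fingerprint(before, IDENTIFYING_KEYS) == fingerprint(after, IDENTIFYING_KEYS)
)
```

A consumer that derives identity from a fixed subset of keys is unaffected by updates to other attributes, which is the distinction the paragraph above draws between the Go SDK and backends.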

Beyond fingerprinting, there are no destabilizing changes because the resources
that we have already declared "immutable" match the lifespan of the application
and have no reason to be updated. Developers are not going to start messing with

I would not make such statements.

Contributor Author

Where these changes break the "immutability" of resources is a question the community continuously asks for clarification on, so it is important that we address it here. But I'm sure that this section could be phrased better.

you don't get anything from immutability if you, in turn, drown the callers in a stream of immutable objects; the result is inconsistency and inefficiency

Contributor Author

I agree, but that's actually why using the term "immutable" in the spec was never a very accurate description. It only meant "immutable" from the perspective of the local object in code. Not being specific about how we use that word has definitely led to a lot of confusion.

the `service.instance.id` resource arbitrarily just because they can, and resource
detectors solve the problem of accidentally starting the application while async
resources are still being fetched.

## Prior art and alternatives

An alternative to updating resources would be to create span, metrics, and log
processors which attach these resource attributes to every instance of every
span and log.

There are two problems with this approach. One is that the duplication of attributes
is very inefficient. This is a problem on clients, which have limited network
bandwidth and processing power. This problem is compounded by a lack of support
for gzip and other compression algorithms on the browser.

Second, and perhaps more important, is that this approach does not match our data
model. These application states are global; they are not specific to any particular
transaction or event. Span attributes should record information specific to that
particular operation, log record attributes should record information specific to
that particular event. In our data model, attributes that identify the application,
describe the environment the application is running in, and describe the resources
available to that application should be modelled as resource attributes.
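
The duplication cost described as the first problem above can be estimated with a small calculation. The sketch below uses JSON as a stand-in encoding (it is not OTLP), and the attribute keys, span shapes, and batch size are all made up for illustration.

```python
import json

# Hypothetical client-side attributes and a batch of 100 small spans.
shared = {
    "session.id": "s1",
    "network.connection.type": "wifi",
    "app.state": "foreground",
}
spans = [{"name": f"op-{i}", "attributes": {"http.status": 200}} for i in range(100)]

# Option 1: a processor copies the shared attributes onto every span.
per_span = [{**s, "attributes": {**s["attributes"], **shared}} for s in spans]
per_span_bytes = len(json.dumps({"spans": per_span}))

# Option 2: the shared attributes are sent once, as the batch's resource.
per_batch_bytes = len(json.dumps({"resource": shared, "spans": spans}))

# Extra bytes paid for repeating the shared attributes on each span.
overhead_bytes = per_span_bytes - per_batch_bytes
```

The overhead grows linearly with the number of spans in the batch, whereas the resource form pays for the shared attributes once.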

## Open questions

The primary open question – which must be resolved before this OTEP is accepted –
Contributor

I assume pending acceptance of this OTEP we'd begin experimenting here?

This seems like the biggest point to address ASAP.

why is this relevant at all? Spans are produced after the end of activity and the snapshot is taken then. Identifying attributes need to remain stable anyway, so the change of descriptive attributes should be known by the backend and the backend can provide adequate information where required.

Contributor Author

We need to prototype extensively before accepting this OTEP. Like all OTEPs, it should not be accepted until we can link to working examples that illustrate how we plan to resolve these concerns.

is how to handle spans that bridge a change in resources.

For example, a long running background operation may span more than one session.
Networking may change from wifi to a cellular connection at any time, a user might
log in at any time, the application might be backgrounded at any time.

Simply put, how should the SDK handle spans that have already started when a resource
changes? What about the logs that are associated with that span? EntityState can be
used to record the exact moment when these values change. But resources need to act
as search indexes and metric dimensions. For those situations, we only get to pick
one value.

The simplest implementation is for the BatchProcessor to listen for resource changes,

The naive implementation could lead to a degenerate case, for example, on mobile, where something like a network connection quickly oscillates between connected and unconnected. This would effectively nerf the batching, cutting a new batch a few times a second (which is big in mobile).

Whether this case needs to be handled well, or simply called out so that implementations can protect themselves from it, is something we should address. I don't think that we need to solve this well at this point for us to move this forward.

I think what we should learn from this observation is that descriptive attributes cannot be sent with the data but must be sent through another channel instead. IIRC, there already is a proposal for such a channel.

Also, reaching this point, I no longer like the network example as it seems to show that it should not have been a resource attribute in the first place. I'd rather use service instances migrating between hardware or threads.

Contributor

SO regarding flagellations + descriptive attributes - I think we need to encourage (but not force) users not to have descriptive attributes in RESOURCE, but allow them in ENTITY. Additionally, we should prefer NOT to identify source of telemetry with something that is volatile. These concerns should be addressed in how we model entity types within Semantic Conventions.

We discussed this a bit in person, but effectively things that MAY be stable in a server-side context MAY NOT be stable in a mobile context (or less stable). So choice of which "entities" to use may need to be localized to an application/service deployment. OTEL needs to be flexible enough to support these both. What we should NOT do is prevent valid identities on Mobile because they don't work in servers or vice-versa.

Contributor Author

Yes, but let's be careful to not overcomplicate things with a lot of philosophy! We run into a lot of trouble and bike shedding when we lose track of the practical usage of this data.

Resources are labels applied to all of the data in a batch. Entities are groups of labels. EntityStates are a timeline of changes with precise timestamps as to when the change occurred.

From that perspective, we can ask the question: if a resource such as the networking stack thrashes, or even changes once in the middle of an operation, what label (or labels) is the most helpful to put on that batch of data? And at what granularity should we be segmenting batches of data? The fine grained history is in the EntityState stream. Anything in resources is going to be very coarse grained. So we need to pick a coarse-graining strategy, based on how we intend backends to make use of the resource labels as a practical matter – not based on a philosophical theory.

Contributor Author

For example, perhaps some entities can be marked as "volatile" in a way that causes them to stack up in the same batch rather than trigger a flush.

and to flush the current batch whenever a change occurs. The old batch gets the old
resource, the new batch gets the new resource. This would be easy to implement,
but is it actually what we want? Either as part of this OTEP or as a quick follow
up, we need to define the expected behavior for the BatchProcessor when it is listening
for resource changes.
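
The "flush on change" behavior described above could be sketched as follows. `FlushOnChangeBatcher`, its method names, and the `export` callable are hypothetical; this is not the BatchProcessor of any existing SDK, and it ignores timers, size limits, and threading.

```python
class FlushOnChangeBatcher:
    """Hypothetical batcher that cuts a new batch whenever the resource changes."""

    def __init__(self, resource, export):
        self._resource = dict(resource)
        self._export = export  # callable(resource, items); stand-in for an exporter
        self._items = []

    def add(self, item):
        self._items.append(item)

    def flush(self):
        if self._items:
            self._export(self._resource, self._items)
            self._items = []  # start a fresh list; the exported one is untouched

    # EntityListener-style callbacks (names are assumptions).
    def on_entity_state(self, state, resource):
        self._resource_changed(resource)

    def on_entity_delete(self, deleted, resource):
        self._resource_changed(resource)

    def _resource_changed(self, resource):
        self.flush()                     # the old batch keeps the old resource
        self._resource = dict(resource)  # later items get the new resource
```

As noted in the review discussion, a resource that oscillates quickly would make this naive version emit many tiny batches, so a real implementation would likely need some form of protection against that case.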

## FAQ

### Is there some distinction between "identifying resources" and "updatable resources"?

Surprising as it may be, there is no direct correlation between an attribute being
"identifying" and that attribute matching the lifespan of an application.

Some resources are used to identify a service instance – `service.name`, `service.instance.id`, etc.
These resources naturally match the lifespan of the service instance. An "immutability requirement"
is not necessary in this case because there is no reason to ever update these values.

Other resources are used to identify other critical lifespans, and these values
do change. For example, the user's identity may change as the user logs in and out
of the application. And multiple sessions may start and end over the lifespan of
an application.

Therefore there is no need to conflate "identifying" with "immutable." Telemetry
simply models reality. If we change a resource attribute that is not supposed to
change, that is an implementation error. If we don't change a resource attribute
when the application state changes, that is also an implementation error. With the
correct tools these errors are unlikely, as it is very obvious when these individual
attributes should and shouldn't change.

### Why were resources immutable in the first place?

btw. is there any prior art for a OTel-like resource concept in observability?

Contributor

Yeah, a few things.

From OpenCensus - https://github.com/census-instrumentation/opencensus-specs/tree/master/resource
From Stackdriver (Now Google Cloud Observability) - https://cloud.google.com/monitoring/api/resources

There may be others. I have a write-up (in Entities WG notes document) about lessons learned from Google Cloud's resource design.

Contributor Author

Most proprietary APM products have something like this in them, but they are closed source so I can't point at the code.


Use of the term "immutable" points at the real reason this requirement was initially
added to the specification. When an application initially boots up, gathering some resources
requires async operations that may take time to complete. The start of the application
must be delayed until all of these resources are resolved, otherwise the initial
batches of telemetry would be poorly indexed. This initial telemetry is critical
and too valuable to lose due to a late loading of certain resources.

A convenient cudgel with which to beat developers into doing the right thing is
to make the resource object "immutable" by not providing an update function. This
makes it impossible to late load resources and helps to avoid this mistake when
installing OpenTelemetry in an application.

However, OpenTelemetry has since developed a resource detector pattern that gives
developers the tools they need to cleanly resolve all initial resources before
application start. This is a sufficient solution for the bootstrapping problem;
at this point in OpenTelemetry's development adding an update function to resources
would not cause issues in this regard.

## Example Usage

DRAFT

## Example Implementation

Pseudocode examples for a possible Validator and ResourceProvider implementation.
Attention is placed on making the ResourceProvider thread safe, without introducing
any locking or synchronization overhead to `GetResource`, which is the only
ResourceProvider method on the hot path for OpenTelemetry instrumentation.

```php
// Example of a thread-safe ResourceProvider
class ResourceProvider{

I do not really understand the pseudo code syntax. It seems to be Go with classes and some strange extra syntax that I do not get. Also, explicit this is really uncommon and only used in languages that require it due to bad language design.

Contributor

I'll make this comment again - The syntax/language for pseudo-code isn't important here.

Are you able to understand what the goal of the interface is from the description and the example? If so, let's evaluate that, not choice of pseudo-code syntax.

If you have specific things you don't understand, list them so they can be addressed.

Contributor Author

I would add that regardless of the pseudo-code someone chooses to use, we request that the examples be 100% explicit and not assume that the reader knows any implicit details about a particular programming language.


  *Resource resource
  Lock lock
  OrderedMap[string:Entity] entities // an ordered map of Entities that uses entity IDs as keys
  Array[EntityListener] listeners

  GetResource() Resource {
    return this.resource;
  }

  OnChange(EntityListener listener) {
    this.lock.Acquire();

    this.listeners.Append(listener);

    this.lock.Release();
  }

  AddEntity(string ID, string name, Map[Attribute] attributes){
    this.lock.Acquire();

    // Create a new entity from the ID, name, and attributes
    var entity = NewEntity(ID, name, attributes);

    // Append the entity to the end of the OrderedMap and set the key to the ID
    this.entities.Append(ID, entity);

    // create a new resource
    var mergedResource = NewResource(this.entities);

    // safely change the resource reference without blocking
    AtomicSwap(this.resource, mergedResource);

    // create an EntityState event from the entity
    var entityState = entity.EntityState();

    // calling listeners inside of the lock ensures that the listeners do not fire
    // out of order or get called simultaneously by multiple threads, but would
    // also allow a poorly implemented listener to block the ResourceProvider.

Yes, plus the statement above is not really correct since the listener could just enqueue a task in a thread pool. I'm not sure why this part should be specified here. It isn't even required in languages/runtimes that do not have threads or commonly don't use them.

Contributor Author

Yes, when it comes to language-specific examples it's best to look at prototypes, not pseudo-code. The pseudo-code is helpful to show that there is at least one way to implement the proposed changes in a multi-threaded language. It shouldn't be considered the only way or even the best way, but just something to allow us to discuss potential issues with the design that implementations may need to think about.

    for (listener in this.listeners) {
      listener.OnEntityState(entityState, mergedResource);
    }

    this.lock.Release();
  }

  UpdateEntity(string ID, Map[Attribute] attributes){
    this.lock.Acquire();

    // Acquire the correct entity based on ID
    var entity = this.entities[ID];

    // If there is no entity, log the error and return. This follows the pattern
    // of not returning errors in the OpenTelemetry API.
    if (!entity) {
      LogError(EntityNotFound);
      this.lock.Release();
      return;
    }

    // Replace the attributes on the entity.
    entity.attributes = attributes;

    // create a new resource
    var mergedResource = NewResource(this.entities);

    // safely change the resource reference without blocking
    AtomicSwap(this.resource, mergedResource);

    // create an EntityState event from the entity
    var entityState = entity.EntityState();

    // calling listeners inside of the lock ensures that the listeners do not fire
    // out of order or get called simultaneously by multiple threads, but would
    // also allow a poorly implemented listener to block the ResourceProvider.
    for (listener in this.listeners) {
      listener.OnEntityState(entityState, mergedResource);
    }

    this.lock.Release();
  }

  // Delete Entity only needs the ID of the entity being removed.
  DeleteEntity(string ID){
    this.lock.Acquire();

    // Acquire the correct entity based on ID
    var entity = this.entities[ID];

    // If there is no entity, log the error and return. This follows the pattern
    // of not returning errors in the OpenTelemetry API.
    if (!entity) {
      LogError(EntityNotFound);
      this.lock.Release();
      return;
    }

    // remove the entity from the map
    this.entities.Delete(ID);

    // create a new resource
    var mergedResource = NewResource(this.entities);

    // safely change the resource reference without blocking
    AtomicSwap(this.resource, mergedResource);

    // create an EntityDelete event from the entity
    var entityDelete = entity.EntityDelete();

    // calling listeners inside of the lock ensures that the listeners do not fire
    // out of order or get called simultaneously by multiple threads, but would
    // also allow a poorly implemented listener to block the ResourceProvider.
    for (listener in this.listeners) {
      listener.OnEntityDelete(entityDelete, mergedResource);
    }

    this.lock.Release();
  }
}
```
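
For readers who prefer something executable, the pseudocode above can be approximated in Python as shown below. This is a sketch under several assumptions rather than a reference implementation: listeners receive the entity ID instead of `EntityState`/`EntityDelete` objects, later entities override earlier ones on key conflicts, an unknown ID is silently ignored instead of logged, adding an existing ID replaces it in place, and publishing the new resource relies on reference rebinding being atomic in the runtime.

```python
import threading
from collections import OrderedDict
from types import MappingProxyType


class ResourceProvider:
    def __init__(self, entities=()):
        self._lock = threading.Lock()
        self._entities = OrderedDict()  # entity ID -> (name, attributes)
        self._listeners = []
        for entity_id, name, attributes in entities:
            self._entities[entity_id] = (name, dict(attributes))
        self._resource = self._merge()

    def _merge(self):
        # ASSUMPTION: later entities win on key conflicts.
        merged = {}
        for _name, attributes in self._entities.values():
            merged.update(attributes)
        return MappingProxyType(merged)  # read-only snapshot

    def get_resource(self):
        return self._resource  # hot path: no lock taken

    def on_change(self, listener):
        with self._lock:
            self._listeners.append(listener)

    def add_entity(self, entity_id, name, attributes):
        with self._lock:
            self._entities[entity_id] = (name, dict(attributes))
            self._publish("on_entity_state", entity_id)

    def update_entity(self, entity_id, attributes):
        with self._lock:
            if entity_id not in self._entities:
                return  # a real SDK would log the error here
            name, _old = self._entities[entity_id]
            self._entities[entity_id] = (name, dict(attributes))
            self._publish("on_entity_state", entity_id)

    def delete_entity(self, entity_id):
        with self._lock:
            if self._entities.pop(entity_id, None) is None:
                return  # unknown ID: nothing to delete
            self._publish("on_entity_delete", entity_id)

    def _publish(self, callback_name, entity_id):
        # Called with the lock held, so listeners fire one at a time, in order.
        self._resource = self._merge()
        for listener in self._listeners:
            getattr(listener, callback_name)(entity_id, self._resource)
```

As in the pseudocode, listeners run while the lock is held, so a slow listener, or one that calls back into the provider, would block further changes.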