RFC: Evolve host.id into a default, always present attribute (rethinking machine-id) #581

christos68k · 2023-12-04T23:27:07Z

Similarly to #311, I'd like to first describe some issues with the existing host.id semantics around machine-id usage and propose some alternatives.

According to machine-id(5):

It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network. If a stable unique identifier that is tied to the machine is needed for some application, the machine ID or any part of it must not be used directly. Instead the machine ID should be hashed with a cryptographic, keyed hash function, using a fixed, application-specific key.

This goes against OpenTelemetry guidelines which dictate using the value of /etc/machine-id (or /var/lib/dbus/machine-id) verbatim. Besides following the recommendation in the manpage (keyed hash), another alternative is to use UUIDv5 (SHA1), similarly to #312.
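
To make the manpage recommendation concrete, here is a minimal sketch of a keyed hash over the machine-id using HMAC-SHA256. The key value `APP_KEY` and both function names are assumptions made for illustration; nothing in the OpenTelemetry guidelines prescribes them.

```python
import hashlib
import hmac

# Hypothetical application-specific key. machine-id(5) asks for a fixed,
# application-specific key; this particular value is only an example.
APP_KEY = b"example.opentelemetry.host.id"


def read_machine_id(paths=("/etc/machine-id", "/var/lib/dbus/machine-id")):
    """Return the first non-empty machine-id found, or None."""
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as f:
                value = f.read().strip()
        except OSError:
            continue
        if value:
            return value
    return None


def hashed_host_id(machine_id, key=APP_KEY):
    """Derive a stable identifier without exposing the raw machine-id."""
    digest = hmac.new(key, machine_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The same machine-id and key always yield the same value, so correlation still works, while two applications with different keys cannot trivially link their identifiers.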

A secondary issue is that OpenTelemetry guidelines do not specify a fallback for host.id in cases where the machine-id is missing. For example, this is quite common in containerized environments if the Docker volume mount does not exist. In such cases, UUIDv4 can be used to generate a value that may be cached by the client application and reused (for as long as it makes sense given a context that's specific to the application).
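
A rough sketch of that fallback, assuming the application has some writable location it can use as a cache (the `cache_path` parameter and the function name are hypothetical):

```python
import os
import tempfile
import uuid


def cached_random_host_id(cache_path):
    """Return the ID stored at cache_path, generating a UUIDv4 if absent."""
    try:
        with open(cache_path, encoding="utf-8") as f:
            cached = f.read().strip()
        if cached:
            return cached
    except OSError:
        pass  # No usable cache yet; fall through and create one.

    new_id = str(uuid.uuid4())
    # Write to a temporary file and rename it, so a concurrent reader
    # never observes a partially written cache file.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(new_id)
    os.replace(tmp_path, cache_path)
    return new_id
```

The value is only as stable as the file behind `cache_path`: if that location does not survive a restart, each restart yields a new ID.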


mx-psi · 2023-12-07T16:38:20Z

cc @svrnm @sumo-drosiek @mwear (since you have worked on either spec or implementation of host.id on non-containerized systems)

svrnm · 2023-12-07T16:51:24Z

I remember reading the machine-id man file, but missed that section about confidentiality. I don't see a reason given for that?

If I understand the UUIDv5 definition it is equivalent to a "keyed hash", so it appears to be a logical solution to do something similar to #312. The question is what the "namespace" is (assuming the name is the machine-id), I guess this is best a fixed value, since the goal here is to have a "unique identifier that is tied to the machine".
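
As an illustration of the fixed-namespace idea, a minimal sketch with Python's `uuid.uuid5`. The namespace constant below is made up for the example and does not come from any spec. Note that in UUIDv5 the namespace is hashed together with the name as a fixed, public prefix, so it is not a secret key in the HMAC sense.

```python
import uuid

# Hypothetical fixed namespace. Any constant UUID would do, as long as every
# producer of host.id agrees on it; this value is not taken from any spec.
HOST_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "host.id.example.invalid")


def host_id_from_machine_id(machine_id):
    """Return the UUIDv5 (SHA-1) of the machine-id under a fixed namespace."""
    return str(uuid.uuid5(HOST_ID_NAMESPACE, machine_id.strip()))
```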

A secondary issue is that OpenTelemetry guidelines do not specify a fallback for host.id in cases where the machine-id is missing. For example, this is quite common in containerized environments if the Docker volume mount does not exist. In such cases, UUIDv4 can be used to generate a value that may be cached by the client application and reused (for as long as it makes sense given a context that's specific to the application).

This is a topic that drove me crazy. So, to get started: my perspective is that host.id is simply not defined within a container that does not provide its own machine-id, and it is probably best not to have it at all. That's also why there is "For non-containerized systems, this should be the machine-id".

In the case of a container, the container.id is what you want to set. But there is no reliable way to obtain it from within the container, see containerd/containerd#8185

mx-psi · 2023-12-07T17:17:47Z

We discussed this on the System Semantic Conventions WG, let me try to summarize what I said (plus some last-minute thoughts).

My main points were:

Having a unique identifier of a host (let's assume a well-defined concept of a host, even if that's its own can of worms :) ) is a very useful thing to have for correlation and infrastructure monitoring generally. I have worked on this extensively at Datadog, where (as roughly happens in OTel) we rely on multiple sources and have multiple implementations. At Datadog, we typically do not use machine-id but rather use other sources that can be more meaningful to end-users or can be more easily retrieved (e.g. operating system hostname, EC2 instance id, Azure VM id, Kubernetes node and cluster name...).
Despite their usefulness, there are times when there is no meaningful unique identifier. This is the case for containerized systems or certain virtualized environments: container IDs are hard, and it could be that /etc/machine-id is empty, as happens on many container base images, or you may not have access to the hostname, or it may be a random one (also often the case on containers!). This has been an issue for the OTel project in the past (see [processor/resourcedetection] system detector sets host.id to an empty value on containerized setups opentelemetry-collector-contrib#24230).
Generating a random identifier when you cannot get one is fragile in that restarts of the monitoring application (e.g. the Collector) or of the container churn out new IDs. This can be a huge issue because of cardinality explosion and is also a problem we have had on OTel (see [resourcedetectionprocessor]: add host.id to system detector opentelemetry-collector-contrib#18618 (comment) for a user report, see "Additional context" in previously mentioned issue for even more details on how that happened). It's also hard (impossible?) to reliably tell that you are in a container, so you cannot (to my knowledge!) generate it and store it in a file only when you know it's going to be robust across restarts.
If this is really a security concern, we shouldn't be adding it as-is on the OTel SDKs or the OTel Collector. We should instead hash it or transform it in some other way. How to do this is somewhat arbitrary, so we should look at what other monitoring solutions do here and how they hash it.

svrnm · 2023-12-08T11:14:47Z

Based on your summary, the subject of this issue (Evolve host.id into a default, always present attribute) seems not to be something that can be accomplished (at least from within the OpenTelemetry community alone)?

A few extra comments:

Having a unique identifier of a host [...] is a very useful thing to have for correlation and infrastructure monitoring generally.

💯 -- this is really a BIG concern, coming up again and again. In a large environment when something goes wrong but only one (or a subset) of instances within a service are affected, it's crucial to know which one(s). As you outlined above, this is already not trivial for a lasting instance (bare metal, VMs, ...) and gets even more complicated with ephemeral instances (containers, ...). That's also why I raised containerd/containerd#8185, which has a history of similar issues attached to it (see opencontainers/runtime-spec#1105). @mx-psi if there is any value to it I can give a rundown of this to the System Semantic Conventions WG eventually.

It's also hard (impossible?) to reliably tell that you are in a container [...]

To call this out, even if the issue above with containers gets solved somehow, it will very likely stay optional, as it's in the nature of being in a container to not know about being in a container (by default).

christos68k · 2023-12-08T22:30:12Z

Given that we have no guidelines for containerized environments, does it make sense to add a containerized section (or generify the non-containerized one) and specify that the machine-id should be used if it is mapped inside the container? This is quite common through the volume mount and even works on Docker/macOS even though /etc/machine-id doesn't exist on the host:

macos$ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory

macos$ docker run --rm -ti alpine /bin/sh
/ # cat /etc/machine-id
cat: can't open '/etc/machine-id': No such file or directory

macos$ docker run -v /etc/machine-id:/etc/machine-id --rm -ti alpine /bin/sh
/ # cat /etc/machine-id
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Does it make sense to further specify that in cases where machine-id is missing or empty, host.id may be populated with a 'stable' low-cardinality value and leave the implementation up to the SDK/user? An example of such a value could be a MAC address, possibly hashed with another id such as the hostname.
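
One possible shape for such a computed value, as a sketch only: hash a MAC address together with the hostname. The function name and the choice of SHA-256 are assumptions, and whether the resulting value is actually stable depends on the environment (inside containers both inputs can change across restarts).

```python
import hashlib
import socket
import uuid


def derived_host_id(mac=None, hostname=None):
    """Hash a MAC address together with the hostname into a hex ID."""
    if mac is None:
        # uuid.getnode() returns a 48-bit integer. Python may substitute a
        # random value when no hardware address can be found, in which case
        # the result is not stable across processes.
        mac = uuid.getnode()
    if hostname is None:
        hostname = socket.gethostname()
    material = f"{mac:012x}|{hostname}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```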

svrnm · 2023-12-11T09:46:38Z

Given that we have no guidelines for containerized environments, does it make sense to add a containerized section (or generify the non-containerized one)

Under the assumption that a container is identified by container.id, what is the value of having host.id as well? I am also wondering if host.id in that case is even well defined: a container is running within a container engine, that is running on a "host system", that may have a host.id itself. If I now want to say "container with ID x is running on host with ID y", what is the right host.id?

I think the implicit definition (so far), was that host.id is not set within a container.

christos68k · 2024-04-03T18:08:40Z

Resurrecting this thread:

Under the assumption that a container is identified by container.id what is the value of having host.id as well?

These are different attributes with different lifetimes and different semantics. The value of host.id stems from it changing less often over time than container.id, which allows for stable and meaningful correlation at the host level.

I am also wondering if host.id in that case is even well defined: a container is running within a container engine, that is running on a "host system", that may have a host.id itself. If I now want to say "container with ID x is running on host with ID y", what is the right host.id?

Going by the proposed updates in #576, in priority order:

instance_id assigned by cloud provider (if cloud)
machine-id (if mapped inside the container)
Something else that we might or might not want to make explicit (e.g. low-cardinality computed value based on host attributes such as the MAC address)
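
The priority order above could be expressed roughly as follows. This is a sketch, not what #576 specifies: the source names and the callables passed in are placeholders, and a real detector would supply its own lookups.

```python
def resolve_host_id(sources):
    """Return (name, value) for the first source yielding a non-empty value.

    `sources` is an ordered list of (name, callable) pairs; each callable
    returns a string or None. Names and callables are illustrative only.
    """
    for name, getter in sources:
        try:
            value = getter()
        except Exception:
            # A failing source is treated as "not available" in this sketch.
            value = None
        if value and value.strip():
            return name, value.strip()
    return None, None


# Example wiring with stand-in lookups, in the priority order listed above.
example_sources = [
    ("cloud_instance_id", lambda: None),          # not running in a cloud
    ("machine_id", lambda: "0123456789abcdef\n"),  # mapped into the container
    ("derived", lambda: "fallback-value"),
]
```

When no source yields a value, the function returns `(None, None)` and host.id would simply be left unset.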

I think the implicit definition (so far), was that host.id is not set within a container.

To give you an example, our universal profiling product comes with deployment instructions that map machine-id inside the container, and host.id is populated from it. This enables stable correlation across thousands of deployed agents that would otherwise not be possible. Enabling this volume mount is not uncommon and even Docker on macOS supports it.

To recap:

I think we are all in agreement regarding hashing the machine-id value and not using it verbatim (e.g. see Define a common algorithm for service.instance.id #312)
Can we agree that there is value in populating host.id inside containers, if a low-cardinality and stable value (e.g. machine-id) is available? That is to both encourage clients that can (or already) do this, but also to ensure that this behavior is not breaking the spec.

svrnm · 2024-04-08T10:16:34Z

Can we agree that there is value in populating host.id inside containers

Yes, it is valuable, if and only if host.id is the id of the container host.

(e.g. machine-id) is available

Very often machine-id is not available from within the container and mounting it is also not possible/allowed (e.g. in managed environments), and for the same reasons it may not be desirable to expose it verbatim into the container. Additionally the container may create a machine-id on initialization itself and you need to distinguish that (maybe you can check if that file is mounted or not from within the container?)
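
On Linux, one way to explore the "is that file mounted?" question is to look for /etc/machine-id among the mount points listed in /proc/self/mountinfo. The sketch below is only that: the function names are hypothetical, it ignores escaped characters in paths, and a positive result says the file is a mount point, not that it holds the container host's machine-id.

```python
def is_mount_point(path, mountinfo_text):
    """Return True if `path` appears as a mount point in mountinfo text."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # In /proc/self/mountinfo, the fifth field (index 4) is the mount point.
        if len(fields) > 4 and fields[4] == path:
            return True
    return False


def machine_id_is_mounted(mountinfo_path="/proc/self/mountinfo"):
    """Check whether /etc/machine-id is itself a mount point (Linux only)."""
    try:
        with open(mountinfo_path, encoding="utf-8") as f:
            return is_mount_point("/etc/machine-id", f.read())
    except OSError:
        return False
```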

mx-psi · 2024-04-08T16:01:53Z

Can we agree that there is value in populating host.id inside containers, if a low-cardinality and stable value (e.g. machine-id) is available? That is to both encourage clients that can (or already) do this, but also to ensure that this behavior is not breaking the spec.

I think there is value, but the discussion feels a bit theoretical to me. Other than the user passing it explicitly via an environment variable or some other sort of convention, I can't think of a way to reliably retrieve this value on a container and not fall prey to issues like open-telemetry/opentelemetry-collector-contrib#18618 (comment)
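
For the explicit route, one existing convention is the `OTEL_RESOURCE_ATTRIBUTES` environment variable, which carries comma-separated `key=value` pairs. SDKs typically read it themselves; the sketch below only shows the shape of pulling host.id out of it, and the function name is made up for the example.

```python
import os
from urllib.parse import unquote


def host_id_from_env(environ=None):
    """Return host.id from OTEL_RESOURCE_ATTRIBUTES, or None if not set."""
    environ = os.environ if environ is None else environ
    raw = environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == "host.id":
            # Values may be percent-encoded; decode before use.
            return unquote(value.strip()) or None
    return None
```

With this, whoever deploys the container decides the value (for example by passing `OTEL_RESOURCE_ATTRIBUTES=host.id=<value>` at start time), which sidesteps detection from inside the container.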

svrnm · 2024-04-09T08:19:50Z

I can't think of a way to reliably retrieve this value on a container

Unfortunately I haven't had the time to follow up on this, but I still think that our best option is working with container projects to get an at least optional but standardized way of making a container + container host identifiable from within the container, see containerd/containerd#8185

github-actions bot assigned AlexanderWert Dec 4, 2023

svrnm mentioned this issue Dec 8, 2023

Define a common algorithm for service.instance.id #312

Merged

3 tasks

github-actions bot added the Stale label Feb 8, 2024

joaopgrassi removed the Stale label Feb 14, 2024

smith mentioned this issue Apr 2, 2024

For UI mapping of data, use pod name where container id is not present or has no information elastic/kibana#178209

Open

christos68k mentioned this issue Apr 24, 2024

REQUEST: New membership for christos68k open-telemetry/community#2068

Closed

6 tasks
