
Conversation

@pree-dew (Contributor) commented Oct 9, 2025

Adds telemetry to cover failure modes that are not captured by container logs, plus metrics for detecting resource constraints.

Motivation and Context

We should be notified when there is any issue with the registry container.

How Has This Been Tested?

  • Local setup

Breaking Changes

  • No

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • No additional exporter is used; this takes advantage of the OpenTelemetry Collector. A rough sketch of the collector wiring follows this list.

  • Covers metrics related to resource constraints; currently limited to the default namespace.

  • Captures Kubernetes events as logs, which are the source for diagnosing problems with the service. This covers scenarios where a pod is not able to start and would otherwise be missed because there are no container logs in such cases. Limited to the default namespace.

  • Takes care of the DaemonSet deployment, i.e. running the OTel Collector as an agent on each node, by using the correct node filtering.

  • The only cardinality-contributing labels are pod IDs (this needs more observation); node IDs will not increase cardinality, since scaling up still results in a limited number of nodes.

  • Resource metrics are shipped every 60s; the list of metrics that will be emitted is at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml

  • Container errors (screenshot attached)

  • Resource metrics (screenshot attached)
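For illustration only, here is a rough sketch of what a DaemonSet-mode collector configuration covering the points above could look like. This is not the configuration shipped in this PR; the exporter, endpoint, environment variable name, and pipeline layout are assumptions.

```yaml
# Sketch only: kubeletstats for resource metrics, k8s_events for event logs,
# both restricted to the collector's own node / the default namespace.
receivers:
  kubeletstats:
    collection_interval: 60s              # metrics shipped every 60s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"   # env var name is an assumption
    metric_groups: [node, pod, container]
  k8s_events:
    namespaces: [default]                 # events limited to the default namespace

processors:
  k8sattributes:
    filter:
      node_from_env_var: K8S_NODE_NAME    # agent mode: only watch pods on this node
  filter/default-ns:
    metrics:
      datapoint:
        - 'resource.attributes["k8s.namespace.name"] != "default"'

exporters:
  otlphttp:                               # placeholder; the actual exporter may differ
    endpoint: https://otel-backend.example.com

service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      processors: [k8sattributes, filter/default-ns]
      exporters: [otlphttp]
    logs:
      receivers: [k8s_events]
      processors: [k8sattributes]
      exporters: [otlphttp]
```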

@pree-dew (Contributor, Author) commented Oct 9, 2025

Issue #509

@pree-dew (Contributor, Author) commented Oct 9, 2025

@rdimitrov @domdomegg @tadasant Is there a possibility of running a deployment on staging for some time before pushing to production? I wanted to check the cardinality numbers for this release before it goes to production, something like the screenshot below; here the numbers are high because of the many deployments I have done, which won't be the case in production. I have tested this thoroughly, but still wanted to see if this is an option.

(Screenshot attached.)

@rdimitrov (Member) commented:


I believe what we have now is what you describe - all commits to main get deployed to staging and only the latest release gets deployed to prod. Would that work for you or perhaps you'll need something else?

@pree-dew (Contributor, Author) commented Oct 13, 2025

@rdimitrov That works, assuming the release to the production environment happens only after we verify the details above; basically, we might have to wait for another release to be deployed in the meantime.

Just being cautious here: looking at the workflow files, it seems like everything gets deployed to production. Is there anywhere I can check the latest-release part, just to make sure this approach will work?

@rdimitrov (Member) commented:

You can check the workflows for how the container images are published, but overall we use the main floating tag for staging and the latest floating tag for prod, of which:

  • The main tag follows all commits to main (screenshot attached)
  • The latest tag follows all released versions, i.e. v1.x.y, etc. (screenshot attached)

A sketch of how such floating tags are typically produced follows.
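For illustration only (this is not the actual publish workflow in this repository; the image name and action version are assumptions), floating tags like main and latest are commonly produced with docker/metadata-action:

```yaml
# Hypothetical tagging step, for illustration only:
# - type=ref,event=branch produces the floating `main` tag on branch pushes
# - type=semver produces version tags from release tags like v1.x.y
# - flavor latest=auto adds the floating `latest` tag on release builds
- name: Compute image tags
  uses: docker/metadata-action@v5
  with:
    images: ghcr.io/modelcontextprotocol/registry   # assumed image name
    tags: |
      type=ref,event=branch
      type=semver,pattern={{version}}
    flavor: |
      latest=auto
```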

@pree-dew (Contributor, Author) commented:

@rdimitrov Thank you for clearing this up 🙂

I should have clarified this earlier, my bad. I am not looking at the release process for a specific tag; I wanted to get the stats for this branch before it gets deployed to production. In deploy.yml, every commit to main is deployed to staging and then to production, so when this branch is merged it will be deployed to both environments. But to understand the cardinality (as per the screenshot above), I need to collect stats from the staging environment first, verify them, and only deploy to production if the cardinality is under control.

@domdomegg (Member) commented Oct 16, 2025

In deploy.yml, every commit to main is deployed to staging and then to production, so when this branch is merged it will be deployed to both environments.

Ah yeah, this is how it used to work. But we changed it a couple weeks ago so that deploy only deploys main to staging by default. And to get to production you need to create a GitHub release.

@domdomegg (Member) left a review comment:

Happy to try this in staging!

@pree-dew (Contributor, Author) commented:

Super, @domdomegg. I will do a final round of testing on my end and then we can take it to staging.

@pree-dew (Contributor, Author) commented:

@domdomegg @rdimitrov

  • Did 2-3 deployments; cardinality looks under control, and the only out-of-bound labels are pod IDs.
  • Also simulated an ImagePullBackOff error, which confirms that event logs are coming through as well (a sketch of the test manifest follows). (Screenshots attached.)
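For reference, a minimal way to reproduce that test is to deploy a pod whose image cannot be pulled, so it enters ImagePullBackOff and emits events that the collector should pick up as logs. The manifest below is a hypothetical sketch, not part of this PR:

```yaml
# Hypothetical test pod: the image reference is intentionally invalid,
# so the pod stays in ImagePullBackOff and generates Kubernetes events.
apiVersion: v1
kind: Pod
metadata:
  name: imagepull-test
  namespace: default
spec:
  containers:
    - name: broken
      image: registry.invalid/does-not-exist:latest
```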

We can try this on staging

@domdomegg merged commit 9164415 into modelcontextprotocol:main on Oct 20, 2025 · 4 checks passed
@domdomegg (Member) commented:

Should get deployed to staging automatically :)

@pree-dew (Contributor, Author) commented:

Thank you @domdomegg
