
Conversation

@pree-dew (Contributor) commented Oct 9, 2025

Adds telemetry to cover failure modes that are not captured by container logs, plus metrics for detecting resource constraints.

Motivation and Context

We should be notified when there is any issue with the registry container.

How Has This Been Tested?

  • Local setup

Breaking Changes

  • No

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • No additional exporter is used; this takes advantage of the OpenTelemetry Collector. A rough sketch of the collector wiring follows this list.

  • Covers metrics related to resource constraints; currently limited to the default namespace.

  • Captures Kubernetes events as logs, which are the source for diagnosing problems with the service. This covers scenarios where a pod is not able to start and would otherwise be missed because there are no container logs in such cases. Limited to the default namespace.

  • Takes care of the DaemonSet deployment, i.e. running the OTel Collector as an agent on each node, by using the correct node filtering.

  • The only cardinality-contributing labels are pod IDs (this needs more observation); node IDs will not increase cardinality, since scaling up still results in a limited number of nodes.

  • Resource metrics are shipped every 60s; the list of metrics that will be emitted is at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/metadata.yaml

  • Container errors (screenshot attached)

  • Resource metrics (screenshot attached)
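For illustration only, here is a rough sketch of what a DaemonSet-mode collector configuration covering the points above could look like. This is not the configuration shipped in this PR; the exporter, endpoint, environment variable name, and pipeline layout are assumptions.

```yaml
# Sketch only: kubeletstats for resource metrics, k8s_events for event logs,
# both restricted to the collector's own node / the default namespace.
receivers:
  kubeletstats:
    collection_interval: 60s              # metrics shipped every 60s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"   # env var name is an assumption
    metric_groups: [node, pod, container]
  k8s_events:
    namespaces: [default]                 # events limited to the default namespace

processors:
  k8sattributes:
    filter:
      node_from_env_var: K8S_NODE_NAME    # agent mode: only watch pods on this node
  filter/default-ns:
    metrics:
      datapoint:
        - 'resource.attributes["k8s.namespace.name"] != "default"'

exporters:
  otlphttp:                               # placeholder; the actual exporter may differ
    endpoint: https://otel-backend.example.com

service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      processors: [k8sattributes, filter/default-ns]
      exporters: [otlphttp]
    logs:
      receivers: [k8s_events]
      processors: [k8sattributes]
      exporters: [otlphttp]
```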

@pree-dew (Contributor, Author) commented Oct 9, 2025

Issue #509

@pree-dew (Contributor, Author) commented Oct 9, 2025

@rdimitrov @domdomegg @tadasant Is there a possibility of running a deployment on staging for some time before pushing to production? I wanted to check the cardinality numbers for this release before it goes to production, something like the screenshot below; here the numbers are high because of the many deployments I have done, which won't be the case in production. I have tested this thoroughly, but still wanted to see if this is an option.

(Screenshot attached.)

@rdimitrov (Member) commented:


I believe what we have now is what you describe - all commits to main get deployed to staging and only the latest release gets deployed to prod. Would that work for you or perhaps you'll need something else?

@pree-dew (Contributor, Author) commented Oct 13, 2025

@rdimitrov That works, assuming the release to the production environment happens only after we verify the details above; basically, we might have to wait for another release to be deployed in the meantime.

Just being cautious here: looking at the workflow files, it seems like everything gets deployed to production. Is there anywhere I can check the latest-release part, just to make sure this approach will work?

@rdimitrov (Member) commented:

You can check the workflows for how the container images are published, but overall we use the main floating tag for staging and the latest floating tag for prod, of which:

  • The main tag follows all commits to main (screenshot attached)
  • The latest tag follows all released versions, i.e. v1.x.y, etc. (screenshot attached)

A sketch of how such floating tags are typically produced follows.
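For illustration only (this is not the actual publish workflow in this repository; the image name and action version are assumptions), floating tags like main and latest are commonly produced with docker/metadata-action:

```yaml
# Hypothetical tagging step, for illustration only:
# - type=ref,event=branch produces the floating `main` tag on branch pushes
# - type=semver produces version tags from release tags like v1.x.y
# - flavor latest=auto adds the floating `latest` tag on release builds
- name: Compute image tags
  uses: docker/metadata-action@v5
  with:
    images: ghcr.io/modelcontextprotocol/registry   # assumed image name
    tags: |
      type=ref,event=branch
      type=semver,pattern={{version}}
    flavor: |
      latest=auto
```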

@pree-dew (Contributor, Author) commented:

@rdimitrov Thank you for clearing this up 🙂

I should have clarified this earlier, my bad. I am not looking at the release process for a specific tag; I wanted to get the stats for this branch before it gets deployed to production. In deploy.yml, every commit to main is deployed to staging and then to production, so when this branch is merged it will be deployed to both environments. But to understand the cardinality (as per the screenshot above), I need to collect stats from the staging environment first, verify them, and only deploy to production if the cardinality is under control.

@domdomegg (Member) commented Oct 16, 2025

In deploy.yml, every commit to main is deployed to staging and then to production, so when this branch is merged it will be deployed to both environments.

Ah yeah, this is how it used to work. But we changed it a couple weeks ago so that deploy only deploys main to staging by default. And to get to production you need to create a GitHub release.

@domdomegg (Member) left a review comment:

Happy to try this in staging!

@pree-dew (Contributor, Author) commented:

Super, @domdomegg. I will do a final round of testing on my end and then we can take it to staging.

@pree-dew (Contributor, Author) commented:

@domdomegg @rdimitrov

  • Did 2-3 deployments; cardinality looks under control, and the only out-of-bound labels are pod IDs.
  • Also simulated an ImagePullBackOff error, which confirms that event logs are coming through as well (a sketch of the test manifest follows). (Screenshots attached.)
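For reference, a minimal way to reproduce that test is to deploy a pod whose image cannot be pulled, so it enters ImagePullBackOff and emits events that the collector should pick up as logs. The manifest below is a hypothetical sketch, not part of this PR:

```yaml
# Hypothetical test pod: the image reference is intentionally invalid,
# so the pod stays in ImagePullBackOff and generates Kubernetes events.
apiVersion: v1
kind: Pod
metadata:
  name: imagepull-test
  namespace: default
spec:
  containers:
    - name: broken
      image: registry.invalid/does-not-exist:latest
```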

We can try this on staging

@domdomegg merged commit 9164415 into modelcontextprotocol:main on Oct 20, 2025 · 4 checks passed
@domdomegg (Member) commented:

Should get deployed to staging automatically :)

@pree-dew (Contributor, Author) commented:

Thank you @domdomegg
