[USM] Fix Istio issue #43651
Draft: amitslavin wants to merge 79 commits into main from SUSM-155-fix
+37,344
−33,293
Backport 6f75013 from #40576. ___ ### What does this PR do? Adds a `conf.yaml.example` for the Versa integration ### Motivation Prepare Versa integration for public use ### Describe how you validated your changes Ran through the versa.go to ensure that the expected configuration options are exposed. ### Additional Notes Co-authored-by: Ken Schneider <[email protected]> Co-authored-by: Dustin Long <[email protected]>
…40658) This is a manual backport of #40639 to 7.71.x due to merge conflicts. ### What does this PR do? This PR fixes an issue where datadog-traceroute was potentially incorrectly handling file descriptors on linux. More details on the datadog-traceroute PR: [\[linux\] Fix linuxSink file descriptor handling](DataDog/datadog-traceroute#24) ### Motivation Errors from the current 7.71 RC ### Describe how you validated your changes This was triggered by the finalizer of `*os.File` -- the datadog traceroute library now has zero manual calls to unix.Close() on linux which was the source of this issue. It purely uses os.File's handling to close files
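For illustration, here is a minimal, Linux-only Go sketch of the ownership pattern described above: the raw fd is wrapped in an `*os.File` and is only ever closed through it, with no manual `unix.Close()` calls. `openRawSocket` is a hypothetical helper, not the datadog-traceroute API.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// openRawSocket wraps a raw socket fd in an *os.File so that closing the
// file (or, as a last resort, its finalizer) is the only thing that ever
// releases the descriptor. Mixing os.File ownership with manual
// unix.Close calls risks double-closing an fd number that the runtime
// has since reused.
func openRawSocket() (*os.File, error) {
	fd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM|unix.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, fmt.Errorf("socket: %w", err)
	}
	// From here on, the *os.File owns fd; never call unix.Close(fd) again.
	return os.NewFile(uintptr(fd), "udp-raw"), nil
}

func main() {
	f, err := openRawSocket()
	if err != nil {
		panic(err)
	}
	defer f.Close() // single point of closure
	fmt.Println("fd:", f.Fd())
}
```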
### Motivation This fixes issues with S3 cache by fetching the target branch when doing `git merge-base` operations. ### Describe how you validated your changes ### Additional Notes #incident-42773
Backport ff4c8bb from #40623. ___ This PR fixes a bug found during QA where leftover config/package experiments could be picked up during unrelated package/config experiments. This should ensure a config experiment always uses the stable package and a package experiment always uses the stable config. Co-authored-by: Arthur Bellal <[email protected]>
Backport 6936e9f from #40607. ___ These tests were poorly conceived. I wanted to make sure that the tests still ran with a low depth limit, but I wasn't validating much. The way to get the test to avoid having expectations was to tell it it was in rewrite mode. The downside of being in rewrite mode is we do not know how many events to expect, so we just wait a while. This turns out to be problematic and can cause flakes in production. We do exercise a bunch of depth limits explicitly. For now, just remove this bad subtest. Co-authored-by: ajwerner <[email protected]> Co-authored-by: piob-io <[email protected]>
Backport a8014fd from #40680.

### What does this PR do?
Following #40345, the scrubber was scrubbing debugging information; this PR solves that issue.

### Motivation
Avoid decreasing support efficiency by not being able to read the configuration of the secrets feature:

```
root@datadog-qn879:/tmp# ./agent config | grep secret_
secret_name: "********"
secret_audit_file_max_size: "********"
secret_backend_arguments: "********"
secret_backend_command: "********"
secret_backend_command_allow_group_exec_perm: "********"
secret_backend_config: "********"
secret_backend_output_max_size: "********"
secret_backend_remove_trailing_line_break: "********"
secret_backend_skip_checks: "********"
secret_backend_timeout: "********"
secret_backend_type: "********"
secret_image_to_secret: "********"
secret_kubernetes: "********"
secret_refresh_interval: "********"
secret_refresh_scatter: "********"
```

### Describe how you validated your changes
CI

Co-authored-by: louis-cqrl <[email protected]>
…ronously (#40687)

### What does this PR do?
This fixes a missing nil check by pulling in the datadog-traceroute PR [\[filters\] Check for nil in SetBPFAndDrain](DataDog/datadog-traceroute#28). Normally this call almost always returns an error, but occasionally `MSG_DONTWAIT` can finish synchronously.

### Motivation
Rarely we get this error running netpath at scale in staging:

```
SYS-PROBE | ERROR | (cmd/system-probe/modules/traceroute.go:72 in func1) | unable to run traceroute for host: 10.128.11.211: UDP traceroute failed to set packet filter: SetPacketFilter failed to apply BPF filter: SetBPFAndDrain failed to drain: %!w(<nil>)
```

Apparently this syscall can occasionally finish synchronously, which results in a nil error. I think it happens on hosts with barely any network traffic.

### Describe how you validated your changes
In datadog-traceroute, I ran the packet filtering suite 100 times:

```
sudo env PATH="$PATH" go test -v -tags linux,test,root github.com/DataDog/datadog-traceroute/packets -count 100
```

(I had to make a few extra changes to get the root-privileged test suite compiling: [\[packets\] Use os.File to close all fds](DataDog/datadog-traceroute#25))

Co-authored-by: sabrina lu <[email protected]>
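A minimal Go sketch of the nil-check pattern the fix relies on (the real change lives in DataDog/datadog-traceroute#28; `drainSocket` is a hypothetical name):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// drainSocket reads pending packets off a non-blocking socket until the
// kernel reports there is nothing left. Recvfrom with MSG_DONTWAIT usually
// ends with EAGAIN/EWOULDBLOCK once the queue is empty, but it can also
// complete synchronously with a nil error, so only a genuinely unexpected
// error is wrapped; wrapping nil is what produced the "%!w(<nil>)" message.
func drainSocket(fd int) error {
	buf := make([]byte, 4096)
	for {
		_, _, err := unix.Recvfrom(fd, buf, unix.MSG_DONTWAIT)
		switch {
		case err == nil:
			continue // the recv finished synchronously; keep draining
		case err == unix.EAGAIN || err == unix.EWOULDBLOCK:
			return nil // nothing left to drain
		default:
			return fmt.Errorf("failed to drain: %w", err)
		}
	}
}

func main() {
	fmt.Println(drainSocket(-1)) // -1 is an invalid fd, so this prints a wrapped EBADF
}
```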
Backport c86042b from #40650. ___ ### What does this PR do? Adds the Versa integration to the build tasks as a core check so the `conf.yaml.example` is included in the build. ### Motivation ### Describe how you validated your changes ### Additional Notes Co-authored-by: Ken Schneider <[email protected]>
…gths (#40775) Backport 4b43b06 from #40765. ___ This fixes a bug that occurs when the compressed length is longer than the uncompressed one. Adds a CLI for running irgen. Co-authored-by: Piotr Bejda <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
…es (#40788) Backport d4093ba from #40779. ___ #### dyninst/rcscrape: mark if a data item had a failed read We've seen data items missing related to a runtime ID. Perhaps we failed to read those memory addresses. We'd like to know about that. Note that this same treatment should get applied to regular decoding (perhaps in a more principled way). That's left for a different change that won't be backported. #### dyninst/rcscrape: gracefully handle decoding failures If we can't decode a message, we don't want to shut down the entire dyninst subsystem. The fact that that's what happens when decoding fails is not great, but it's not for this change we intend to backport. There's an upcoming refactor to address that. Fixes [DEBUG-4455](https://datadoghq.atlassian.net/browse/DEBUG-4455). [DEBUG-4455]: https://datadoghq.atlassian.net/browse/DEBUG-4455?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: ajwerner <[email protected]>
…tion/successful_async_initialization` (#40799) Backport e0d98b0 from #40642.

### What does this PR do?

#### The Test Bug
`TestAsyncInitialization/successful_async_initialization` was flaking due to a race-condition deadlock. This PR fixes the test to un-flake it. This is not a bug in the ImageResolver feature, but a bug specifically in the way the test is configured. Prior to this change, the following race condition would cause a deadlock:

1. `mockClient := newMockRCClient(...)` -> `blockGetConfigs`: `false` and `configsReady`: `make(chan struct{})`
2. `mockClient.setBlocking(true)` -> mutex lock **HOLD**, `blockGetConfigs`: `true`, mutex lock **RELEASE**
3. `resolver := newRemoteConfigImageResolverWithRetryConfig(...)` - `go func() waitForInitialConfig()` -> `rcClient.GetConfigs(...)` -> mutex lock **HOLD** until `configsReady` closed
4. `mockClient.setBlocking(false)` -> mutex lock **HOLD** ... ❌ cannot close `configsReady` because the mutex is held by the goroutine.

If step 3 (the goroutine) gets the mutex lock first, this creates a deadlock. If step 4 (the test) gets the mutex lock first, it successfully closes the `configsReady` channel, releases the mutex, and then the goroutine holds the mutex, does its work, and releases it.

#### The Fix
The fix is to explicitly release the mutex in `GetConfig(...)` after first storing the state of `m.blockGetConfigs` and `m.configsReady` in local variables, and then block `GetConfig(...)` on those local vars. This way, regardless of which step obtains the mutex first, it is released properly.

### Motivation
Un-flake the `TestAsyncInitialization/successful_async_initialization` test for the ImageResolver to avoid blocking CI. See an example failure [here](https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1113117038).

### Describe how you validated your changes
Reproduced the CI issue locally with artificial delays. The above fix prevents this race condition consistently.

Co-authored-by: erikayasuda <[email protected]>
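A minimal Go sketch of the fixed locking pattern, using hypothetical names loosely based on the description above:

```go
package main

import (
	"fmt"
	"sync"
)

// mockRCClient mirrors the shape of the test mock described above.
// The fix: GetConfigs copies the guarded state into locals and releases the
// mutex before blocking, so setBlocking(false) can always take the lock and
// close configsReady, whichever goroutine wins the race.
type mockRCClient struct {
	mu              sync.Mutex
	blockGetConfigs bool
	configsReady    chan struct{}
}

func (m *mockRCClient) setBlocking(block bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.blockGetConfigs = block
	if !block {
		close(m.configsReady) // safe: GetConfigs never waits while holding the lock
	}
}

func (m *mockRCClient) GetConfigs() {
	m.mu.Lock()
	block, ready := m.blockGetConfigs, m.configsReady
	m.mu.Unlock() // released before blocking, which is the whole fix

	if block {
		<-ready
	}
}

func main() {
	m := &mockRCClient{configsReady: make(chan struct{})}
	m.setBlocking(true)

	done := make(chan struct{})
	go func() { m.GetConfigs(); close(done) }()

	m.setBlocking(false) // can always acquire the lock now
	<-done
	fmt.Println("no deadlock")
}
```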
…eResolver.Resolve()` (#40785) Backport 730bb5d from #40697.

### What does this PR do?
Adds `image_resolution_attempts` telemetry to the newly added `ImageResolver.Resolve()` function to track image resolution attempts, and whether they resulted in a successful image digest resolution, or if they defaulted to using the mutable tag.

### Motivation
This telemetry is necessary for the new K8s SSI gradual rollout feature in order to determine whether a rollout is working successfully or not. The goal is for this to be easier to display on a dashboard:

| `registry` | `repository` | `digest_resolution` | `outcome` | Notes |
|----------|----------|----------|----------|----------|
| `gcr.io/datadoghq` | `dd-lib-python-init` | `enabled` | `sha256:abc123` | ✅ Was supposed to resolve, did resolve |
| `hub.docker.com/r/datadog` | `dd-lib-java-init` | `enabled` | `v2` | ❌ Was supposed to resolve, but it did NOT resolve |
| `mycustomregistry.org` | `dd-lib-php-init` | `enabled` | `v2` | ✅ Cannot resolve for custom registry, did NOT resolve |
| `gallery.ecr.aws/datadog` | `dd-lib-rb-init` | `disabled` | `v1` | ✅ Was NOT supposed to resolve, did NOT resolve |
| `gcr.io/datadoghq` | `dd-lib-dotnet-init` | `disabled` | `sha256:abc123` | ❌ Was NOT supposed to resolve, did resolve (this shouldn't be possible) |

### Describe how you validated your changes
Ran a local app via `injector-dev` to verify that the telemetry counts for `apm-inject` and `dd-lib-python-init` were accurate.

Co-authored-by: erikayasuda <[email protected]>
Co-authored-by: sabrina lu <[email protected]>
Backport d3ce06c from #40789. ___ ### What does this PR do? This commit fixes an issue where we only resolve image tags during startup and not on every pod mutation by ensuring the resolution happens just before use. ### Motivation We've added gradual rollout support for Single Step Instrumentation so that language libraries are released in a gradual fashion in #39915. This was missed during code review and caught during testing. ### Describe how you validated your changes I tested this using [injector-dev](https://github.com/DataDog/injector-dev): ``` injector-dev apply -f dev.yaml --profile staging --build ``` <details> <summary>dev.yaml</summary> ```yaml helm: apps: - name: gradual-rollout-test namespace: application values: env: - name: DD_TRACE_DEBUG value: "true" - name: DD_APM_INSTRUMENTATION_DEBUG value: "true" image: repository: registry.ddbuild.io/ci/injector-dev/python tag: 2cd78ded podLabels: language: python tags.datadoghq.com/env: local service: port: "8080" versions: agent: 7.69.1 cluster_agent: version: 7.69.1 build: {} injector: version: 0.44.0 config: clusterAgent: env: - name: DD_REMOTE_CONFIGURATION_ENABLED value: "true" datadog: site: "datad0g.com" apm: instrumentation: enabled: true targets: - name: python podSelector: matchLabels: language: python ddTraceVersions: python: default ``` </details> ### Additional Notes We will need this backported to `7.71.x` Co-authored-by: Mark Spicer <[email protected]> Co-authored-by: adel121 <[email protected]>
…history from status and allow directional-only fallback (#40823) Backport f65e87a from #40542 ___ ### What does this PR do? * Load recommendation history from new `LastRecommendations` field in CR * Allow local fallback to only be applied if it is for upscale / downscale / for both ### Motivation Loading recommendation history allows us to more accurately determine the stabilized recommendation Changes to fallback allow us to shorten the time to enable fallback by providing users the flexibility to set how they'd like fallback to be activated ### Describe how you validated your changes 1. Deploy these changes; set up DPA CR to set fallback direction 2. Verify that fallback is only enabled when the scaling direction matches the enabled direction 3. Restart the cluster agent - verify that DPAs in store is populated with data from the CR for recommendation history ### Additional Notes
… checks (#40767) Backport f10a8b6 from #40532. ___ ### What does this PR do? Stop using GET /probe endpoints to perform connectivity checks. Replaced by POST with empty payloads or removed. ### Motivation Support question from a concerned customer. In the connectivity check, we use GET /probe endpoints. These endpoints are exposed by synthetics, which is available depending on the customer org's setup, and cause 403s when the endpoint is not available. ### Describe how you validated your changes Manual QA ### Additional Notes Co-authored-by: san-jos <[email protected]>
…40832) Backport 4dc700f from #40827. ___ Before this change, we wouldn't send fresh diagnostics for updated probes, making it seem like we haven't installed them. Fixes https://datadoghq.atlassian.net/browse/DEBUG-4467 Co-authored-by: ajwerner <[email protected]>
…tchers for api/app keys & common HTTP auth headers (#40837) Backport 2643133 from #40774.

### What does this PR do?
- **Normalizes YAML keys to lowercase before matching** so scrubbing is case-insensitive across config variants. In `ScrubDataObj`, keys are lowercased for the `YAMLKeyRegex` check.
- **Expands key match coverage**:
  - `api_key` → `api[-_]?key`
  - `ap(?:p|plication)_?key` → `ap(?:p|plication)[-_]?key`
- Adds support for **HTTP header-style fields** via `matchYAMLKeyPrefixSuffix("x-","key|token|auth", …)` and explicit lists:
  - `x-api-key`, `x-rapidapi-key`, `x-functions-key`, `x-octopus-apikey`, `x-dreamfactory-api-key`, `x-lz-api-key`, `x-pm-partner-key`, `x-sungard-idp-api-key`, `x-vtex-api-appkey`
  - `x-auth-token`, `x-rundeck-auth-token`
  - `x-auth`, `x-stratum-auth`
- Adds **exact key matches** for common auth fields: `auth-tenantid`, `authority`, `cainzapp-api-key`, `cms-svc-api-key`, `lodauth`, `sec-websocket-key`, `statuskey`
- **Regex robustness**:
  - Allow hyphens in YAML key detectors (`(\w|_|-)`) in `matchYAMLKeyPart` and `matchYAMLKeyEnding`.
  - New helper: `matchYAMLKeyPrefixSuffix`.
- **Version metadata**: Sets `LastUpdated` to `7.70.2` for updated replacers.
- **Tests**:
  - `pkg/util/scrubber/default_test.go`: new `TestNewHTTPHeaderAndExactKeys`.
  - `pkg/util/scrubber/yaml_scrubber_test.go`: comprehensive table-driven cases for case/format variants.

### Motivation
Configurations and headers vary widely in **case** (`APIKEY`, `Api-Key`) and **separators** (`_` vs `-`). Prior scrubbing missed many real-world keys, potentially leaking secrets in logs/support bundles. Lowercasing keys for matching + expanding patterns closes these gaps without over-scrubbing generic keys.

### Describe how you validated your changes
- **Unit tests** (new & updated) cover:
  - Case variants: `APIKEY`, `Api_key`, `api-key`, `apikey`
  - HTTP headers with `x-` prefix and `key`/`token`/`auth` suffixes
  - Exact-match auth fields
  - Non-matching benign keys remain untouched

Co-authored-by: louis-cqrl <[email protected]>
Co-authored-by: sabrina lu <[email protected]>
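A simplified Go sketch of the case-insensitive matching idea (the patterns below are illustrative and combined for brevity, not the exact replacers added by the PR):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Keys are lowercased once and then matched against lowercase-only
// patterns, so APIKEY, Api-Key and api_key are all scrubbed the same way.
var (
	apiKeyRe  = regexp.MustCompile(`^ap(?:i|p|plication)[-_]?key$`)
	headerRe  = regexp.MustCompile(`^x-[\w-]*(key|token|auth)$`)
	exactKeys = map[string]struct{}{
		"authority":         {},
		"sec-websocket-key": {},
	}
)

func shouldScrub(key string) bool {
	k := strings.ToLower(key)
	if _, ok := exactKeys[k]; ok {
		return true
	}
	return apiKeyRe.MatchString(k) || headerRe.MatchString(k)
}

func main() {
	for _, k := range []string{"APIKEY", "Api-Key", "app_key", "X-Auth-Token", "x-rapidapi-key", "hostname"} {
		fmt.Printf("%-16s scrub=%v\n", k, shouldScrub(k))
	}
}
```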
Backport 909cc7d from #40565. ___ ### What does this PR do? Simply adds minified version of several javascript files and replaces originals in deliverables to optimize their size. Co-authored-by: Joseph Gette <[email protected]>
…name as key (#40871) Backport 191acf6 from #40762. ___ ### What does this PR do? Restructures the `ImageResolver` cache structure so that the keys are only the repository names (ex - `dd-lib-python-init`) and not the repository URL (ex - `gcr.io/dd-lib-python-init`). It also enables us to configure the default Datadog container registries via configuration. This is not public-facing, and will be used primarily for dogfooding the new feature on staging (given staging uses different container registries). ### Motivation The repository URL field in the `K8S_INJECTION_DD` remote config data was not designed to be consumed by DCA, but primarily for use by our internal `equilibrium` tooling. This meant that by using the repository URL for the key, it would be limited to only allowing gradual rollout for customers using `gcr.io/datadoghq` as their registry. See [here](https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/single-step-apm/kubernetes/?tab=agentv764recommended#change-the-default-image-registry) for other valid Datadog registries. ### Describe how you validated your changes - Updated existing tests - Added new unit tests - Local E2E testing with `injector-dev` ### Additional Notes Co-authored-by: erikayasuda <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
Backport 35654a3 from #40756. ___ Co-authored-by: Florent Clarret <[email protected]>
Backport 6b6ae07 from #40884. ___ Co-authored-by: Florent Clarret <[email protected]>
…imit (#40876) Backport e693ea6 from #40860. ___ ### What does this PR do? Before this change, we'd flush *after* adding a message that puts the batch over the limit. Now we'll flush the current buffer before exceeding the limit. Fixes https://datadoghq.atlassian.net/browse/DEBUG-4480 ### Motivation We've been seeing 413 errors in staging. ### Describe how you validated your changes Added testing. Co-authored-by: ajwerner <[email protected]>
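A small Go sketch of the "flush before exceeding the limit" ordering; `batcher` and its fields are hypothetical, not the actual uploader types:

```go
package main

import "fmt"

// batcher accumulates payloads and hands them to flush in chunks that stay
// under maxBatchBytes. The important detail is the ordering in add: the
// buffer is flushed *before* a payload that would exceed the limit is
// appended, so no single upload ever goes over the limit (which is what
// was producing 413 responses).
type batcher struct {
	buf           [][]byte
	size          int
	maxBatchBytes int
	flush         func(batch [][]byte)
}

func (b *batcher) add(msg []byte) {
	if b.size > 0 && b.size+len(msg) > b.maxBatchBytes {
		b.flush(b.buf)
		b.buf, b.size = nil, 0
	}
	b.buf = append(b.buf, msg)
	b.size += len(msg)
}

func main() {
	b := &batcher{
		maxBatchBytes: 10,
		flush:         func(batch [][]byte) { fmt.Println("flush", len(batch), "messages") },
	}
	for _, m := range []string{"aaaa", "bbbb", "cccc", "dddd"} {
		b.add([]byte(m))
	}
	b.flush(b.buf) // final flush of whatever is left
}
```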
…I are properly sorted (#40895) Backport 7be3f24 from #40889. ___ ### What does this PR do? Ensures that the devices returned by the PodResources API are properly sorted before being returned. In some cases we have seen k8s returning them in a different order than the one they are seen in on the system. ### Motivation Ensure correct attribution of devices. https://datadoghq.atlassian.net/browse/EBPF-813 ### Describe how you validated your changes Added unit tests. ### Additional Notes Co-authored-by: Guillermo Julián <[email protected]>
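A minimal Go sketch of the idea: sort the device IDs reported by the kubelet into a deterministic order before attributing them (`sortDeviceIDs` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"sort"
)

// sortDeviceIDs returns a copy of the device IDs reported by the
// PodResources API in a stable, sorted order, so that attribution does not
// depend on the order in which the kubelet happens to return them.
func sortDeviceIDs(ids []string) []string {
	out := append([]string(nil), ids...)
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(sortDeviceIDs([]string{"GPU-2", "GPU-0", "GPU-1"}))
}
```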
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
Backport 7bd16cd from #40924. ___ ### What does this PR do? Fixes #39647 by closing the root after using it. ### Motivation ### Describe how you validated your changes ### Additional Notes Co-authored-by: Paul Cacheux <[email protected]>
…by main consumer loop (#41226) Backport 74a7786 from #41193. ___ ### What does this PR do? Changes the processing of process exit events so that they happen in the main consumer loop. ### Motivation Avoid race conditions by ensuring process exits are handled in the same goroutine, no matter whether they come from process scans or from the process monitor. ### Describe how you validated your changes Added unit tests to ensure process exit is properly handled. ### Additional Notes Co-authored-by: Guillermo Julián <[email protected]>
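A minimal Go sketch of the single-consumer pattern: every exit event, whether it comes from a process scan or the process monitor, is funneled through one channel and handled by one goroutine (names are illustrative):

```go
package main

import "fmt"

// runConsumer is the only goroutine that mutates the process table, so no
// locking is needed and scans cannot race with the process monitor.
func runConsumer(exits <-chan uint32, done chan<- struct{}) {
	alive := map[uint32]struct{}{1234: {}, 5678: {}}
	for pid := range exits {
		delete(alive, pid) // single-writer state mutation
		fmt.Println("process exited:", pid, "remaining:", len(alive))
	}
	close(done)
}

func main() {
	exits := make(chan uint32)
	done := make(chan struct{})
	go runConsumer(exits, done)

	// Producers (scanner, monitor) only send; they never touch the map.
	exits <- 1234
	exits <- 5678
	close(exits)
	<-done
}
```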
The kubelet entity is currently only used to generate a tag on kubernetes check metrics, it has no tags of its own. (cherry picked from commit 34f0e21) ### What does this PR do? Fixes the error logs that occur when trying to build the kubelet's tagger entity ID ### Motivation Reduce error logs ### Describe how you validated your changes 1. Deploy RC 7 and run `agent check kubelet` to kickoff the workloadmeta kubelet collector ``` root@justin-lesko-minikube:/# agent check kubelet | grep "kubelet-id" 2025-09-23 19:05:49 UTC | CORE | ERROR | (comp/core/tagger/common/entity_id_builder.go:35 in BuildTaggerEntityID) | can't recognize entity "kubelet-id" with kind "kubelet"; trying kubelet-id://kubelet as tagger entity 2025-09-23 19:05:49 UTC | CORE | ERROR | (comp/core/tagger/collectors/workloadmeta_extract.go:161 in processEvents) | cannot handle event for entity "kubelet-id" with kind "kubelet" ``` 2. Deploy this branch and observe no more errors ``` root@justin-lesko-minikube:/# agent check kubelet | grep "kubelet-id" root@justin-lesko-minikube:/# ``` ### Additional Notes
…r older CRs (#41234) Backport 0b9ffe1 from #41212. ___ ### What does this PR do? Fixes a case where the CR does not contain any value, which prevented any fallback from being applied. Fixes the feature introduced in #40542 ### Motivation Bugfix ### Describe how you validated your changes Use a 7.71+ Agent version with outdated CRs or CRD; local fallback should still happen instead of being refused due to the scaling direction being disabled. ### Additional Notes Cluster Agent impact only Co-authored-by: Vincent Boulineau <[email protected]>
…d for psycopg #41003 (#41238) ### What does this PR do? ### Motivation ### Describe how you validated your changes ### Additional Notes --------- Co-authored-by: Florent Clarret <[email protected]> Co-authored-by: sabrina-datadog <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
…ClosedProcesses flaky test (#41310) Backport ec77d23 from #41302. ___ ### What does this PR do? Fixes a low frequency flaky test. The way the test was written, the `procRoot` with the fake data was not being actually used to check for the closed processes (it was being passed to the context but not the consumer), and instead it was using the real `/proc` path. This made the test work most of the time, except when we had a real process with the PID of our fake process. ### Motivation Eliminate flaky tests. ### Describe how you validated your changes Fixed the unit test. ### Additional Notes Co-authored-by: Guillermo Julián <[email protected]>
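A small, self-contained Go sketch of the testing pattern: the proc root is an explicit parameter, so the test exercises a temporary directory rather than the real `/proc` (`listPIDs` is a hypothetical helper, not the actual consumer API):

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
	"testing"
)

// listPIDs reads the numeric entries of the given proc root; the important
// part is that the root is an explicit parameter, so a test can point it at
// a temporary directory instead of the real /proc.
func listPIDs(procRoot string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		if pid, err := strconv.Atoi(e.Name()); err == nil && e.IsDir() {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func TestListPIDsUsesFakeProcRoot(t *testing.T) {
	procRoot := t.TempDir()
	// A fake PID directory; whether a real process with this PID exists is
	// irrelevant because the real /proc is never consulted.
	if err := os.Mkdir(filepath.Join(procRoot, "4242"), 0o755); err != nil {
		t.Fatal(err)
	}
	pids, err := listPIDs(procRoot)
	if err != nil || len(pids) != 1 || pids[0] != 4242 {
		t.Fatalf("got %v, %v", pids, err)
	}
}
```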
…re release (#41314) ### What does this PR do? Backports #41196 to 7.71.x ### Motivation Making sure tests don't break on 7.71.x when a new install script gets released ### Describe how you validated your changes If E2E tests pass we're good! ### Additional Notes --------- Co-authored-by: Arthur Bellal <[email protected]>
Backport c46abad from #40181.

### What does this PR do?
> [!NOTE]
> Buildimages are also bumped in this PR.

This migrates Gitlab PATs; we now use a custom API to generate short-lived Gitlab PATs.

### Motivation

### Describe how you validated your changes

### Possible Drawbacks / Trade-offs

### Additional Notes

Co-authored-by: Célian Raimbault <[email protected]>
### What does this PR do? fix release notes ### Motivation ### Describe how you validated your changes ### Additional Notes
…onfig stream snapshot creation (#41289) Backport e4fcdfc from #41279.

### What does this PR do?
Updates the config stream to handle nested keys in the configuration that are of type `map[interface{}]interface{}`.

### Motivation
Previously, we were only converting the top level to `map[string]interface{}`.

### Describe how you validated your changes
1. Create a custom image and deploy it on an experimental cluster (sasquatch) with ADP enabled.
2. Run `agent-data-plane config` and see the config.
3. Update a setting (ex: `agent config set dogstatd_stats true`).
4. Run `agent-data-plane config` and see the updated config value.

Co-authored-by: Raymond Zhao <[email protected]>
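A minimal Go sketch of the recursive conversion described above (`normalize` is a hypothetical helper, not the agent-data-plane implementation):

```go
package main

import "fmt"

// normalize recursively converts map[interface{}]interface{} values (what
// YAML unmarshalling produces) into map[string]interface{} at every level,
// not just the top one, so the whole tree can be handled uniformly.
func normalize(v interface{}) interface{} {
	switch val := v.(type) {
	case map[interface{}]interface{}:
		out := make(map[string]interface{}, len(val))
		for k, inner := range val {
			out[fmt.Sprint(k)] = normalize(inner)
		}
		return out
	case map[string]interface{}:
		for k, inner := range val {
			val[k] = normalize(inner)
		}
		return val
	case []interface{}:
		for i, inner := range val {
			val[i] = normalize(inner)
		}
		return val
	default:
		return v
	}
}

func main() {
	cfg := map[interface{}]interface{}{
		"logs_config": map[interface{}]interface{}{"enabled": true},
	}
	fmt.Printf("%#v\n", normalize(cfg))
}
```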
Backport a4e26e7 from #41166.

### What does this PR do?
Don't pass a context when launching a detached process.

### Motivation
https://datadoghq.atlassian.net/browse/WINA-1666
https://datadoghq.atlassian.net/browse/WINA-1707

Fix a bug that can cause fleet upgrades to fail.

### Describe how you validated your changes
Existing E2E tests fail due to this issue once every few days.

Manual test: added a sleep after `i.stop()` to give the background terminate process time to finish.

### Additional Notes
Before returning from main, `hookCommand` defers a call to `i.stop(err)`:
https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/hooks.go#L37-L38

`i.stop` calls `c.stopsighandler`:
https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/command.go#L73-L80

which is the context cancel func:
https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/command.go#L46-L55

Since this context was passed to `exec.CommandContext` when launching the detached `postStartExperimentBackground` subprocess, once `ctx.Done()` fires a kill signal is sent to the subprocess.

Co-authored-by: Branden Clark <[email protected]>
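A minimal Go sketch of the resulting pattern: the detached helper is started with `exec.Command`, not `exec.CommandContext`, so cancelling the parent's context cannot kill it (`launchDetached` is a hypothetical name):

```go
package main

import "os/exec"

// launchDetached starts a helper process that must outlive the parent.
// It deliberately uses exec.Command rather than exec.CommandContext: the
// parent's context is cancelled by the signal handler during shutdown, and
// CommandContext kills its child once ctx.Done() fires, which is exactly
// the failure mode described above.
func launchDetached(bin string, args ...string) error {
	cmd := exec.Command(bin, args...) // no context: the child is not tied to the parent's lifetime
	return cmd.Start()
}

func main() {
	_ = launchDetached("sleep", "60")
}
```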
…hes do not match (#41324) Backport 8fb9d9b from #41301. ___ ### What does this PR do? This PR fixes the behaviour of the Cluster Agent when the `.Spec` retrieved _after_ a DPA has been created/updated contains values not originating from the Cluster Agent itself. The discrepancy can come from different sources (admission controller, CRD defaulting, etc.) which are not necessarily expected but should not break the overall feature either. ### Motivation Fixing inactive remote Autoscalers when the Remote Spec hash is different from the actual object Spec hash after update. ### Describe how you validated your changes Change validated in a testing Kubernetes cluster. It can be reproduced by creating a remote autoscaler with a recent version of the Cluster Agent/CRD that has defaulting for local fallback values. Added a unit test to cover this specific case. ### Additional Notes The current solution is not optimal and we cannot cope with differences introduced manually when the Cluster Agent leader was not available at the time of the update. Co-authored-by: Vincent Boulineau <[email protected]>
… sizes (#41359) Backport 5d7eaae from #41293. ___ ### What does this PR do? ### Motivation https://datadoghq.atlassian.net/browse/VULN-12554 https://nvd.nist.gov/vuln/detail/CVE-2025-8194 https://gist.github.com/sethmlarson/1716ac5b82b73dbcbf23ad2eff8b33e1 ### Describe how you validated your changes ### Additional Notes Co-authored-by: Kyle Neale <[email protected]>
…-definitions to 2a6d59a9b3f3a7a6c91630515ad6ee659256b9a2 (#41437) This PR was automatically created by the test-infra-definitions bump task. This PR bumps the test-infra-definitions submodule to 2a6d59a9b3f3a7a6c91630515ad6ee659256b9a2 from 1faea1273955. Here is the full changelog between the two commits: DataDog/test-infra-definitions@1faea12...2a6d59a

⚠️ This PR is opened with the `qa/no-code-change` and `changelog/no-changelog` labels by default. Please make sure this is appropriate.

### What does this PR do?
### Motivation
### Describe how you validated your changes
### Additional Notes

Co-authored-by: agent-platform-auto-pr[bot] <153269286+agent-platform-auto-pr[bot]@users.noreply.github.com>
Co-authored-by: Célian Raimbault <[email protected]>
Backport 02bfb4b from #41435. ___ ### What does this PR do? Unpins the install script in the test where it is pinned ### Motivation Test is currently failing because the pin is >5 versions old ### Describe how you validated your changes E2E ### Additional Notes Co-authored-by: Baptiste Foy <[email protected]>
Fixes the generic container corecheck processor to receive the parsed value for ExtendedMemory collection.
Addresses a bug with the parsing and configuration of extended memory metric collection in the containers processor.
Deploy the Agent locally with the container check enabled and extended memory metric collection enabled:
```
datadog:
confd:
container.yaml: |-
ad_identifiers:
- _container
init_config:
instances:
- extended_memory_metrics: true
```
Run the container corecheck and expect extended memory metrics like
`container.memory.active_file` to be outputted.
```
k exec -it datadog-agent-linux-bj667 -n datadog-agent -- agent check container | grep container.memory.active_file
Defaulted container "agent" out of: agent, trace-agent, process-agent, init-volume (init), init-config (init)
"metric": "container.memory.active_file",
"metric": "container.memory.active_file",
"metric": "container.memory.active_file",
"metric": "container.memory.active_file",
```
Follow-up e2e tests can be added to detect the presence of extended memory metrics.
(cherry picked from commit 8f75bfd)
### What does this PR do?
### Motivation
### Describe how you validated your changes
### Additional Notes
Backport 91376d3 from #41428. ___ This PR fixes a bug preventing the update of configurations through Fleet on windows. With the recent switch to modifying the user configuration dir directly, we failed to account for the fact that the "configuration" directory on windows also contains all the runtime files (python cache, installer state, etc...). This broke the current implementation of the experiment. This PR fixes the issue by disabling config experiments on windows. ### QA This change was QA'd manually. Co-authored-by: Arthur Bellal <[email protected]>
…only (#41508) Backport 46246af from #41506. ___ ### What does this PR do? Skips cgroup tests for `pkg/gpu` when the cgroupfs is not writable. Additionally, it marks the oracle job as allowed to fail, as it started to fail at the same point as the pkg/gpu tests. ### Motivation Fix for #incident-43807 ### Describe how you validated your changes CI green. These tests also execute in KMT, where we have full access to the cgroup, so we do not lose coverage. ### Additional Notes Co-authored-by: Guillermo Julián <[email protected]>
… generation (#41509) Backport d950e76 from #41507. ___ ### What does this PR do? Don't use short lived tokens in the CI. ### Motivation ### Describe how you validated your changes ### Additional Notes Co-authored-by: Célian Raimbault <[email protected]>
Static quality checks ✅: Please find below the results from static quality gates. Successful checks.
Labels: changelog/no-changelog, qa/done (QA done before merge and regressions are covered by tests), team/usm (The USM team)
What does this PR do?
Fix issue for customer - WIP
Motivation
Describe how you validated your changes
Additional Notes