Conversation

@amitslavin
Contributor

What does this PR do?

Fix issue for customer - wip

Motivation

Describe how you validated your changes

Additional Notes

FlorentClarret and others added 30 commits September 5, 2025 09:14
Backport 6f75013 from #40576.

___

### What does this PR do?
Adds a `conf.yaml.example` for the Versa integration

### Motivation
Prepare Versa integration for public use

### Describe how you validated your changes
Ran through the versa.go to ensure that the expected configuration
options are exposed.

### Additional Notes

Co-authored-by: Ken Schneider <[email protected]>
Co-authored-by: Dustin Long <[email protected]>
…40658)

This is a manual backport of #40639 to 7.71.x due to merge conflicts.

### What does this PR do?
This PR fixes an issue where datadog-traceroute was potentially
incorrectly handling file descriptors on linux. More details on the
datadog-traceroute PR: [\[linux\] Fix linuxSink file descriptor
handling](DataDog/datadog-traceroute#24)

### Motivation
Errors from the current 7.71 RC

### Describe how you validated your changes
This was triggered by the finalizer of `*os.File`. The datadog-traceroute
library now makes zero manual calls to `unix.Close()` on Linux, which was
the source of this issue. It relies purely on `os.File` to close files.
### Motivation

This fixes issues with S3 cache by fetching the target branch when doing
`git merge-base` operations.

### Describe how you validated your changes

### Additional Notes

#incident-42773
Backport ff4c8bb from #40623.

___

This PR fixes a bug found during QA where leftover config/package
experiments could be picked up during unrelated package/config
experiments.

This should ensure a config experiment always uses the stable package
and a package experiment always uses the stable config.

Co-authored-by: Arthur Bellal <[email protected]>
Backport 6936e9f from #40607.

___

These tests were poorly conceived. I wanted to make sure that the tests
still ran with a low depth limit, but I wasn't validating much. The
way to get the test to avoid having expectations was to tell it it was
in rewrite mode. The downside of being in rewrite mode is we do not know
how many events to expect, so we just wait a while. This turns out to be
problematic and can cause flakes in production. We do exercise a bunch
of depth limits explicitly. For now, just remove this bad subtest.

Co-authored-by: ajwerner <[email protected]>
Co-authored-by: piob-io <[email protected]>
Backport a8014fd from #40680.

___

### What does this PR do?

Following #40345, the
scrubber was scrubbing debugging information, this PR aims to solve this
issue

### Motivation

Avoid decreasing support efficiency by not being able to read the
configuration for the secret feature:
```
root@datadog-qn879:/tmp# ./agent config | grep secret_
    secret_name: "********"
secret_audit_file_max_size: "********"
secret_backend_arguments: "********"
secret_backend_command: "********"
secret_backend_command_allow_group_exec_perm: "********"
secret_backend_config: "********"
secret_backend_output_max_size: "********"
secret_backend_remove_trailing_line_break: "********"
secret_backend_skip_checks: "********"
secret_backend_timeout: "********"
secret_backend_type: "********"
secret_image_to_secret: "********"
secret_kubernetes: "********"
secret_refresh_interval: "********"
secret_refresh_scatter: "********"
```

### Describe how you validated your changes

CI

### Additional Notes

Co-authored-by: louis-cqrl <[email protected]>
…ronously (#40687)

### What does this PR do?
This fixes a missing nil check by pulling in the datadog-traceroute PR
[\[filters\] Check for nil in
SetBPFAndDrain](DataDog/datadog-traceroute#28).
Normally this almost always has an error, but occasionally
`MSG_DONTWAIT` can finish synchronously.
### Motivation
Rarely we get this error running netpath at scale in staging:
```
SYS-PROBE | ERROR | (cmd/system-probe/modules/traceroute.go:72 in func1) | unable to run traceroute for host: 10.128.11.211: UDP traceroute failed to set packet filter: SetPacketFilter failed to apply BPF filter: SetBPFAndDrain failed to drain: %!w(<nil>)
```

Apparently this syscall can occasionally finish synchronously which
results in a nil error. I think it happens on hosts with barely any
network traffic.

### Describe how you validated your changes
In datadog-traceroute, I ran the packet filtering suite 100 times:
```
sudo env PATH="$PATH" go test -v -tags linux,test,root github.com/DataDog/datadog-traceroute/packets -count 100
```
(I had to make a few extra changes to get the root privileged test suite
compiling: [\[packets\] Use os.File to close all
fds](DataDog/datadog-traceroute#25))

---------

Co-authored-by: sabrina lu <[email protected]>
Backport c86042b from #40650.

___

### What does this PR do?
Adds the Versa integration to the build tasks as a core check so the
`conf.yaml.example` is included in the build.

### Motivation

### Describe how you validated your changes

### Additional Notes

Co-authored-by: Ken Schneider <[email protected]>
…gths (#40775)

Backport 4b43b06 from #40765.

___

This fixes a bug that occurs when the compressed length is longer than
the uncompressed one.

Add a cli for running irgen.

Co-authored-by: Piotr Bejda <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
…es (#40788)

Backport d4093ba from #40779.

___

#### dyninst/rcscrape: mark if a data item had a failed read

We've seen data items missing related to a runtime ID. Perhaps we
failed to read those memory addresses. We'd like to know about that.

Note that this same treatment should get applied to regular decoding
(perhaps in a more principled way). That's left for a different change
that won't be backported.


#### dyninst/rcscrape: gracefully handle decoding failures

If we can't decode a message, we don't want to shut down the entire
dyninst subsystem. The fact that that's what happens when decoding
fails is not great, but fixing it is out of scope for this change,
which we intend to backport.

There's an upcoming refactor to address that.

Fixes [DEBUG-4455](https://datadoghq.atlassian.net/browse/DEBUG-4455).


[DEBUG-4455]:
https://datadoghq.atlassian.net/browse/DEBUG-4455?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: ajwerner <[email protected]>
…tion/successful_async_initialization` (#40799)

Backport e0d98b0 from #40642.

___

### What does this PR do?
#### The Test Bug
`TestAsyncInitialization/successful_async_initialization` was flaking
due to a race condition deadlock. This PR fixes the test to un-flake the
test. This is not a bug in the ImageResolver feature, but a bug
specifically in the way the test is configured.

Prior to this change, the following race condition would cause a
deadlock:
1. `mockClient := newMockRCClient(...)` -> `blockGetConfigs`: `false`
   and `configsReady`: `make(chan struct{})`
2. `mockClient.setBlocking(true)` -> mutex lock **HOLD**,
   `blockGetConfigs`: `true`, mutex lock **RELEASE**
3. `resolver := newRemoteConfigImageResolverWithRetryConfig(...)`
   - `go func() waitForInitialConfig()` -> `rcClient.GetConfigs(...)`
     -> mutex lock **HOLD** until `configsReady` closed
4. `mockClient.setBlocking(false)` -> mutex lock **HOLD** ... ❌
   cannot close `configsReady` because the mutex is held by the goroutine.

If step 3 (the goroutine) gets the mutex lock first, this creates a
deadlock. If step 4 (the test) gets the mutex lock first, it
successfully closes the `configsReady` channel and releases the mutex
lock; the goroutine then takes the mutex lock, does its work, and
releases the mutex.

#### The Fix
The fix for the above is to explicitly release the mutex in
`GetConfigs(...)` after first storing the state of `m.blockGetConfigs` and
`m.configsReady` in local variables. `GetConfigs(...)` then blocks on
those local variables. This way, regardless of which step obtains the
mutex first, it will be released properly.
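
A minimal sketch of that pattern follows. It is not the test's real mock: the `mockRCClient` type and its method bodies are simplified assumptions that reuse the field names mentioned above. The essential point is that the mutex is never held while waiting on the channel.

```go
package main

import (
	"fmt"
	"sync"
)

// mockRCClient is a simplified, hypothetical stand-in for the test's mock.
type mockRCClient struct {
	mu              sync.Mutex
	blockGetConfigs bool
	configsReady    chan struct{}
}

func newMockRCClient() *mockRCClient {
	return &mockRCClient{configsReady: make(chan struct{})}
}

// setBlocking toggles blocking. Switching it off closes configsReady so that
// any GetConfigs call already waiting can proceed.
func (m *mockRCClient) setBlocking(block bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.blockGetConfigs && !block {
		close(m.configsReady)
	}
	m.blockGetConfigs = block
}

// GetConfigs copies the shared state under the lock, releases the lock, and
// only then waits on the local copies.
func (m *mockRCClient) GetConfigs() string {
	m.mu.Lock()
	block, ready := m.blockGetConfigs, m.configsReady
	m.mu.Unlock()

	if block {
		<-ready
	}
	return "configs"
}

func main() {
	m := newMockRCClient()
	m.setBlocking(true)

	done := make(chan string)
	go func() { done <- m.GetConfigs() }()

	m.setBlocking(false) // completes whichever side takes the mutex first
	fmt.Println(<-done)
}
```

If the goroutine runs first, it waits on `ready` without the lock and is released by the close. If the test runs first, the goroutine observes `block == false` and returns immediately.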

### Motivation
Un-flake the `TestAsyncInitialization/successful_async_initialization`
test for the ImageResolver to avoid blocking CI.

See example of failure from this
[here](https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1113117038).

### Describe how you validated your changes
Reproduced CI issue locally with artificial delays. The above fix
prevents this race condition consistently.

### Additional Notes

Co-authored-by: erikayasuda <[email protected]>
### Motivation

Backports #40685 and #40798
that fixes some edge cases for incident-42773.

### Describe how you validated your changes

### Additional Notes
…eResolver.Resolve()` (#40785)

Backport 730bb5d from #40697.

___

### What does this PR do?
Adds `image_resolution_attempts` telemetry to the newly added
`ImageResolver.Resolve()` function to track image resolution attempts,
and whether they resulted in a successful image digest resolution, or if
they defaulted to using the mutable tag.

### Motivation
This telemetry is necessary for the new K8s SSI gradual rollout feature
in order to determine whether a rollout is working successfully or not.
The goal is for this to be easier to display on a dashboard:

| `registry` | `repository` | `digest_resolution` | `outcome` | Notes |
|----------|----------|----------|----------|----------|
| `gcr.io/datadoghq` | `dd-lib-python-init` | `enabled` | `sha256:abc123` | ✅ Was supposed to resolve, did resolve |
| `hub.docker.com/r/datadog` | `dd-lib-java-init` | `enabled` | `v2` | ❌ Was supposed to resolve, but it did NOT resolve |
| `mycustomregistry.org` | `dd-lib-php-init` | `enabled` | `v2` | ✅ Cannot resolve for custom registry, did NOT resolve |
| `gallery.ecr.aws/datadog` | `dd-lib-rb-init` | `disabled` | `v1` | ✅ Was NOT supposed to resolve, did NOT resolve |
| `gcr.io/datadoghq` | `dd-lib-dotnet-init` | `disabled` | `sha256:abc123` | ❌ Was NOT supposed to resolve, did resolve (this shouldn't be possible) |


### Describe how you validated your changes
Ran a local app via `injector-dev` to verify that the telemetry counts
for `apm-inject` and `dd-lib-python-init` were accurate.

### Additional Notes

Co-authored-by: erikayasuda <[email protected]>
Co-authored-by: sabrina lu <[email protected]>
Backport d3ce06c from #40789.

___

### What does this PR do?
This commit fixes an issue where we only resolve image tags during
startup and not on every pod mutation by ensuring the resolution happens
just before use.

### Motivation
We've added gradual rollout support for Single Step Instrumentation
so that language libraries are released in a gradual fashion in
#39915. This was missed
during code review and caught during testing.

### Describe how you validated your changes
I tested this using
[injector-dev](https://github.com/DataDog/injector-dev):
```
injector-dev apply -f dev.yaml --profile staging --build
```

<details>
  <summary>dev.yaml</summary>

```yaml
helm:
  apps:
    - name: gradual-rollout-test
      namespace: application
      values:
        env:
          - name: DD_TRACE_DEBUG
            value: "true"
          - name: DD_APM_INSTRUMENTATION_DEBUG
            value: "true"
        image:
          repository: registry.ddbuild.io/ci/injector-dev/python
          tag: 2cd78ded
        podLabels:
          language: python
          tags.datadoghq.com/env: local
        service:
          port: "8080"
  versions:
    agent: 7.69.1
    cluster_agent:
      version: 7.69.1
      build: {}
    injector:
      version: 0.44.0
  config:
    clusterAgent:
      env:
        - name: DD_REMOTE_CONFIGURATION_ENABLED
          value: "true"
    datadog:
      site: "datad0g.com"
      apm:
        instrumentation:
          enabled: true
          targets:
            - name: python
              podSelector:
                matchLabels:
                  language: python
              ddTraceVersions:
                python: default
```

</details>

### Additional Notes
We will need this backported to `7.71.x`

Co-authored-by: Mark Spicer <[email protected]>
Co-authored-by: adel121 <[email protected]>
…history from status and allow directional-only fallback (#40823)

Backport f65e87a from #40542

___

### What does this PR do?

* Load recommendation history from new `LastRecommendations` field in CR
* Allow local fallback to only be applied if it is for upscale /
downscale / for both

### Motivation

Loading recommendation history allows us to more accurately determine
the stabilized recommendation

Changes to fallback allow us to shorten the time to enable fallback by
providing users the flexibility to set how they'd like fallback to be
activated

### Describe how you validated your changes

1. Deploy these changes; set up DPA CR to set fallback direction
2. Verify that fallback is only enabled when the scaling direction
matches the enabled direction
3. Restart the cluster agent and verify that the DPAs in the store are
populated with recommendation history data from the CR

### Additional Notes
… checks (#40767)

Backport f10a8b6 from #40532.

___

### What does this PR do?
Stop using GET /probe endpoints to perform connectivity checks.
Replaced by POST with empty payloads or removed.

### Motivation
Support question from a concerned customer.
In the connectivity check, we use GET /probe endpoints.
These endpoints are exposed by Synthetics, whose availability depends on
the customer org's setup, so the check gets a 403 when the endpoint is
not available.

### Describe how you validated your changes
Manual QA

### Additional Notes

Co-authored-by: san-jos <[email protected]>
…40832)

Backport 4dc700f from #40827.

___

Before this change, we wouldn't send fresh diagnostics for updated
probes, making it seem like we haven't installed them.

Fixes https://datadoghq.atlassian.net/browse/DEBUG-4467

Co-authored-by: ajwerner <[email protected]>
…tchers for api/app keys & common HTTP auth headers (#40837)

Backport 2643133 from #40774.

___

### What does this PR do?
- **Normalizes YAML keys to lowercase before matching** so scrubbing is
  case-insensitive across config variants.
  - In `ScrubDataObj`, keys are lowercased for the `YAMLKeyRegex` check.
- **Expands key match coverage**:
  - `api_key` → `api[-_]?key`
  - `ap(?:p|plication)_?key` → `ap(?:p|plication)[-_]?key`
  - Adds support for **HTTP header-style fields** via
    `matchYAMLKeyPrefixSuffix("x-","key|token|auth", …)` and explicit lists:
    - `x-api-key`, `x-rapidapi-key`, `x-functions-key`, `x-octopus-apikey`,
      `x-dreamfactory-api-key`, `x-lz-api-key`, `x-pm-partner-key`,
      `x-sungard-idp-api-key`, `x-vtex-api-appkey`
    - `x-auth-token`, `x-rundeck-auth-token`
    - `x-auth`, `x-stratum-auth`
  - Adds **exact key matches** for common auth fields:
    - `auth-tenantid`, `authority`, `cainzapp-api-key`, `cms-svc-api-key`,
      `lodauth`, `sec-websocket-key`, `statuskey`
- **Regex robustness**:
  - Allow hyphens in YAML key detectors (`(\w|_|-)`) in `matchYAMLKeyPart`
    and `matchYAMLKeyEnding`.
  - New helper: `matchYAMLKeyPrefixSuffix`.
- **Version metadata**:
  - Sets `LastUpdated` to `7.70.2` for updated replacers.
- **Tests**:
  - `pkg/util/scrubber/default_test.go`: new
    `TestNewHTTPHeaderAndExactKeys`.
  - `pkg/util/scrubber/yaml_scrubber_test.go`: comprehensive table-driven
    cases for case/format variants.
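
The lowercase-then-match idea from the list above can be sketched as follows. This is an illustration only, not the scrubber's implementation: the `sensitiveKey` variable and `isSensitiveKey` function are hypothetical, and the pattern simply combines the two expanded expressions quoted in the list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sensitiveKey is an illustrative pattern, not the scrubber's real one.
var sensitiveKey = regexp.MustCompile(`^(?:api[-_]?key|ap(?:p|plication)[-_]?key)$`)

// isSensitiveKey lowercases the key before matching, which makes the check
// case-insensitive without a case-insensitive regex flag.
func isSensitiveKey(key string) bool {
	return sensitiveKey.MatchString(strings.ToLower(key))
}

func main() {
	keys := []string{"APIKEY", "Api_key", "api-key", "Application-Key", "hostname"}
	for _, k := range keys {
		fmt.Printf("%-16s sensitive=%v\n", k, isSensitiveKey(k))
	}
}
```

Anchoring the pattern with `^` and `$` is a choice made for this sketch so that benign keys that merely contain a similar substring stay untouched.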

### Motivation
Configurations and headers vary widely in **case** (`APIKEY`, `Api-Key`)
and **separators** (`_` vs `-`). Prior scrubbing missed many real-world
keys, potentially leaking secrets in logs/support bundles. Lowercasing
keys for matching + expanding patterns closes these gaps without
over-scrubbing generic keys.

### Describe how you validated your changes
- **Unit tests** (new & updated) cover:
  - Case variants: `APIKEY`, `Api_key`, `api-key`, `apikey`
  - HTTP headers with `x-` prefix and `key`/`token`/`auth` suffixes
  - Exact-match auth fields
  - Non-matching benign keys remain untouched

Co-authored-by: louis-cqrl <[email protected]>
Co-authored-by: sabrina lu <[email protected]>
Backport 909cc7d from #40565.

___

### What does this PR do?
Adds minified versions of several JavaScript files and replaces the
originals in deliverables to reduce their size.

Co-authored-by: Joseph Gette <[email protected]>
…name as key (#40871)

Backport 191acf6 from #40762.

___

### What does this PR do?
Restructures the `ImageResolver` cache structure so that the keys are
only the repository names (ex - `dd-lib-python-init`) and not the
repository URL (ex - `gcr.io/dd-lib-python-init`).
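
As a rough sketch of the keying change, assuming the repository name is the last path segment of the URL: the helper `repositoryName`, the map-based cache, and the placeholder digest below are all hypothetical, and the real `ImageResolver` may derive its keys differently.

```go
package main

import (
	"fmt"
	"strings"
)

// repositoryName (hypothetical helper) returns the last path segment of a
// repository URL, for example "dd-lib-python-init".
func repositoryName(repositoryURL string) string {
	trimmed := strings.TrimSuffix(repositoryURL, "/")
	if i := strings.LastIndex(trimmed, "/"); i >= 0 {
		return trimmed[i+1:]
	}
	return trimmed
}

func main() {
	// Cache keyed by repository name only; the digest is a placeholder.
	cache := map[string]string{}
	cache[repositoryName("gcr.io/datadoghq/dd-lib-python-init")] = "sha256:abc123"

	// The same library pulled from another registry reaches the same entry.
	digest, ok := cache[repositoryName("gallery.ecr.aws/datadog/dd-lib-python-init")]
	fmt.Println(digest, ok)
}
```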

It also enables us to configure the default Datadog container registries
via configuration. This is not public-facing, and will be used primarily
for dogfooding the new feature on staging (given staging uses different
container registries).

### Motivation
The repository URL field in the `K8S_INJECTION_DD` remote config data
was not designed to be consumed by DCA, but primarily for use by our
internal `equilibrium` tooling. This meant that by using the repository
URL for the key, it would be limited to only allowing gradual rollout
for customers using `gcr.io/datadoghq` as their registry. See
[here](https://docs.datadoghq.com/tracing/trace_collection/automatic_instrumentation/single-step-apm/kubernetes/?tab=agentv764recommended#change-the-default-image-registry)
for other valid Datadog registries.

### Describe how you validated your changes
- Updated existing tests
- Added new unit tests
- Local E2E testing with `injector-dev`

### Additional Notes

Co-authored-by: erikayasuda <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
…imit (#40876)

Backport e693ea6 from #40860.

___

### What does this PR do?

Before this change, we'd flush *after* adding a message that puts
the batch over the limit. Now we'll flush the current buffer before
exceeding the limit.
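
The ordering change can be sketched with a small size-limited buffer. The `batcher` type below is hypothetical and only illustrates flushing before the limit would be exceeded; it is not the uploader's actual code.

```go
package main

import "fmt"

// batcher is a hypothetical size-limited buffer used to show the ordering.
type batcher struct {
	limit   int
	size    int
	pending [][]byte
	flushed [][][]byte // every flushed batch, kept here for inspection
}

// flush moves the pending messages into flushed and resets the buffer.
func (b *batcher) flush() {
	if len(b.pending) == 0 {
		return
	}
	b.flushed = append(b.flushed, b.pending)
	b.pending, b.size = nil, 0
}

// add flushes the current buffer first when msg would push it over the
// limit, so a flushed batch exceeds the limit only if one message does.
func (b *batcher) add(msg []byte) {
	if b.size+len(msg) > b.limit {
		b.flush()
	}
	b.pending = append(b.pending, msg)
	b.size += len(msg)
}

func main() {
	b := &batcher{limit: 10}
	for i := 0; i < 3; i++ {
		b.add([]byte("abcd")) // 4 bytes each
	}
	b.flush()
	for i, batch := range b.flushed {
		fmt.Printf("batch %d: %d messages\n", i, len(batch))
	}
}
```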

Fixes https://datadoghq.atlassian.net/browse/DEBUG-4480

### Motivation

We've been seeing 413 errors in staging.

### Describe how you validated your changes

Added testing.

Co-authored-by: ajwerner <[email protected]>
…I are properly sorted (#40895)

Backport 7be3f24 from #40889.

___

### What does this PR do?

Ensures that the devices returned by the PodResources API are properly
sorted before being returned. In some cases we have seen k8s returning
them in a different order than the one in which they are seen in the
system.

### Motivation

Ensure correct attribution of devices.

https://datadoghq.atlassian.net/browse/EBPF-813

### Describe how you validated your changes

Added unit tests.

### Additional Notes

Co-authored-by: Guillermo Julián <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
Backport 7bd16cd from #40924.

___

### What does this PR do?

Fixes #39647 by closing the
root after using it.

### Motivation

### Describe how you validated your changes

### Additional Notes

Co-authored-by: Paul Cacheux <[email protected]>
dd-octo-sts bot and others added 26 commits September 24, 2025 11:16
…by main consumer loop (#41226)

Backport 74a7786 from #41193.

___

### What does this PR do?

Changes the processing of process exit events so that they happen in the
main consumer loop.

### Motivation

Avoid race conditions by ensuring process exits are handled in the same
goroutine, no matter whether they come from process scans or from the
process monitor.
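
A minimal sketch of the single-goroutine approach, under the assumption that both sources deliver exited PIDs over channels: the `consume` function, the channel names, and the `active` map are hypothetical and not the actual consumer code.

```go
package main

import "fmt"

// consume handles exits from both sources in one goroutine, so the active
// set is only ever touched from this loop and needs no locking.
func consume(scanExits, monitorExits <-chan int, active map[int]bool) {
	for scanExits != nil || monitorExits != nil {
		select {
		case pid, ok := <-scanExits:
			if !ok {
				scanExits = nil // closed: stop selecting on this source
				continue
			}
			delete(active, pid)
		case pid, ok := <-monitorExits:
			if !ok {
				monitorExits = nil
				continue
			}
			delete(active, pid)
		}
	}
}

func main() {
	active := map[int]bool{10: true, 11: true, 12: true}
	scan, monitor := make(chan int), make(chan int)

	go func() { scan <- 10; close(scan) }()
	go func() { monitor <- 11; monitor <- 10; close(monitor) }()

	consume(scan, monitor, active)
	fmt.Println("still active:", len(active))
}
```

Because deleting a missing key is a no-op, the same exit reported by both sources is handled safely.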

### Describe how you validated your changes

Added unit tests to ensure process exit is properly handled.

### Additional Notes

Co-authored-by: Guillermo Julián <[email protected]>
The kubelet entity is currently only used to generate a tag on
kubernetes check metrics, it has no tags of its own.

(cherry picked from commit 34f0e21)

### What does this PR do?
Fixes the error logs that occur when trying to build the kubelet's
tagger entity ID

### Motivation
Reduce error logs

### Describe how you validated your changes
1. Deploy RC 7 and run `agent check kubelet` to kickoff the workloadmeta
kubelet collector
```
root@justin-lesko-minikube:/# agent check kubelet | grep "kubelet-id"
2025-09-23 19:05:49 UTC | CORE | ERROR | (comp/core/tagger/common/entity_id_builder.go:35 in BuildTaggerEntityID) | can't recognize entity "kubelet-id" with kind "kubelet"; trying kubelet-id://kubelet as tagger entity
2025-09-23 19:05:49 UTC | CORE | ERROR | (comp/core/tagger/collectors/workloadmeta_extract.go:161 in processEvents) | cannot handle event for entity "kubelet-id" with kind "kubelet"
```

2. Deploy this branch and observe no more errors
```
root@justin-lesko-minikube:/# agent check kubelet | grep "kubelet-id"
root@justin-lesko-minikube:/#
```

### Additional Notes
…r older CRs (#41234)

Backport 0b9ffe1 from #41212.

___

### What does this PR do?

Fixes a case where the CR does not contain any value, which prevented
any fallback from being applied.
Fixes feature introduced in
#40542

### Motivation

Bugfix

### Describe how you validated your changes

Use a 7.71+ Agent version with outdated CRs or an outdated CRD; local
fallback should still happen instead of being refused because the
scaling direction is disabled.

### Additional Notes

Cluster Agent impact only

Co-authored-by: Vincent Boulineau <[email protected]>
…d for psycopg #41003 (#41238)

### What does this PR do?

### Motivation

### Describe how you validated your changes

### Additional Notes

---------

Co-authored-by: Florent Clarret <[email protected]>
Co-authored-by: sabrina-datadog <[email protected]>
Co-authored-by: dd-octo-sts[bot] <200755185+dd-octo-sts[bot]@users.noreply.github.com>
…ClosedProcesses flaky test (#41310)

Backport ec77d23 from #41302.

___

### What does this PR do?

Fixes a low frequency flaky test. The way the test was written, the
`procRoot` with the fake data was not being actually used to check for
the closed processes (it was being passed to the context but not the
consumer), and instead it was using the real `/proc` path. This made the
test work most of the time, except when we had a real process with the
PID of our fake process.

### Motivation

Eliminate flaky tests.

### Describe how you validated your changes

Fixed the unit test.

### Additional Notes

Co-authored-by: Guillermo Julián <[email protected]>
…re release (#41314)

### What does this PR do?
Backports #41196 to 7.71.x

### Motivation
Making sure tests don't break on 7.71.x when a new install script gets
released

### Describe how you validated your changes
If E2E tests pass we're good!

### Additional Notes

---------

Co-authored-by: Arthur Bellal <[email protected]>
Backport c46abad from #40181.

___

### What does this PR do?

> [!NOTE]
> Buildimages are also bumped in this PR.

This migrates GitLab PATs: we now use a custom API to generate
short-lived GitLab PATs.

### Motivation

### Describe how you validated your changes

### Possible Drawbacks / Trade-offs

### Additional Notes

Co-authored-by: Célian Raimbault <[email protected]>
### What does this PR do?

fix release notes

### Motivation

### Describe how you validated your changes

### Additional Notes
…onfig stream snapshot creation (#41289)

Backport e4fcdfc from #41279.

___

### What does this PR do?
Updates config stream to handle nested keys in the configuration that
are type `map[interface{}]interface{}`

### Motivation

Previously, we were only converting the top level to
`map[string]interface{}`.
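
A recursive conversion of this kind can be sketched as follows. The `normalize` function is a hypothetical illustration and not the config stream's actual code. Non-string keys are stringified with `fmt.Sprint`, which is an assumption of this sketch.

```go
package main

import "fmt"

// normalize recursively converts map[interface{}]interface{} values into
// map[string]interface{}, descending into nested maps and slices.
func normalize(v interface{}) interface{} {
	switch t := v.(type) {
	case map[interface{}]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			out[fmt.Sprint(k)] = normalize(val)
		}
		return out
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, val := range t {
			out[k] = normalize(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, val := range t {
			out[i] = normalize(val)
		}
		return out
	default:
		return v
	}
}

func main() {
	// The keys and values below are placeholders, not real settings.
	raw := map[string]interface{}{
		"outer": map[interface{}]interface{}{
			"inner": map[interface{}]interface{}{"enabled": true},
		},
	}
	fmt.Printf("%#v\n", normalize(raw))
}
```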

### Describe how you validated your changes
1. Create a custom image and deploy it on an experimental cluster
(sasquatch) with ADP enabled.
2. Run `agent-data-plane config` and see the config.
3. Update a setting (ex: `agent config set dogstatsd_stats true`)
4. Run `agent-data-plane config` and see the updated config value.

### Additional Notes

Co-authored-by: Raymond Zhao <[email protected]>
Backport a4e26e7 from #41166.

___

### What does this PR do?
don't pass context when launching detached process

### Motivation
https://datadoghq.atlassian.net/browse/WINA-1666
https://datadoghq.atlassian.net/browse/WINA-1707
fix bug that can cause fleet upgrades to fail

### Describe how you validated your changes
existing E2E tests fail due to this issue once every few days

manual test: added sleep after `i.stop()` to give time for background
terminate process to finish

### Additional Notes
before returning from main, `hookCommand` calls defer `i.stop(err)`

https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/hooks.go#L37-L38
`i.stop` calls `c.stopsighandler`

https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/command.go#L73-L80
which is the context cancelfunc

https://github.com/DataDog/datadog-agent/blob/4c3ba894409b1fde22ed134e0b7a2eb518b80297/pkg/fleet/installer/commands/command.go#L46-L55

Since this context was passed to `exec.CommandContext` when launching
the detached `postStartExperimentBackground` subprocess, it sees
`ctx.Done()` and sends a kill signal to the subprocess.

Co-authored-by: Branden Clark <[email protected]>
…hes do not match (#41324)

Backport 8fb9d9b from #41301.

___

### What does this PR do?

This PR fixes the behaviour of the Cluster Agent when the `.Spec`
retrieved _after_ a DPA has been created/updated contains values not
originating from the Cluster Agent itself.

The discrepancy can come from different sources: admission controller,
CRD defaulting, etc. which are not necessarily expected but should not
break the overall feature either.

### Motivation

Fixing inactive remote Autoscalers when Remote Spec hash is different
from actual object Spec hash after update.

### Describe how you validated your changes

Change validated in a test Kubernetes cluster. It can be reproduced by
creating a remote autoscaler with a recent version of the Cluster
Agent/CRD that has defaulting for local fallback values.

Added a unit test to cover this specific case.

### Additional Notes

The current solution is not optimal: we cannot cope with manual changes
made while the Cluster Agent leader was not available at the time of the
update.

Co-authored-by: Vincent Boulineau <[email protected]>
…-definitions to 2a6d59a9b3f3a7a6c91630515ad6ee659256b9a2 (#41437)

This PR was automatically created by the test-infra-definitions bump
task.

This PR bumps the test-infra-definitions submodule to
2a6d59a9b3f3a7a6c91630515ad6ee659256b9a2 from 1faea1273955. Here is the
full changelog between the two commits:
DataDog/test-infra-definitions@1faea12...2a6d59a

⚠️ This PR is opened with the `qa/no-code-change` and
`changelog/no-changelog` labels by default. Please make sure this is
appropriate

### What does this PR do?

### Motivation

### Describe how you validated your changes

### Additional Notes

---------

Co-authored-by: agent-platform-auto-pr[bot] <153269286+agent-platform-auto-pr[bot]@users.noreply.github.com>
Co-authored-by: Célian Raimbault <[email protected]>
### What does this PR do?
Backports #41375 &
#41458 to 7.71.x

### Motivation
Fixing the APM SSI script

### Describe how you validated your changes
E2E, manual QA

### Additional Notes
Backport 02bfb4b from #41435.

___

### What does this PR do?
Unpins the install script in the test where it is pinned

### Motivation
Test is currently failing because the pin is >5 versions old

### Describe how you validated your changes
E2E

### Additional Notes

Co-authored-by: Baptiste Foy <[email protected]>
Fixes the generic container corecheck processor to receive the parsed
value for ExtendedMemory collection

Address bug with the parsing and configuration of extended memory metric
collection in the containers processor

Deploy Agent locally with container check enabled with extended memory
metric collection enabled:

```
datadog:
  confd:
    container.yaml: |-
      ad_identifiers:
        - _container
      init_config:
      instances:
        - extended_memory_metrics: true
```

Run the container corecheck and expect extended memory metrics like
`container.memory.active_file` to be outputted.
```
k exec -it datadog-agent-linux-bj667 -n datadog-agent -- agent check container | grep container.memory.active_file
Defaulted container "agent" out of: agent, trace-agent, process-agent, init-volume (init), init-config (init)
    "metric": "container.memory.active_file",
    "metric": "container.memory.active_file",
    "metric": "container.memory.active_file",
    "metric": "container.memory.active_file",
```

Follow up e2e tests can be added for detecting the presence of extended
memory metrics

(cherry picked from commit 8f75bfd)

### What does this PR do?

### Motivation

### Describe how you validated your changes

### Additional Notes
Backport 91376d3 from #41428.

___

This PR fixes a bug preventing the update of configurations through
Fleet on windows.

With the recent switch to modifying the user configuration dir directly
we failed to account that the "configuration" directory on
Windows also contains all the runtime files (python cache, installer
state, etc...). This broke the current implementation of the experiment.

This PR fixes the issue by disabling config experiments on windows.

### QA

This change was QA'd manually.

Co-authored-by: Arthur Bellal <[email protected]>
…only (#41508)

Backport 46246af from #41506.

___

### What does this PR do?

Skips cgroup tests for `pkg/gpu` when the cgroupfs is not writable.

Additionally, it marks the oracle job as allowed to fail, as it started
to fail at the same point as the pkg/gpu tests.

### Motivation

Fix for #incident-43807

### Describe how you validated your changes

CI green. These tests also execute in KMT, where we have full access to
the cgroup, so we do not lose coverage.

### Additional Notes

Co-authored-by: Guillermo Julián <[email protected]>
… generation (#41509)

Backport d950e76 from #41507.

___

### What does this PR do?

Don't use short-lived tokens in the CI.

### Motivation

### Describe how you validated your changes

### Additional Notes

Co-authored-by: Célian Raimbault <[email protected]>
@amitslavin added the `changelog/no-changelog`, `team/usm` (The USM team), and `qa/done` (QA done before merge and regressions are covered by tests) labels on Nov 30, 2025
@agent-platform-auto-pr
Contributor

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 20c9c84

Successful checks

Info

| Quality gate | Delta | On disk size (MiB) | Delta | On wire size (MiB) |
|----------|----------|----------|----------|----------|
| agent_deb_amd64 | DataNotFound | 694.01 < 709.39 | DataNotFound | 174.99 < 178.58 |
| agent_deb_amd64_fips | DataNotFound | 688.52 < 703.09 | DataNotFound | 173.47 < 178.12 |
| agent_heroku_amd64 | DataNotFound | 332.37 < 355.37 | DataNotFound | 88.24 < 95.72 |
| agent_msi | DataNotFound | 981.8 < 986.02 | DataNotFound | 150.15 < 152.67 |
| agent_rpm_amd64 | DataNotFound | 694.0 < 709.38 | DataNotFound | 177.29 < 181.22 |
| agent_rpm_amd64_fips | DataNotFound | 688.51 < 703.08 | DataNotFound | 175.75 < 179.85 |
| agent_rpm_arm64 | DataNotFound | 680.4 < 695.74 | DataNotFound | 158.68 < 163.96 |
| agent_rpm_arm64_fips | DataNotFound | 675.98 < 693.05 | DataNotFound | 157.54 < 163.0 |
| agent_suse_amd64 | DataNotFound | 694.0 < 709.38 | DataNotFound | 177.29 < 181.22 |
| agent_suse_amd64_fips | DataNotFound | 688.51 < 703.08 | DataNotFound | 175.75 < 179.85 |
| agent_suse_arm64 | DataNotFound | 680.4 < 695.74 | DataNotFound | 158.68 < 163.96 |
| agent_suse_arm64_fips | DataNotFound | 675.98 < 693.05 | DataNotFound | 157.54 < 163.0 |
| docker_agent_amd64 | DataNotFound | 765.4 < 788.65 | DataNotFound | 262.37 < 272.01 |
| docker_agent_arm64 | DataNotFound | 775.78 < 802.0 | DataNotFound | 248.47 < 259.7 |
| docker_agent_jmx_amd64 | DataNotFound | 956.28 < 979.84 | DataNotFound | 330.99 < 340.95 |
| docker_agent_jmx_arm64 | DataNotFound | 955.38 < 981.8 | DataNotFound | 313.1 < 324.65 |
| docker_cluster_agent_amd64 | DataNotFound | 213.04 < 214.5 | DataNotFound | 72.32 < 73.51 |
| docker_cluster_agent_arm64 | DataNotFound | 228.98 < 230.33 | DataNotFound | 68.58 < 69.77 |
| docker_cws_instrumentation_amd64 | DataNotFound | 7.07 < 7.12 | DataNotFound | 2.95 < 3.29 |
| docker_cws_instrumentation_arm64 | DataNotFound | 6.69 < 6.92 | DataNotFound | 2.71 < 3.07 |
| docker_dogstatsd_amd64 | DataNotFound | 38.37 < 39.57 | DataNotFound | 14.82 < 15.76 |
| docker_dogstatsd_arm64 | DataNotFound | 37.07 < 38.2 | DataNotFound | 14.27 < 14.83 |
| dogstatsd_deb_amd64 | DataNotFound | 29.59 < 31.4 | DataNotFound | 7.81 < 8.95 |
| dogstatsd_deb_arm64 | DataNotFound | 28.18 < 29.97 | DataNotFound | 6.76 < 7.89 |
| dogstatsd_rpm_amd64 | DataNotFound | 29.59 < 31.4 | DataNotFound | 7.81 < 8.96 |
| dogstatsd_suse_amd64 | DataNotFound | 29.59 < 31.4 | DataNotFound | 7.81 < 8.96 |
| iot_agent_deb_amd64 | DataNotFound | 54.42 < 54.97 | DataNotFound | 13.73 < 14.45 |
| iot_agent_deb_arm64 | DataNotFound | 51.71 < 51.9 | DataNotFound | 11.87 < 12.63 |
| iot_agent_deb_armhf | DataNotFound | 51.29 < 51.84 | DataNotFound | 11.94 < 12.74 |
| iot_agent_rpm_amd64 | DataNotFound | 54.42 < 54.97 | DataNotFound | 13.75 < 14.47 |
| iot_agent_suse_amd64 | DataNotFound | 54.42 < 54.97 | DataNotFound | 13.75 < 14.47 |
