Skip to content

libbeat/testing: use ephemeral ports to avoid TOCTOU collisions#51617

Open
orestisfl wants to merge 13 commits into
elastic:mainfrom
orestisfl:tests/ephemeral-ports-toctou
Open

libbeat/testing: use ephemeral ports to avoid TOCTOU collisions#51617
orestisfl wants to merge 13 commits into
elastic:mainfrom
orestisfl:tests/ephemeral-ports-toctou

Conversation

@orestisfl

@orestisfl orestisfl commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Proposed commit message

libbeat/testing: use ephemeral ports to avoid TOCTOU collisions

Allows us to stress-test OTel tests.

Tests that needed a free TCP port called MustAvailableTCP4Port, which
bound :0, read the assigned port, closed the listener, and returned the
number. That available port can become unavailable before the caller
tries to use it. Running the same test many times in parallel (e.g. via
script/stresstest.sh -p 30) makes those cross-process collisions likely,
producing flaky bind failures.

Details:

Remove the MustAvailableTCP4Port(s) and AvailableTCP4Port(s) helpers and
convert every caller to let the OS assign an ephemeral port at bind time
and read the bound port back:

- OTel integration tests (filebeat/metricbeat/packetbeat) set
  http.port: 0 and discover the resolved monitoring port from the
  "Metrics endpoint listening on:" log line via the new shared
  libbeat/testing.ParseMonitoringPort, exposed through
  oteltestcol.Collector.MonitoringPort(s) (collector logs) and
  integration.BeatProc.MonitoringPort (beat log file).
- packetbeat's test HTTP servers bind ephemeral ports and the sniffer
  is configured from the resolved port.
- the filebeat ES-outage test and the heartbeat connection-refused
  tests bind :0 in-process, releasing the port for the
  connection-refused / restart cases.

The test OTel collector is also made parallel-safe: oteltestcol.New
merges service.telemetry.metrics.level: none into every collector
config so collectors no longer bind the fixed :8888 Prometheus port,
and Collector.Shutdown blocks until the collector has fully exited so
TestNoDuplicates can restart a collector without polling :8888 as an
exit signal. The now-redundant per-config telemetry metrics-off blocks
are removed.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • [ ] I have added an entry in ./changelog/fragments — test-only change, skip-changelog applied.

Disruptive User Impact

None. The change is limited to test code and helpers.

How to test this PR locally

The flakiness this fixes only shows up under parallel stress, so verify both the unit and integration tests stay green under load.

Unit tests

script/stresstest.sh ./heartbeat/monitors/active/tcp '^TestConnectionRefusedEndpointJob$' -p 30
script/stresstest.sh ./heartbeat/monitors/active/http '^TestConnRefusedJob$' -p 30

Both ran ~5 minutes with 0 failures (~150k and ~139k runs respectively).

Integration tests (OTel collectors)

The collectors no longer bind any fixed port, so many can run concurrently against a single Elasticsearch. Bring Elasticsearch up once, then stress each beat's OTel tests in parallel:

cd x-pack/filebeat && mage docker:composeUp   # one-time; the ES on :9200 is shared by all three beats
cd ../..

# Each OTel E2E run creates a per-run data stream. Raise the shard limit so a
# long parallel run does not hit the default 1000-shards-per-node cap
curl -s -u admin:testing -XPUT localhost:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.max_shards_per_node":100000}}'

script/stresstest.sh --tags integration ./x-pack/filebeat/tests/integration   '^TestFilebeatOTelE2E$' -p 12
script/stresstest.sh --tags integration ./x-pack/filebeat/tests/integration   '^TestNoDuplicates$' -p 6
script/stresstest.sh --tags integration ./x-pack/metricbeat/tests/integration '^TestMetricbeatOTelE2E$|^TestMetricbeatOTelMultipleReceiversE2E$' -p 10
script/stresstest.sh --tags integration ./x-pack/packetbeat/tests/integration '^TestPacketbeatOTelE2E$|^TestPacketbeatOTelMultipleReceiversE2E$|^TestPacketbeatOTelBeatE2E$' -p 10

Remove `libbeattesting.MustAvailableTCP4Port` and the related
`AvailableTCP4Port`/`AvailableTCP4Ports`/`MustAvailableTCP4Ports`
helpers (and their tests).

WHY: these helpers found a "free" port by binding `:0`, reading the
assigned port, then closing the listener and returning the number. That
leaves a time-of-check/time-of-use window: the caller binds the port for
real only later, and in between the OS can hand the just-released port to
another process. Running the same test many times in parallel (e.g. via
`script/stresstest.sh -p 30`) makes those cross-process collisions likely,
producing flaky bind failures.

WHAT: convert every caller to let the OS assign an ephemeral port at bind
time and read the bound port back:

- otel integration tests (filebeat/metricbeat/packetbeat) set
  `http.port: 0` and discover the resolved monitoring port from the
  "Metrics endpoint listening on:" log line.
- New shared helpers do the discovery: `oteltestcol.Collector.MonitoringPort`
  /`MonitoringPorts` (from the collector's observed logs) and
  `integration.BeatProc.MonitoringPort` (from the beat process log file),
  both backed by `libbeat/testing.ParseMonitoringPort` so the log format
  lives in one place.
- packetbeat's test HTTP servers bind ephemeral ports and the sniffer is
  configured from the resolved port.
- the filebeat ES-outage test and the heartbeat connection-refused tests
  bind `:0` in-process (releasing for the connection-refused / restart
  cases), removing the early pre-allocation.

Verified with `script/stresstest.sh -p 30` (~5 min) on the heartbeat
connection-refused tests (0 failures over ~150k/~139k runs) and by running
the filebeat/metricbeat/packetbeat otel E2E tests against Elasticsearch.
The test OTel collector defaulted to a Prometheus telemetry reader on a
fixed localhost:8888, so two collectors could not run at once. Stress
runs (stresstest.sh -p N) of any collector-based integration test
therefore collided on that port.

Disable the collector's own telemetry metrics by default in
oteltestcol.New (service.telemetry.metrics.level: none) unless the test
set a level; the tests assert on Elasticsearch documents and the
per-receiver HTTP monitoring endpoint, not the collector's own metrics,
so this is safe. Collector.Shutdown now blocks until the collector has
fully exited so callers can restart a collector that reuses the same
path.home without racing teardown.

TestNoDuplicates previously polled :8888 to detect that the previous
collector had exited before restarting; the blocking Shutdown replaces
that, removing the last fixed-port dependency and letting these tests
run in parallel.
Small follow-up cleanups with no behavior change:

- oteltestcol.New: drop the unreachable `if err != nil` block after
  require.NoError and fold the WriteFile call into the assertion.
- libbeat/testing.monitoring: build the endpoint regex from
  MonitoringEndpointSnippet via regexp.QuoteMeta so the literal lives in
  one place.
- TestNoDuplicates: drop the redundant t.Cleanup after the collector
  restart; oteltestcol.New already registers shutdown via t.Cleanup.
oteltestcol.New now merges service.telemetry.metrics.level: none into
every test collector config, so the per-config blocks that disabled the
collector's own metrics (to avoid the fixed :8888 Prometheus port) are
redundant. Remove them.

Where a config only set the metrics level, the whole telemetry block is
dropped; where it also set telemetry.logs.level, only the metrics entry
is removed so the log level is preserved.
@orestisfl orestisfl added Team:obs-ds-hosted-services Label for the Observability Hosted Services team cleanup Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-active-all Automated backport with mergify to all the active branches skip-changelog labels Jun 29, 2026
@orestisfl orestisfl self-assigned this Jun 29, 2026
@botelastic botelastic Bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 29, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • /test : Run the Buildkite pipeline.

@orestisfl orestisfl added the Team:Security-Linux Platform Linux Platform Team in Security Solution label Jun 29, 2026
@orestisfl orestisfl marked this pull request as ready for review June 29, 2026 14:58
@orestisfl orestisfl requested a review from a team as a code owner June 29, 2026 14:58
@orestisfl orestisfl requested review from a team as code owners June 29, 2026 14:58
@orestisfl orestisfl requested review from mauri870 and rdner June 29, 2026 14:58
@infra-vault-gh-plugin-prod

Copy link
Copy Markdown

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@infra-vault-gh-plugin-prod

Copy link
Copy Markdown

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@infra-vault-gh-plugin-prod

Copy link
Copy Markdown

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR removes the libbeat/testing package's AvailableTCP4Port* helper functions and migrates all call sites to OS-assigned ephemeral ports via net.Listen(":0"). It adds ParseMonitoringPort and MonitoringEndpointSnippet to libbeat/tests/integration for extracting the bound monitoring port from Beat startup logs, plus a BeatProc.MonitoringPort(timeout) polling helper. The oteltestcol.Collector harness gains a done channel and sync.Once for deterministic shutdown, and now writes a metrics-off.yaml config override merged via a multi-URI newCollectorSettings. All OTel integration tests across filebeat, metricbeat, and packetbeat are updated to set http.port: 0 and retrieve actual ports via collector.MonitoringPort(t) / MonitoringPorts(t, n).

Possibly related PRs

  • elastic/beats#51091: Modifies the same celOTelConfig collector configuration in otel_cel_test.go that this PR updates.
  • elastic/beats#51428: Introduces the packetbeat OTel integration test suite in otel_test.go whose monitoring-port handling this PR refactors.

Suggested labels

bugfix

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • 🛠️ Update Documentation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@filebeat/input/net/tcp/input_test.go`:
- Line 106: The benchmark and related connection-refused tests still use
ephemeralTCPAddr, which closes the listener before inp.Run binds, leaving a
TOCTOU race on the chosen port. Replace this bind-then-close pattern in
TestInputRunTCP and the other affected tests with a setup that keeps the
listener reserved until the TCP input is ready to accept, or otherwise
coordinate the bind so serverAddr is not exposed prematurely. Use the existing
inp.Run and serverAddr call sites to update the test flow consistently across
this cohort.

In `@x-pack/filebeat/tests/integration/otel_test.go`:
- Around line 1487-1496: The ephemeral port selection in the mock Elasticsearch
setup reintroduces a TOCTOU race by closing the listener before the mock server
owns the port. Update the setup around the port reservation logic in the test
helpers that use createMockServer so the server keeps ownership of the bound
endpoint throughout its lifecycle. Prefer an always-bound mock endpoint and
toggle outage behavior at the handler/transport layer instead of releasing
serverPort early. Apply the same fix anywhere the same port reservation pattern
is duplicated in this test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f585fc90-bb72-4ec9-af43-5b6638b3a806

📥 Commits

Reviewing files that changed from the base of the PR and between 7764586 and 0e6f94d.

📒 Files selected for processing (14)
  • filebeat/input/net/tcp/input_test.go
  • heartbeat/monitors/active/http/http_test.go
  • heartbeat/monitors/active/tcp/tcp_test.go
  • libbeat/testing/available_port.go
  • libbeat/testing/available_port_test.go
  • libbeat/tests/integration/framework.go
  • libbeat/tests/integration/monitoring.go
  • x-pack/filebeat/tests/integration/otel_cel_test.go
  • x-pack/filebeat/tests/integration/otel_kafka_test.go
  • x-pack/filebeat/tests/integration/otel_lsexporter_test.go
  • x-pack/filebeat/tests/integration/otel_test.go
  • x-pack/metricbeat/tests/integration/otel_test.go
  • x-pack/otel/oteltestcol/collector.go
  • x-pack/packetbeat/tests/integration/otel_test.go
💤 Files with no reviewable changes (5)
  • libbeat/testing/available_port_test.go
  • libbeat/testing/available_port.go
  • x-pack/filebeat/tests/integration/otel_cel_test.go
  • x-pack/filebeat/tests/integration/otel_kafka_test.go
  • x-pack/filebeat/tests/integration/otel_lsexporter_test.go

Comment thread filebeat/input/net/tcp/input_test.go
Comment thread x-pack/filebeat/tests/integration/otel_test.go
@orestisfl

Copy link
Copy Markdown
Contributor Author

@elastic/sec-linux-platform @elastic/obs-ds-hosted-services can i get a review please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches cleanup skip-changelog Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:obs-ds-hosted-services Label for the Observability Hosted Services team Team:Security-Linux Platform Linux Platform Team in Security Solution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants