libbeat/testing: use ephemeral ports to avoid TOCTOU collisions#51617
libbeat/testing: use ephemeral ports to avoid TOCTOU collisions#51617orestisfl wants to merge 13 commits into
Conversation
Remove `libbeattesting.MustAvailableTCP4Port` and the related `AvailableTCP4Port`/`AvailableTCP4Ports`/`MustAvailableTCP4Ports` helpers (and their tests). WHY: these helpers found a "free" port by binding `:0`, reading the assigned port, then closing the listener and returning the number. That leaves a time-of-check/time-of-use window: the caller binds the port for real only later, and in between the OS can hand the just-released port to another process. Running the same test many times in parallel (e.g. via `script/stresstest.sh -p 30`) makes those cross-process collisions likely, producing flaky bind failures. WHAT: convert every caller to let the OS assign an ephemeral port at bind time and read the bound port back: - otel integration tests (filebeat/metricbeat/packetbeat) set `http.port: 0` and discover the resolved monitoring port from the "Metrics endpoint listening on:" log line. - New shared helpers do the discovery: `oteltestcol.Collector.MonitoringPort` /`MonitoringPorts` (from the collector's observed logs) and `integration.BeatProc.MonitoringPort` (from the beat process log file), both backed by `libbeat/testing.ParseMonitoringPort` so the log format lives in one place. - packetbeat's test HTTP servers bind ephemeral ports and the sniffer is configured from the resolved port. - the filebeat ES-outage test and the heartbeat connection-refused tests bind `:0` in-process (releasing for the connection-refused / restart cases), removing the early pre-allocation. Verified with `script/stresstest.sh -p 30` (~5 min) on the heartbeat connection-refused tests (0 failures over ~150k/~139k runs) and by running the filebeat/metricbeat/packetbeat otel E2E tests against Elasticsearch.
The test OTel collector defaulted to a Prometheus telemetry reader on a fixed localhost:8888, so two collectors could not run at once. Stress runs (stresstest.sh -p N) of any collector-based integration test therefore collided on that port. Disable the collector's own telemetry metrics by default in oteltestcol.New (service.telemetry.metrics.level: none) unless the test set a level; the tests assert on Elasticsearch documents and the per-receiver HTTP monitoring endpoint, not the collector's own metrics, so this is safe. Collector.Shutdown now blocks until the collector has fully exited so callers can restart a collector that reuses the same path.home without racing teardown. TestNoDuplicates previously polled :8888 to detect that the previous collector had exited before restarting; the blocking Shutdown replaces that, removing the last fixed-port dependency and letting these tests run in parallel.
Small follow-up cleanups with no behavior change: - oteltestcol.New: drop the unreachable `if err != nil` block after require.NoError and fold the WriteFile call into the assertion. - libbeat/testing.monitoring: build the endpoint regex from MonitoringEndpointSnippet via regexp.QuoteMeta so the literal lives in one place. - TestNoDuplicates: drop the redundant t.Cleanup after the collector restart; oteltestcol.New already registers shutdown via t.Cleanup.
oteltestcol.New now merges service.telemetry.metrics.level: none into every test collector config, so the per-config blocks that disabled the collector's own metrics (to avoid the fixed :8888 Prometheus port) are redundant. Remove them. Where a config only set the metrics level, the whole telemetry block is dropped; where it also set telemetry.logs.level, only the metrics entry is removed so the log level is preserved.
🤖 GitHub commentsJust comment with:
|
|
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services) |
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
|
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform) |
📝 WalkthroughWalkthroughThe PR removes the Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@filebeat/input/net/tcp/input_test.go`:
- Line 106: The benchmark and related connection-refused tests still use
ephemeralTCPAddr, which closes the listener before inp.Run binds, leaving a
TOCTOU race on the chosen port. Replace this bind-then-close pattern in
TestInputRunTCP and the other affected tests with a setup that keeps the
listener reserved until the TCP input is ready to accept, or otherwise
coordinate the bind so serverAddr is not exposed prematurely. Use the existing
inp.Run and serverAddr call sites to update the test flow consistently across
this cohort.
In `@x-pack/filebeat/tests/integration/otel_test.go`:
- Around line 1487-1496: The ephemeral port selection in the mock Elasticsearch
setup reintroduces a TOCTOU race by closing the listener before the mock server
owns the port. Update the setup around the port reservation logic in the test
helpers that use createMockServer so the server keeps ownership of the bound
endpoint throughout its lifecycle. Prefer an always-bound mock endpoint and
toggle outage behavior at the handler/transport layer instead of releasing
serverPort early. Apply the same fix anywhere the same port reservation pattern
is duplicated in this test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: f585fc90-bb72-4ec9-af43-5b6638b3a806
📒 Files selected for processing (14)
filebeat/input/net/tcp/input_test.goheartbeat/monitors/active/http/http_test.goheartbeat/monitors/active/tcp/tcp_test.golibbeat/testing/available_port.golibbeat/testing/available_port_test.golibbeat/tests/integration/framework.golibbeat/tests/integration/monitoring.gox-pack/filebeat/tests/integration/otel_cel_test.gox-pack/filebeat/tests/integration/otel_kafka_test.gox-pack/filebeat/tests/integration/otel_lsexporter_test.gox-pack/filebeat/tests/integration/otel_test.gox-pack/metricbeat/tests/integration/otel_test.gox-pack/otel/oteltestcol/collector.gox-pack/packetbeat/tests/integration/otel_test.go
💤 Files with no reviewable changes (5)
- libbeat/testing/available_port_test.go
- libbeat/testing/available_port.go
- x-pack/filebeat/tests/integration/otel_cel_test.go
- x-pack/filebeat/tests/integration/otel_kafka_test.go
- x-pack/filebeat/tests/integration/otel_lsexporter_test.go
|
@elastic/sec-linux-platform @elastic/obs-ds-hosted-services can i get a review please? |
Proposed commit message
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration filesstresstest.shscript to run them under stress conditions and race detector to verify their stability.[ ] I have added an entry in— test-only change,./changelog/fragmentsskip-changelogapplied.Disruptive User Impact
None. The change is limited to test code and helpers.
How to test this PR locally
The flakiness this fixes only shows up under parallel stress, so verify both the unit and integration tests stay green under load.
Unit tests
Both ran ~5 minutes with 0 failures (~150k and ~139k runs respectively).
Integration tests (OTel collectors)
The collectors no longer bind any fixed port, so many can run concurrently against a single Elasticsearch. Bring Elasticsearch up once, then stress each beat's OTel tests in parallel: